Every serious AI product team now has some version of an evaluation framework — a battery of tests that runs automatically, scores model outputs, and tells you whether your latest change made things better or worse. Evals are good. We should have them. But there's something important missing from the conversation, and it's the thing that evals can't provide: a hypothesis.
An eval tells you what happened. It does not tell you why, or whether you were asking the right question to begin with. A team can run thousands of evals and still be flying blind, because the underlying reasoning — the model of the world that says "we believe this change will improve this outcome because of this mechanism" — was never made explicit.
This isn't a new problem. It's the same one that plagued analytics-heavy product teams years before AI came along. Teams that invested heavily in dashboards often ended up with the same pattern: rich data, impoverished understanding. You can know that users drop off at step three, that session length decreased, that feature X has 20% adoption — and still have no idea why any of it is happening, or what to do next. The numbers are real. The interpretation is guesswork.
The hypothesis is what connects measurement to understanding. Without it, you're turning knobs and watching the needle.
What a good hypothesis actually looks like
A hypothesis isn't a guess — it's a structured claim: if we do X, then Y will happen, because Z. The Z is the key part. It's the mechanism. It's the theory of why the change should produce the effect. Without the mechanism, you're just predicting an outcome with no model behind it. You might be right. You won't know when you'll be wrong.
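The if/then/because structure is concrete enough to write down as a record. A minimal sketch in Python — the field names and the example values are mine, purely illustrative, not from any framework:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    change: str      # X — what we will do
    prediction: str  # Y — what we expect to happen
    mechanism: str   # Z — why we expect it to happen

    def __str__(self) -> str:
        return (f"If we {self.change}, then {self.prediction}, "
                f"because {self.mechanism}.")

h = Hypothesis(
    change="surface account context at step three",
    prediction="task completion will rise",
    mechanism="users currently abandon when asked to decide without that context",
)
print(h)
```

The point of the structure is that the `mechanism` field cannot be left blank: if a team can fill in X and Y but not Z, that itself is the finding.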
In AI product development, this matters more than most people realize, because the iteration surface is enormous. You can change the prompt, the context window, the retrieval strategy, the response format, the model, the fine-tuning data — any of these, in any combination. Without hypotheses, you're doing combinatorial exploration dressed up as engineering. You might find something that works. You won't know why. And you definitely won't know when it will stop working.
The research parallel is useful here. Good user research starts with a research question, not a research method. "What should we build for enterprise users?" is not a research question. "Do enterprise users treat this task as a checklist or a judgment call, and how does that change what they need from the system?" is a research question. It has a specific thing it's trying to learn, which means you can design a study that produces an answer rather than just an observation. Evals are a method. They need questions behind them.
Hypotheses as a forcing function
The discipline of forming hypotheses is also a discipline of prioritization. If you can't write down what you believe will happen and why, you probably don't have a clear enough view of the problem to be building yet. Forcing the hypothesis often surfaces disagreement within a team that was previously hidden by vague alignment — people thought they were working toward the same thing, but their mental models of why it would work were entirely different.
One practice I've found useful in product reviews: ask teams to narrate their hypotheses for a proposed change, not just their goals. "We want to improve task completion" is a goal. "We believe users are abandoning the task at step three because they're being asked to make a decision they don't have the context for, and if we surface that context earlier, completion will go up" is a hypothesis. The difference is accountability. The hypothesis can be tested. The goal can only be hoped for.
What eval culture gets right — and what it misses
The best eval cultures I've seen treat every eval run as the confirmation or refutation of a hypothesis, not just a pass/fail score. They write down what they expected before the run. They discuss the delta between expectation and outcome. They update their model of the system based on what they learned. It sounds slow. It's actually fast, because you stop making the same wrong turns twice.
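That practice can be made mechanical. Here is a minimal sketch of the idea — the function names, the score scale, and the tolerance threshold are all assumptions of mine, not any particular eval framework's API:

```python
from datetime import datetime, timezone

def record_expectation(run_id, hypothesis, expected_score, log):
    """Write down what we expect BEFORE the eval runs."""
    log[run_id] = {
        "hypothesis": hypothesis,
        "expected": expected_score,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }

def record_outcome(run_id, actual_score, log, tolerance=0.02):
    """Compare the outcome to the expectation and classify the delta."""
    entry = log[run_id]
    delta = actual_score - entry["expected"]
    if abs(delta) <= tolerance:
        verdict = "as predicted"           # the model of the system holds
    elif delta > 0:
        verdict = "better than predicted"  # mechanism may be stronger than assumed
    else:
        verdict = "worse than predicted"   # mechanism is suspect; investigate first
    entry.update({"actual": actual_score, "delta": delta, "verdict": verdict})
    return entry

log = {}
record_expectation("run-042", "earlier context lifts completion",
                   expected_score=0.78, log=log)
result = record_outcome("run-042", actual_score=0.71, log=log)
print(result["verdict"])  # worse than predicted
```

The discipline lives in the ordering: `record_expectation` must be called before the run, so the prediction can't be quietly revised once the number is in.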
Evals will keep getting better — faster, cheaper, more comprehensive. That's genuinely good news. But the limiting factor was never the measurement tooling. It was the thinking behind what to measure and why. That's the part that requires humans, and specifically humans who have developed the habit of asking: what do we actually believe, and how would we know if we were wrong?
Evals without hypotheses are a tool for optimization. Evals with hypotheses are a tool for understanding. The industry has gotten very good at the former. The latter is where the real leverage is.