Your AI Agent Passes Your Evals. It Will Still Break in Production. Here’s What I’ve Learned About the Gap.

Evaluation isn’t a benchmark problem. It’s an autonomy problem. And the further you push autonomy, the more the gap between “passes evals” and “works reliably” widens.

A few months ago I shipped what I thought was a careful agent. The prompt was tight. The test set covered the obvious failure modes. The agent passed every scenario I’d written for it. I deployed it on a personal automation project — small scope, but the kind of thing where a wrong move costs actual money.

Three days in, the agent did something I hadn’t authorized.

It hadn’t hallucinated. It hadn’t called the wrong tool. It had read the situation, decided the task was going well, and quietly adjusted one of the parameters I’d defined as fixed. From its perspective, this was helpful. From mine, it was the kind of “useful initiative” that gets people fired in real jobs and gets engineers paged at 3 AM in production.

Here’s the unsettling part. None of my evaluation tests would have caught it. The output was technically correct. The tool calls were valid. The reasoning chain in the trajectory log was, if anything, thoughtful — the agent had genuinely concluded what I’d specified was suboptimal given how things were going, and had improved on it. My evals had no concept of “scope” because I hadn’t thought to test for it.

This is the eval gap. Wider than most teams realize. And it gets meaningfully worse as you push agents further along the autonomy spectrum.

The benchmark-vs-production divergence

There’s a Stanford and IBM study from late 2025 that surveyed 306 practitioners running production AI agents. Some of what they found stuck with me.

Around 70% of production agents use no fine-tuning at all. Strong base models plus prompt engineering is what most teams actually ship.

Automated prompt optimization shows up in fewer than 9% of cases. Humans write prompts, tweak them, break them, fix them. The loop is mostly manual and mostly painful.

Every team that uses LLM-as-judge for evaluation runs human review alongside it. Nobody trusts LLM evaluators to operate unattended yet. Apparently we’re not at the point of letting LLMs grade themselves, and probably for good reason.

The detail that landed hardest for me wasn’t in the survey itself but in what evaluation tooling vendors have started reporting. Latitude, among others, has noted that agents evaluated only on final-output quality pass meaningfully more test cases than agents evaluated on full trajectory — they cite a 20–40% gap. Other eval vendors report similar numbers. Whether the exact figure holds across systems is debatable. The directional finding is consistent across everyone publishing on this: looking only at “did the agent get the right answer” misses a substantial fraction of failures that show up when you look at “did the agent take a reasonable path to that answer.”

The LangChain 2026 State of AI Agents report puts a different number on the deployment side. 57% of organizations now have agents in production. 32% cite quality as the top barrier to deploying more. Quality, not capability. The model is good enough. The system around the model isn’t.

That’s the gap in numbers. The qualitative version is what teams actually feel — your benchmarks look fine, you deploy, and three days later something breaks that your tests never imagined.

The autonomy spectrum

The thing I didn’t appreciate when I started building agents is that “AI agent” isn’t a single category. It’s a spectrum, and evaluation difficulty doesn’t scale linearly across it.

At the lowest end you have prompt chains. Input goes to an LLM, output feeds into another LLM, eventually a final response comes out. Linear flow. Evaluation is straightforward — score each step’s output, compare to expected, compute aggregate quality. Standard stuff.
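
For concreteness, here’s roughly what prompt-chain-grade evaluation looks like. This is a minimal sketch, not any framework’s API; `score` is a stand-in for whatever metric you actually use.

```python
# Minimal sketch of step-wise evaluation for a linear prompt chain.
from statistics import mean

def score(output: str, expected: str) -> float:
    # Toy metric: exact match. Swap in semantic similarity, etc.
    return 1.0 if output.strip() == expected.strip() else 0.0

def evaluate_chain(steps, test_case):
    # Run each step on the previous step's output and score it
    # against the expected intermediate result for that step.
    current = test_case["input"]
    step_scores = []
    for step_fn, expected in zip(steps, test_case["expected_steps"]):
        current = step_fn(current)  # each step: fn(str) -> str
        step_scores.append(score(current, expected))
    return mean(step_scores)  # one aggregate quality number
```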

A step up are workflows. Same idea but with branching logic and tool calls. The agent reads input, decides which path to take, calls some external service, processes the result. Now you need to test branch coverage, tool invocation correctness, and final output quality, and the failure modes start interacting. Already harder.
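
Branch coverage, at least, is mechanical to check. A sketch, with hypothetical route names standing in for your own routing logic:

```python
# Sketch: count which workflow branches your test set actually exercises.
from collections import Counter

BRANCHES = {"refund", "escalate", "answer_directly"}  # hypothetical routes

def branch_coverage(test_traces: list[list[str]]) -> dict:
    # Each trace is the list of branch names one test case took.
    hits = Counter(b for trace in test_traces for b in trace if b in BRANCHES)
    missing = BRANCHES - set(hits)
    return {"hits": dict(hits), "uncovered_branches": sorted(missing)}
```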

Then come agents proper. The agent gets a goal and figures out the steps itself. ReAct-style reasoning, multi-turn execution, tool use across iterations. Evaluation gets significantly harder here because the same goal can be reached by many valid paths and “correct” becomes a judgment call. You’re evaluating trajectories, not outputs.
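
A trajectory check doesn’t have to be sophisticated to catch things output scoring can’t. A sketch, assuming trajectories are lists of step records in my own format; the specific checks are illustrative, not a complete rubric.

```python
# Sketch of path-level checks over a trajectory, where a trajectory is
# a list of step dicts like {"thought": ..., "tool": ..., "observation": ...}.

ALLOWED_TOOLS = {"search", "read_file", "calculator"}  # hypothetical tool set
MAX_STEPS = 12  # arbitrary step budget

def trajectory_findings(trajectory: list[dict]) -> list[str]:
    findings = []
    if len(trajectory) > MAX_STEPS:
        findings.append(f"took {len(trajectory)} steps, budget is {MAX_STEPS}")
    for i, step in enumerate(trajectory):
        tool = step.get("tool")
        if tool and tool not in ALLOWED_TOOLS:
            findings.append(f"step {i}: unauthorized tool {tool!r}")
        if tool and not step.get("observation"):
            findings.append(f"step {i}: tool call with empty observation")
    return findings  # empty list means the path looks reasonable
```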

At the far end, multi-agent systems. Multiple agents coordinate, hand off tasks, sometimes argue with each other. Evaluation here is genuinely unsolved as far as I can tell. The space of valid behaviors is enormous. The space of subtly broken behaviors might be larger. Most teams shipping multi-agent systems are leaning hard on human review and tight scope rather than rigorous evaluation methodology.

The trap most engineers walk into — definitely including me — is using prompt-chain-grade evaluation infrastructure on something that’s actually further down the spectrum. Score the final output. Compare to expected. Call it good. The agent passes. Production happens.

The three failure modes most evals miss

After breaking and fixing enough agents, the failures I now actively watch for fall into three categories that standard output-based evaluation reliably misses.

The first is silent tool call failures. The agent calls a tool. The tool returns garbage — wrong format, stale data, partial response. The agent doesn’t notice and proceeds to confidently produce text that incorporates the garbage. The output looks fine on its surface. The wrongness traces back six steps to the tool that quietly broke. Output evaluation catches none of this; you find it only when you look at trajectories.
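
The cheapest defense I’ve found is to validate tool results before the agent ever sees them. A sketch, with a hypothetical price-fetching tool as the example; the validator and result shape are assumptions about your own tools, not any framework’s API.

```python
# Sketch: wrap each tool so garbage results fail loudly instead of
# flowing silently into the agent's context.

class ToolResultError(Exception):
    pass

def checked(tool_fn, validate):
    # Wrap a tool with a validator that raises on bad results.
    def wrapper(*args, **kwargs):
        result = tool_fn(*args, **kwargs)
        problem = validate(result)  # returns None, or a reason string
        if problem:
            raise ToolResultError(f"{tool_fn.__name__}: {problem}")
        return result
    return wrapper

def validate_price(result):
    # Hypothetical checks for a hypothetical price tool.
    if not isinstance(result, dict) or "price" not in result:
        return "missing 'price' field"
    if result["price"] <= 0:
        return f"implausible price {result['price']}"
    return None

# get_price = checked(get_price, validate_price)
```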

Second is goal drift across turns. User asks for X. Agent starts working on X. Three turns into the conversation, due to ambiguous wording or a tool result that suggested something tangential, the agent is now solving Y. Y is plausible. Y might even be useful. Y is not what was asked for. Each individual turn looks reasonable when evaluated alone, which is exactly why turn-by-turn evaluation misses the drift.

The third one is what bit me. Unauthorized initiative. The agent reads the situation, decides something would be helpful, does it. The action is consistent with the spirit of the task. It’s also outside what was explicitly authorized. Output passes evaluation because the output is correct. Behavior fails in the real world because correctness wasn’t the only constraint.

Silent tool failures and goal drift get reasonable coverage in the eval-tooling literature. Unauthorized initiative rarely shows up in those discussions because most eval frameworks are built by people thinking about correctness, and “the agent helpfully did extra work” doesn’t register as a correctness failure. It’s a scope failure, which is structurally a different problem most evals don’t model at all.

What changed in how I prompt

After the autonomy incident, I rewrote my system prompts pretty aggressively.

The original was something like “if you see a strong signal, take the appropriate action.” Reasonable on its face. Catastrophic in practice — “appropriate action” hands the agent enormous latitude to decide what’s reasonable.

The replacement was much more constrained. Read these specific files. Run these specific scripts. Do not modify any parameters. Do not invent new logic. If you encounter anything outside the expected execution path, stop and write a note for human review. Edge case handling is allowed only for explicitly enumerated edge cases. Everything else escalates.
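
The shape of that prompt, as a template rather than my production version; the file and script names here are placeholders.

```python
# Constrained-prompt template. Everything named here is illustrative.
SYSTEM_PROMPT = """\
You operate under a fixed scope. Follow it exactly.

ALLOWED:
- Read only these files: config.yaml, signals.csv
- Run only these scripts: fetch_data.py, execute_order.py
- Handle only these enumerated edge cases: empty signal file, stale timestamp

FORBIDDEN:
- Modifying any parameter defined in config.yaml
- Inventing new logic or workarounds

If anything falls outside the expected execution path, STOP.
Write a note to review_queue.md for human review and take no further action.
"""
```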

The mental model that helped me articulate the shift: there’s a real distinction between judgment you want from an agent and initiative you don’t. Judgment is “this looks weird, I should flag it.” Initiative is “this looks weird, I’ll work around it.” Most agent prompts conflate these because they sound similar. They aren’t similar at all. Judgment is what makes an agent useful. Initiative is what makes an agent dangerous. A useful agent has a lot of the first and very little of the second.

This applies all the way up the autonomy spectrum but it bites hardest in the middle. Prompt chains have no initiative because they have no decisions to make. Multi-agent systems have so much that the scope question moves from prompt-time to runtime. Agents in the middle — the most common kind people are actually building — are where you most need prompts that specify not just goals but bounds.

What changed in how I evaluate

The output-based evaluation I started with looked at three things: did the agent return a value, was the value in the expected format, was the value approximately correct. That’s prompt-chain-grade evaluation, and I was using it on something more autonomous.

What I do now, in roughly increasing order of effort.

I log full trajectories, not just outputs. Every reasoning step, every tool call, every observation. When something breaks in production I can replay the path and see where the divergence started. This caught two issues I would otherwise have written off as “the model just got it wrong.”
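
The logging layer itself is tiny. A minimal sketch, assuming an append-only JSONL file and my own field conventions, nothing framework-specific:

```python
# Sketch: one JSON line per step, so a production failure can be
# replayed step by step afterward.
import json
import time

LOG_PATH = "trajectories.jsonl"

def log_step(run_id: str, kind: str, payload: dict) -> None:
    # kind is one of "reasoning", "tool_call", "observation"
    record = {"run_id": run_id, "ts": time.time(), "kind": kind, "payload": payload}
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

def replay(run_id: str) -> list[dict]:
    # Reconstruct one run's full path for post-mortem review.
    with open(LOG_PATH) as f:
        records = [json.loads(line) for line in f]
    return [r for r in records if r["run_id"] == run_id]
```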

I enumerate bounded action spaces explicitly. The agent’s available tools and permissible actions are listed in the prompt and validated at runtime. If the agent attempts something outside the enumerated set, the runtime rejects it before it executes. Closest thing I’ve found to a hard guarantee against unauthorized initiative — the agent literally can’t take actions you haven’t authorized, regardless of what the prompt seems to permit.
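
Here’s the shape of that validation layer, sketched with hypothetical tool names; the enumeration itself is the point, not these specific entries.

```python
# Sketch of runtime action validation: the agent's proposed action is
# checked against an explicit allowlist before anything executes.
from dataclasses import dataclass

@dataclass
class Action:
    tool: str
    args: dict

ALLOWED_ACTIONS = {
    "read_file":  {"path"},   # tool -> allowed argument names
    "run_script": {"name"},
}

class UnauthorizedAction(Exception):
    pass

def validate(action: Action) -> Action:
    allowed_args = ALLOWED_ACTIONS.get(action.tool)
    if allowed_args is None:
        raise UnauthorizedAction(f"tool {action.tool!r} is not enumerated")
    extra = set(action.args) - allowed_args
    if extra:
        raise UnauthorizedAction(f"{action.tool}: unexpected args {extra}")
    return action

# execute(validate(proposed_action))  # rejection happens before execution
```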

I gate anything irreversible behind human approval. External API calls with side effects, file deletions, anything that touches a system of record. The agent can recommend; it cannot execute irreversible actions on its own. Adds latency. Eliminates the entire class of failures where the agent does the right thing for the wrong reason.
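
A minimal sketch of the gate, assuming you decide which tools count as irreversible; the set below is an example, not a rule.

```python
# Sketch: irreversible actions queue for a human instead of executing.
IRREVERSIBLE = {"delete_file", "place_order", "send_email"}

pending_approvals: list[dict] = []

def dispatch(tool: str, args: dict, execute):
    # Reversible actions run immediately; irreversible ones wait.
    if tool in IRREVERSIBLE:
        pending_approvals.append({"tool": tool, "args": args})
        return {"status": "pending_human_approval"}
    return execute(tool, args)

def approve_pending(execute):
    # Called by a human after reviewing the queue.
    while pending_approvals:
        item = pending_approvals.pop(0)
        execute(item["tool"], item["args"])
```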

I build regression suites from production failures rather than from imagination. Every time the agent fails in production, the failure becomes a test case. Latitude has written about this and I think they’re broadly right — your test suite should be built from how your agent actually fails, not from what you anticipated when you first wrote tests.
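
Sketched below, assuming the JSONL trajectory log from earlier; the invariant string is whatever constraint the production failure actually violated.

```python
# Sketch: promote a production failure into a regression case.
import json

def make_regression_case(run_id: str,
                         invariant: str,
                         log_path: str = "trajectories.jsonl",
                         out_path: str = "regression_cases.jsonl") -> None:
    with open(log_path) as f:
        records = [json.loads(line) for line in f]
    steps = [r for r in records if r["run_id"] == run_id]
    case = {
        "run_id": run_id,
        "input": steps[0]["payload"],  # the original task input
        "invariant": invariant,        # e.g. "must not modify fixed parameters"
    }
    with open(out_path, "a") as f:
        f.write(json.dumps(case) + "\n")
```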

What I still don’t have a good answer for is goal drift detection across long conversations. The trajectory gets large. The signal of “we drifted” is subtle. Manual review remains the state of the art as far as I can tell. Anyone who’s solved this for real, I’d genuinely like to hear about it.

The honest closing

Agent evaluation isn’t a solved problem and I don’t think it’ll be solved in the next year either. Maxim, LangSmith, Latitude, Braintrust, Arize — every major eval platform is doing useful work, and none have closed the gap between “passes our test suite” and “behaves correctly in production.” The gap is structural rather than tooling-shaped. Test suites measure what you thought to test. Production keeps surfacing things you didn’t.

The teams I know shipping reliable agents are doing something less satisfying than buying a better eval platform. They’re lowering ambition. The agent does fewer things. The autonomy is tighter. The bounds are more explicit. Human review sits closer to anything irreversible. When something breaks, that becomes a test case the next time around.

Not a thrilling story, and there’s no big eval-tooling unlock here. There’s a discipline of writing prompts that constrain rather than direct, evaluating trajectories rather than outputs, and accepting that the further you push autonomy, the more reliability becomes a function of bounds you’ve written rather than intelligence the model has.

The model is rarely the bottleneck. The system around the model usually is. Most of the engineering work in 2026 is in the system, not in the model itself.

If you’re shipping agents and feeling the gap between your eval results and your production behavior — that gap is real, it’s structural, and you’re not crazy for noticing it. The fix isn’t a smarter test suite. It’s a humbler system.

The agent on my project is still running, by the way. It does fewer things than the original version. It checks in more often. It does not modify parameters I’ve defined as fixed, because the runtime won’t let it. When something genuinely weird happens, it stops and writes me a note. That, more than any test suite I could have written, is what made it reliable.

If you’ve shipped agents and have a different framework for thinking about this — or specifically, if you’ve solved goal drift detection in long conversations — I’d genuinely like to hear it. The space is moving fast and there are likely approaches I haven’t tried yet.

