Making AI work through eval hygiene

Anthropic, of all companies, just shipped three quality regressions in Claude Code that its own evals didn’t catch. Think about that. Three regressions in the span of six weeks, from the most sophisticated eval shop in AI. If this can happen to Anthropic, it can certainly happen to you, and it likely will.

In a refreshingly candid postmortem, Anthropic walked through what went wrong. On March 4, the team flipped Claude Code’s default reasoning effort from high to medium because internal evals showed only “slightly lower intelligence with significantly less latency for the majority of tasks.” On March 26, a caching optimization meant to clear stale thinking after an hour of idleness shipped with a bug that cleared it on every turn instead. On April 16, two innocuous-looking lines of system prompt asking Claude to be more concise turned out to cost 3% on coding quality, a drop that showed up only on a wider ablation suite that wasn’t part of the standard release gate.

From inside the org, none of it tripped an alarm. Users, however, started complaining almost immediately. The lesson isn’t that Anthropic is careless. It’s that AI quality is slippery even for teams that obsess over measurement. For everyone else, vibes are a liability. So how can we fix this?

Stop shipping vibes

Andrej Karpathy coined the term “vibe coding” for the process of describing what you want, letting the model toil away, and trying not to look too closely at the resulting mess. That’s fine for prototypes, but it’s a terrible way to build production software. Unit tests, integration tests, regression suites, canary deploys: None of these became standard because developers love ceremony. They became standard because eventually the cost of guessing exceeded the cost of measuring.

AI is finally getting there, and Anthropic’s postmortem is the clearest signal yet that even the people building the underlying models can’t get away with shipping by feel. A lot of AI eval talk goes wrong by treating evals as a fancy new kind of test suite. They are, but only partly. A good eval is an argument about what quality means for your application. It forces a team to say, in advance, what good behavior looks like, what failure looks like, what trade-offs are acceptable, and what variance the business can tolerate.

The variance part is where most teams underestimate the problem. Anthropic’s eval guidance for agents draws a useful distinction between pass@k (the agent succeeds at least once across k tries) and pass^k (it succeeds every time across k tries). An internal triage tool that needs one good answer after a couple of retries can live with pass@k; a customer-facing workflow can’t. If a task succeeds 75% of the time, the odds of three consecutive successes are only about 42% (0.75³ ≈ 0.42).
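To see the math, here’s a quick back-of-the-envelope sketch in Python, assuming independent trials (which real agents only approximate):

```python
# Back-of-the-envelope math for pass@k vs. pass^k, assuming each
# trial succeeds independently with probability p.

def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success across k independent tries."""
    return 1 - (1 - p) ** k

def pass_hat_k(p: float, k: int) -> float:
    """Probability of succeeding on every one of k independent tries."""
    return p ** k

p = 0.75
print(f"pass@3 = {pass_at_k(p, 3):.0%}")   # ~98%: fine if retries are acceptable
print(f"pass^3 = {pass_hat_k(p, 3):.0%}")  # ~42%: the demo-vs-product gap
```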

That isn’t a meaningless rounding error. It’s the difference between a demo and a product.

The other thing breaking the old playbook is that AI breaks the assumption traditional automation rests on. Angie Jones, who used to run AI tools and enablement at Block and now manages developer experience at the Agentic AI Foundation, has long argued that classical test automation assumes “the exact results must be known in advance” so you can assert against them. With machine learning, “there is no exactness, there is no preciseness. There’s a range of possibilities that are valid.” She is equally direct about the developer side: “Vibe coding is cute and all, but it’s risky when you’re building production apps. Just because we’re using new methods doesn’t mean our old ones are obsolete.”

She’s exactly right. AI doesn’t eliminate engineering discipline. Instead, it raises the price of overlooking it.

Anthropic’s own guidance reflects all of this. Agents are “fundamentally harder to evaluate” than single-turn chatbots because they operate over many turns, call tools, modify external state, and adapt based on intermediate results. And so the guidance is to grade outcomes, transcripts, tool calls, cost, and latency as separate dimensions, while running multiple trials and keeping capability evals cleanly separated from regression evals (which should hold near 100% and exist to prevent backsliding).
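To make that concrete, here’s a minimal sketch of what grading along separate dimensions might look like. The field names and structure are mine, not Anthropic’s:

```python
from dataclasses import dataclass

@dataclass
class TrialGrade:
    """One agent trial, graded along separate dimensions, never one blended score."""
    outcome_ok: bool         # did the final external state match expectations?
    transcript_score: float  # judged quality of the agent's reasoning, 0.0-1.0
    tool_calls_ok: bool      # did the agent take a sane tool path?
    cost_usd: float
    latency_s: float

def summarize(trials: list[TrialGrade]) -> dict:
    n = len(trials)
    return {
        "pass^k": all(t.outcome_ok for t in trials),  # regression-style: must hold near 100%
        "pass@k": any(t.outcome_ok for t in trials),  # capability-style: one success counts
        "avg_transcript": sum(t.transcript_score for t in trials) / n,
        "avg_cost_usd": sum(t.cost_usd for t in trials) / n,
        "worst_latency_s": max(t.latency_s for t in trials),
    }
```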

The improvement loop

The shape of a working improvement loop is starting to converge across vendors. LangChain’s April update shipped more than 30 evaluator templates covering safety, response quality, trajectory, and multimodal outputs, plus cost alerting and a serious push toward human judgment in the agent improvement loop. Karpathy’s autoresearch experiment, in which an agent ran 700 experiments over two days against its own training code with binary keep-or-revert decisions, makes the same point in a different way: Most AI developers underinvest in measurement, and the eval is the product.

Strip away the tools and the loop is simple: Production complaint becomes trace, trace becomes failure mode, failure mode becomes eval, eval becomes regression test, and regression test becomes release gate. Then, and only then, do you change the prompt, swap the model, adjust the retrieval strategy, or tune the cost/latency trade-off.
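In code, the loop’s skeleton might look something like this. The names are illustrative, not any vendor’s API; what matters is the ordering, with the gate coming before any change ships:

```python
# A sketch of the loop with illustrative names (not a real framework):
# complaint -> trace -> failure mode -> eval case -> regression gate.

def complaint_to_case(complaint: str, trace_id: str, repro_input: str) -> dict:
    """Pin a user complaint to its production trace and distill a reproducible case."""
    return {"failure_mode": complaint, "trace": trace_id, "input": repro_input}

def regression_pass_rate(cases: list[dict], run_and_grade) -> float:
    """Replay every case against the current agent; run_and_grade returns True/False."""
    return sum(1 for c in cases if run_and_grade(c["input"])) / len(cases)

def may_ship(cases: list[dict], run_and_grade, floor: float = 0.98) -> bool:
    """Only after the gate holds do you touch prompts, models, or retrieval."""
    return regression_pass_rate(cases, run_and_grade) >= floor
```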

By contrast, most teams are doing this loop in reverse, or not at all. That’s bad.

Nor does the charade many teams currently perform help. A team buys into LangSmith (good!), wires up a few trajectory evaluators, points an LLM-as-judge at outputs, and ships a green dashboard. Seems great, right? The dashboard is green, therefore the agent is good. Well… You can spoof a dashboard, but you can’t spoof what users actually experience. So someone in product review says, “The agent feels dumber.” Because it is. Pointing at the dashboard and insisting, “But the evals are green,” does nothing but demonstrate denial at scale.

Bad evals create false confidence, which is worse than no confidence. If your evals are too narrow, teams optimize to them. If your graders are brittle, they punish valid solutions and reward shallow compliance. If you rely entirely on LLM-as-judge without calibration against human review, you’ve moved the vibes one level down without removing them. If your eval set never changes, it becomes a graveyard of old assumptions.

Notice what’s missing from a good eval: “Did the answer sound good?” Sounding good is the easiest thing modern models do; it’s exactly what probabilistic systems designed to mimic truth without actually knowing it are built for. It’s also the least useful quality signal you can collect. A confident agent that took the wrong tool path is dangerous.

One of the more interesting parts of the Anthropic postmortem is that the regressions came from sensible changes. Reducing latency is good. Reducing verbosity can be. Ditto better caching. Nobody sits in a product meeting and says, “Let’s make the coding agent worse.” They say, “Users hate waiting” or “We’re burning too many tokens,” and they’re right. But being right about the goal doesn’t justify shipping a regression to reach it.

This is why AI teams need to stop treating quality, latency, and cost as a single blended metric. These are trade-offs, not synonyms. For example, a concise answer may be better for a status update but worse for a code review. Similarly, a lower-effort reasoning mode may be perfect for boilerplate but damaging for multi-file refactors. A cost optimization should have to prove it didn’t damage quality, and a prompt change should have to prove it didn’t damage behavior.
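One way to enforce that is to gate each dimension separately, so a win on one axis can never buy back a loss on another. A rough sketch, with made-up thresholds:

```python
# Gate quality, latency, and cost separately; a cost or latency win
# never gets to "pay" for a quality loss. Thresholds are invented for illustration.

def change_is_shippable(before: dict, after: dict) -> bool:
    quality_ok = after["quality"] >= before["quality"]             # no quality drop, period
    latency_ok = after["latency_s"] <= before["latency_s"] * 1.10  # <=10% latency regression
    cost_ok = after["cost_usd"] <= before["cost_usd"] * 1.05       # <=5% cost regression
    return quality_ok and latency_ok and cost_ok
```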

So what should we do?

If you’re a tech leader sitting at the intersection of “we have an agent in production” and “we’re not sure our evals are doing anything,” there’s hope, and there are clear guidelines for what to do next.

First, treat user complaints as your most valuable eval input. Every Slack message that says “Claude got dumber” or “the agent forgot what we just told it” is a test case waiting to be written. Anthropic’s mistake wasn’t a lack of eval infrastructure. It was the two-week lag between user signal and eval coverage. If your fastest path from production complaint to regression suite is measured in weeks, you have a process problem, not a tool problem.

Second, write fewer, better evals, and read every transcript. Anthropic’s recommendation of 20 to 50 tasks drawn from real failures is the right shape, because you don’t need a thousand synthetic test cases. You just need a few dozen pulled from production incidents, graded with a mix of code-based checks (for what you can deterministically verify) and LLM-as-judge calibrated against human review (for what you can’t).
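A rough sketch of that grading mix, where `llm_judge` stands in for whatever judging call you use, and is only trusted because its scores have been calibrated against human labels:

```python
import re

def deterministic_checks(output: str) -> bool:
    """Code-based: things you can verify exactly, e.g. no leaked keys, no TODO stubs."""
    return "AKIA" not in output and not re.search(r"\bTODO\b|\bFIXME\b", output)

def judged_score(output: str, rubric: str, llm_judge) -> float:
    """LLM-as-judge, returning 0.0-1.0; useless unless calibrated against human review."""
    return llm_judge(output=output, rubric=rubric)

def grade(output: str, rubric: str, llm_judge) -> bool:
    return deterministic_checks(output) and judged_score(output, rubric, llm_judge) >= 0.7
```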

Third, be sure to encode your product’s values in the eval. If you’re building a coding assistant, then you care about passing tests, preserving style, avoiding security mistakes, and not bulldozing through a repo. If you’re building a customer-support agent, by contrast, your concern shifts to factuality, tone, escalation, policy compliance, resolution rate, and whether the system created new problems while solving the old one. Generic “helpfulness” graders won’t capture any of that.
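The rubric itself can be a product artifact. These dimensions and thresholds are invented for illustration, but they show how differently the two products grade:

```python
# Minimum acceptable score per dimension (0.0-1.0); 1.0 means a hard requirement.
CODING_ASSISTANT_RUBRIC = {
    "tests_pass": 1.0,
    "style_preserved": 0.8,
    "no_new_security_issues": 1.0,
    "diff_scoped_to_task": 0.9,  # didn't bulldoze the repo
}

SUPPORT_AGENT_RUBRIC = {
    "factually_grounded": 1.0,
    "tone_within_policy": 0.8,
    "escalated_when_required": 1.0,
    "resolved_without_new_issues": 0.9,
}
```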

Fourth, make regression a release gate, instead of a release report. If a change drops a regression score, don’t ship the change. As I’ve argued before, the agents that survive in the enterprise are the ones that do a few things reliably and predictably, and the only way you get there is by refusing to deploy anything that breaks what already worked.

Finally, write the eval before the prompt. You need to be able to articulate what good looks like before you start tweaking the system. The prompt is the means to an end, and the eval captures that end in advance.

Moving beyond demo-ware

We’re still so early in our AI engineering journeys that perhaps we can be forgiven for mistaking vibes-driven demos for progress. However forgivable now, this won’t last. As Jones recently put it, “a lot of the problems people blame on AI are actually problems that always existed, AI just amplified them.” Evals are how you stop amplifying them.

The teams that win the next phase of this won’t be the ones with the most elaborate eval dashboards; they’ll be the ones with the most honest feedback loops. They’ll know which failures matter, and when a model upgrade helped and when it quietly broke a workflow. They’ll know, in short, when their agent is actually getting better.

Evals aren’t sexy, but they lead to sexy, production-ready systems.
