Why experienced developers move slower with AI coding tools — and how explainability is the one fix nobody is shipping.
A 2025 randomized controlled trial by METR found that experienced developers working on mature codebases were 19% slower with AI tools — yet believed they were 20% faster. The culprit isn’t the AI itself. It’s the invisible cognitive tax of verifying code you didn’t write. Stack Overflow’s 2025 survey confirmed developer trust in AI tools has fallen to just 29% — down 11 points in a single year. The fix isn’t more features or faster models. It’s explainability: giving developers the ability to see why the AI suggested what it did, not just what it suggested.
There’s a version of the AI coding story that everyone in the industry has agreed to tell. Developers are faster. Boilerplate disappears. Junior engineers punch above their weight. Productivity curves bend upward. It’s compelling, it’s everywhere, and — for a specific kind of developer doing a specific kind of work — it might even be true.
Then a group of researchers quietly ran an actual experiment. Not a survey. Not a vendor benchmark. A randomized controlled trial — the kind of rigorous methodology medicine uses to test whether drugs work. And the results did not fit the story.
Study Reference — METR, July 2025
16 experienced open-source developers. 246 real tasks. Randomly assigned to “AI allowed” or “AI disallowed.” Developers worked on their own mature repositories — codebases they had contributed to for an average of five years, with over one million lines of code and 22,000+ GitHub stars. Tools: primarily Cursor Pro with Claude 3.5/3.7 Sonnet — the frontier models at the time. Screen recordings validated self-reported times. The result: developers using AI tools took 19% longer to complete tasks than those working without them.
What makes this study different from the dozens of productivity reports floating around isn’t just the methodology. It’s who the participants were. These weren’t developers being handed an unfamiliar codebase and a new tool. These were project owners — people who had been living inside these repositories for years. The AI tools were supposed to help them. Instead, they slowed them down.
The Perception Gap Is the Real Story
The 19% slowdown is striking enough. But the number sitting right next to it is the one that should keep engineering leaders up at night.
Before the study began, developers were asked to estimate how much AI would speed them up. They predicted a 24% improvement. After completing the study — after actually experiencing the slowdown — they were asked again. Their revised estimate: AI had helped them be 20% faster.
They were wrong. In both directions. And they couldn’t tell.

This is a known cognitive pattern. When a tool feels productive — when tokens stream, when a block of code appears instantly, when the autocomplete is fluent and confident — the brain registers forward momentum. It mistakes the sensation of speed for actual speed. The AI is doing something visible and impressive. Therefore, we must be going faster.
But software engineering isn’t typing. It’s thinking. And the AI, for all its fluency, was generating code that required more thinking to verify than the developers would have spent writing it themselves.
“The developer receives a block of code and must reverse-engineer the logic to verify its correctness. This review phase often takes longer than the creation phase, especially for experienced developers who type fast but read code critically.”
The Four Hidden Taxes
The METR researchers, along with subsequent analysis from Baytech Consulting and code quality studies, identified the mechanisms behind the slowdown. None of them are obvious when you’re in the middle of a coding session. All of them compound.
1. The Reviewer’s Burden — When you write code yourself, you hold the entire mental model in working memory as you construct it. When AI generates it, you must reverse-engineer that model from the output. For complex code, this reconstruction takes longer than construction — and it breaks the flow you were in.
2. The Context-Switching Tax — Flow-state research shows context switching incurs measurable time penalties. The Prompt → Wait → Review → Debug cycle repeatedly breaks the mental model of the code. Experienced developers rely on flow more than anyone — and pay the highest price when it’s interrupted.
3. The Confidence Trap — AI code looks polished. It follows naming conventions, uses fluent syntax, and reads like it was written by a competent developer. A CodeRabbit analysis found AI-authored PRs produce 1.4–1.7× more critical and major findings than human PRs — precisely because surface-level correctness masks deeper errors.
4. The Deep Context Gap — Models infer code patterns statistically, not semantically. In the mature, complex repositories used in the METR study, the AI had no feel for the system’s history, undocumented constraints, or architectural decisions made three engineers and four years ago. Developers had to supply that context manually — every time.
The Trust Data Makes It Worse
Here’s what makes this more than an interesting study result: it’s showing up in how developers actually feel about these tools, at scale.
Stack Overflow’s 2025 developer survey found that while AI tool usage has climbed to 84% — nearly universal adoption — developer trust in those tools has fallen sharply. Only 29% of developers say they trust AI-generated output, down 11 percentage points from 2024. That’s a counterintuitive trajectory. Normally, familiarity with a tool builds confidence in it. You learn the edge cases. You build a mental model of what it can and can’t do.
That’s not what’s happening here. The more developers use these tools, the less they trust them. A 2025 CodeRabbit analysis of 470 pull requests found AI-authored changes produced nearly 11 issues per PR compared to 6.5 for human-only PRs. Developers are learning — the hard way — that the confidence in AI’s output is inversely correlated with how carefully you’ve looked at it.
This Is an Explainability Problem
Let’s be precise about what’s actually happening in each of those four hidden taxes. The reviewer’s burden exists because the developer has no visibility into why the AI chose this approach over the alternatives. The confidence trap persists because there’s no signal distinguishing “AI is certain” from “AI is guessing.” The deep context gap is painful because there’s no way to see what context the AI actually used — or failed to use — when generating the suggestion.
Every single tax is, at its core, a transparency problem. The developer is handed output with no reasoning. They have to treat it like code written by an anonymous colleague who left no comments, no commit message, and no architectural notes.
Now imagine that same code, but with a sidebar: The AI chose this pattern because it matched three similar functions in your codebase. It’s uncertain about the error-handling path — there were conflicting examples in the context. It did not find relevant documentation for the third-party API call and defaulted to a common pattern that may be deprecated.
That’s not a nice-to-have. That’s the difference between spending 45 minutes debugging a plausible-but-wrong suggestion and spending 4 minutes making an informed decision to accept or reject it.

What Explainability Actually Looks Like for Code
Explainability in the context of AI code generation isn’t an abstract academic concept. It breaks down into several concrete mechanisms, each of which directly addresses one of the hidden taxes above.
Confidence signals, not just completions
The most immediate change is surfacing what the AI is uncertain about. A September 2025 OpenAI paper established that LLMs hallucinate because training rewards confident guessing over acknowledging uncertainty. The model has no internal incentive to say “I’m not sure about this.” The interface layer can supply that incentive — by requiring the model to produce calibrated uncertainty alongside its output. Not a raw probability score, but a human-readable signal: high confidence, well-supported by your codebase vs. uncertain — limited context for this pattern.
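What such a signal layer could look like in code is easy to sketch. The `Suggestion` shape, the threshold values, and the label wording below are illustrative assumptions, not any vendor’s actual API — the point is only that calibrated numbers get translated into the human-readable signal described above:

```python
# Sketch: mapping a model-reported, calibrated confidence estimate to a
# human-readable signal. Thresholds and field names are assumptions.
from dataclasses import dataclass

@dataclass
class Suggestion:
    code: str
    confidence: float          # calibrated 0.0-1.0, supplied by the model layer
    supporting_examples: int   # matching patterns found in the user's codebase

def confidence_label(s: Suggestion) -> str:
    """Translate raw numbers into the signal a reviewer actually needs."""
    if s.confidence >= 0.85 and s.supporting_examples >= 3:
        return "high confidence, well-supported by your codebase"
    if s.confidence >= 0.6:
        return "moderate confidence: review the flagged paths"
    return "uncertain: limited context for this pattern"

print(confidence_label(Suggestion("def retry(): ...", 0.52, 0)))
# -> uncertain: limited context for this pattern
```

The hard part is not this mapping but producing a genuinely calibrated `confidence` value in the first place; the interface work is trivial once the model layer supplies it.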
Reasoning traces as the review layer
Chain-of-thought prompting — asking the model to think step-by-step before answering — produces a reasoning trace that serves a dual purpose. It makes the model’s logic visible, and it makes it auditable. A developer who disagrees with a step in the reasoning can pinpoint exactly where the model went wrong, rather than treating the entire output as a black box to accept or reject wholesale. For AI code tools, surfacing these traces inline with the suggestion reduces the reviewer’s burden dramatically.
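A minimal version of this is a prompting convention plus a parser: instruct the model to emit its reasoning under one header and the code under another, then split the response so the trace can be rendered beside the suggestion. The delimiter convention below is an assumption for illustration; any structured output format would do:

```python
# Sketch: separating a chain-of-thought trace from the code it justifies,
# so the trace can be shown inline as a review layer. The REASONING:/CODE:
# delimiter convention is a hypothetical choice, not a standard.
COT_INSTRUCTION = (
    "Before writing any code, explain your reasoning step by step "
    "under a 'REASONING:' header, then emit the code under a 'CODE:' header."
)

def split_trace(model_output: str) -> tuple[str, str]:
    """Return (reasoning_trace, code) from a delimited model response."""
    reasoning, _, code = model_output.partition("CODE:")
    return reasoning.removeprefix("REASONING:").strip(), code.strip()

sample = "REASONING:\n1. The caller expects a sorted list.\nCODE:\nreturn sorted(xs)"
trace, code = split_trace(sample)
print(trace)   # the auditable reasoning, rendered beside the suggestion
print(code)
```

A developer who spots a wrong step in `trace` can reject the suggestion at that step, instead of debugging the code downstream.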
Context attribution
One of the most underappreciated explainability techniques for code is simply showing the developer what context the AI used to generate the suggestion. Which files it considered. Which patterns it matched against. Which parts of the specification (if any) anchored the output. This is especially critical in the large, mature codebases where the METR study found the greatest slowdown — repositories rich with implicit context that a statistically reasoning model can’t automatically surface.
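Concretely, attribution is just metadata attached to each suggestion. The field names and rendering below are hypothetical; the useful property is that an empty anchor list is itself a warning signal:

```python
# Sketch: attaching context attribution to a suggestion so the reviewer
# can see what the model actually looked at. Field names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Attribution:
    files_considered: list[str] = field(default_factory=list)
    patterns_matched: list[str] = field(default_factory=list)
    spec_anchors: list[str] = field(default_factory=list)  # empty = unanchored

def render_attribution(a: Attribution) -> str:
    lines = [f"Context: {len(a.files_considered)} file(s) considered"]
    lines += [f"  - {f}" for f in a.files_considered]
    if not a.spec_anchors:
        lines.append("  warning: no part of the spec anchored this output")
    return "\n".join(lines)

a = Attribution(files_considered=["billing/retry.py", "billing/errors.py"])
print(render_attribution(a))
```

Seeing “2 files considered” on a suggestion that touches a ten-file subsystem tells the reviewer, before reading a line of code, that the deep context gap is in play.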

Why This Is Hard to Ship
If explainability is the fix, why isn’t it in every code assistant already? There are real engineering reasons, and it’s worth being honest about them.
Producing a coherent reasoning trace adds latency. In a streaming autocomplete interface, that’s friction developers will notice immediately. Confidence scoring at the token level is computationally expensive and, for very large outputs, adds up. And there’s a UX problem: how do you surface a reasoning trace without overwhelming the interface? Nobody wants a 400-word explanation appearing next to a 20-line function every time they tab-complete.
The answer isn’t to surface everything. It’s to surface the right things at the right time. A well-designed explainability layer is progressive — quiet when confidence is high, active when something warrants a second look. The model flags its own uncertainty; the interface amplifies that signal only when it crosses a threshold. The developer sees nothing unusual for routine completions, and gets a clear warning when the AI is operating outside its competence.
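The progressive behavior described above reduces to a thresholding policy: stay silent for routine completions, amplify the model’s own uncertainty signal when it crosses a line. The threshold value below is an illustrative assumption:

```python
# Sketch: a progressive explainability layer. Quiet above a confidence
# threshold, loud below it. The 0.6 cutoff is an illustrative assumption.
from typing import Optional

WARN_THRESHOLD = 0.6

def presentation(confidence: float, trace: str) -> Optional[str]:
    """Return None for routine completions; surface the reasoning trace
    only when the model's own uncertainty crosses the threshold."""
    if confidence >= WARN_THRESHOLD:
        return None  # nothing unusual: stay out of the developer's way
    return f"low confidence ({confidence:.2f}): {trace}"

assert presentation(0.9, "...") is None  # routine completion, no UI noise
print(presentation(0.4, "conflicting error-handling examples in context"))
```

The policy can grow more sophisticated (per-language thresholds, escalation on security-sensitive paths), but even this two-state version delivers the core promise: silence is informative.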
This is achievable with current model capabilities. It requires engineering will, product prioritization, and a willingness to accept that the metric “accepted completions per session” is the wrong thing to optimize for. The right metric is: how often does the developer discover an error before it makes it to production?
What This Means for Your Team Right Now
The METR study’s conditions — experienced developers, complex mature codebases, real tasks — describe a huge portion of professional software development. If your senior engineers are working on systems they know deeply, and they’re using AI coding tools without explainability, the 19% slowdown finding is a reasonable prior for what you might find if you ran your own experiment.
Three things worth doing today, without waiting for your code assistant vendor to ship an XAI layer:
Require the prompt in the PR. One buried insight from a 2026 consulting report: teams that include the AI prompt that generated code alongside the PR see dramatically faster reviewer comprehension. This is explainability implemented at the process level, not the tool level. It costs nothing and ships tomorrow.
Use chain-of-thought prompting deliberately. Ask your AI tools to explain their reasoning before generating. Most frontier models support this. The output is longer, but the reviewer’s burden drops. For complex, security-sensitive, or architecturally significant code, this trade-off is unambiguously worth it.
Measure trust, not just throughput. Track how often AI suggestions are accepted unmodified, how often they’re substantially edited before commit, and how often issues traced back to AI code appear in review or production. These numbers — not lines of code per day — tell you whether your AI tooling is actually working.
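The third recommendation can start as a spreadsheet or a few lines of script. The record shape below is a hypothetical example, not any tool’s real export format; the two ratios are the ones worth watching over time:

```python
# Sketch: the trust metrics suggested above, computed from review records.
# The record fields are hypothetical, not a real tool's export schema.
records = [
    {"accepted": True,  "edited_before_commit": False, "issue_in_review": False},
    {"accepted": True,  "edited_before_commit": True,  "issue_in_review": True},
    {"accepted": False, "edited_before_commit": False, "issue_in_review": False},
    {"accepted": True,  "edited_before_commit": True,  "issue_in_review": False},
]

accepted = [r for r in records if r["accepted"]]
unmodified_rate = sum(not r["edited_before_commit"] for r in accepted) / len(accepted)
issue_rate = sum(r["issue_in_review"] for r in accepted) / len(accepted)

print(f"accepted unmodified: {unmodified_rate:.0%}")   # share taken as-is
print(f"issues traced to AI code: {issue_rate:.0%}")   # share flagged in review
```

A falling unmodified-acceptance rate paired with a rising issue rate is the quantitative shape of the trust collapse the Stack Overflow survey measured — and a far better dashboard than completions accepted per session.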
“The skill of 2026 is not writing a QuickSort algorithm. It is looking at an AI-generated QuickSort and instantly spotting that it uses an unstable pivot. That requires higher expertise, not lower — and it requires explainability to do efficiently.”
The Larger Point
The 19% slowdown study is, by the researchers’ own framing, a snapshot. The models are getting better. Agentic tools are evolving quickly. The METR team noted that one developer with more than 50 hours of Cursor experience actually saw a positive speedup — suggesting there’s a real learning curve to climb, not just a capability floor to wait for.
But the perception gap is not going to close on its own. That requires transparency. The industry has spent three years building tools that are impressive to watch. The next three years have to be spent building tools developers can actually reason about — tools where “I don’t know why it suggested this” is never the answer.
The developers who will do their best work in an AI-assisted world aren’t the ones who accept the most suggestions. They’re the ones who know, at a glance, which suggestions to trust.
Explainability is how you give them that ability. Everything else — the faster models, the better context windows, the agent modes — is table stakes. This is the feature that turns a tool developers use into a tool developers rely on.
Sources: METR, “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity,” July 2025 (arXiv:2507.09089) · Stack Overflow Developer Survey 2025 · CodeRabbit, “State of AI vs. Human Code Generation Report,” December 2025 · Baytech Consulting, “Mastering the AI Code Revolution in 2026” · Stack Overflow Engineering Blog, “Closing the Developer AI Trust Gap,” February 2026 · OpenAI (Kalai et al.), September 2025 hallucination training study.
The 19% Slowdown Paradox was originally published in Towards AI on Medium.