88% of AI agents never reach production. Here’s a framework for spotting the fakes before you waste months building on one.

88% of AI agents never make it to production.
Not because the technology doesn’t work, but because most of what’s being sold as an “AI agent” isn’t an agent at all.
The industry calls it agent washing — rebranding chatbots, RPA tools, and hardcoded automation scripts as “agentic AI” to ride the hype wave. Out of thousands of vendors claiming to sell AI agents, only about 130 are building genuinely agentic systems. The rest? Expensive chatbots with a new label.
I build multi-agent systems for SDLC automation. Over the past year, I’ve evaluated dozens of tools, built agents that actually ship to production, and watched plenty of “agents” crumble the moment they hit a real-world edge case. This article is the framework I use to tell the difference — and the patterns I’ve seen in agents that actually survive production.
By the end, you’ll have a concrete checklist to evaluate any AI agent claim, and a clearer picture of what production-grade agents actually look like under the hood.
What Agent Washing Actually Looks Like
Let’s start with the pattern. Agent washing typically shows up in three flavors:
The relabeled automation. A marketing platform that orchestrates email sequences based on fixed rules gets rebranded as an “agentic marketing system.” Nothing changed under the hood — just the pitch deck.
The chatbot upgrade. A customer service bot that routes tickets to humans based on keyword matching becomes an “autonomous support agent.” It’s still matching keywords. It’s still routing to humans.
The single-LLM wrapper. A tool that makes one API call to GPT, formats the response, and returns it. That’s an API call, not an agent. An agent makes 8–15 internal calls to reason, plan, execute tools, evaluate results, and iterate.
The common thread: none of these systems decide what to do next. They follow a script. When the script doesn’t cover the situation, they break.
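To make the wrapper-versus-agent distinction concrete, here is a minimal sketch of the internal loop a real agent runs instead of one API call. `llm_call` is a stub standing in for a real model API, and the step names are illustrative; the point is the call pattern, not the stub’s logic.

```python
# Hypothetical sketch: the internal call loop behind one "resolve this" request.
def llm_call(prompt: str) -> str:
    # Stub: a real system would call a model API here.
    if "plan" in prompt:
        return "search; draft; check"
    if "evaluate" in prompt:
        return "good"
    return f"result for: {prompt}"

def run_agent(goal: str) -> tuple[str, int]:
    calls = 0
    plan = llm_call(f"plan steps for: {goal}")   # call 1: plan the approach
    calls += 1
    result = ""
    for step in plan.split("; "):
        result = llm_call(f"execute {step} toward {goal}")  # one call per step
        calls += 1
        verdict = llm_call(f"evaluate: {result}")           # one call to self-check
        calls += 1
        if verdict != "good":
            result = llm_call(f"retry {step}")              # extra call on failure
            calls += 1
    return result, calls
```

Even this toy loop makes seven internal calls for a three-step plan; a real agent with retries and tool use lands in that 8–15 range.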
Agent washing isn’t just a marketing problem — it has real consequences. A March 2026 survey of 650 enterprise tech leaders found that 78% have at least one agent pilot running, but only 14% have scaled an agent to production. That gap exists partly because teams are building on foundations that were never agentic to begin with.
The 5-Point “Is This Actually an Agent?” Checklist
After building and evaluating agent systems for the past year, I’ve landed on five criteria that reliably separate real agents from washed ones. I use this checklist when evaluating tools, reviewing architecture, or even auditing my own builds.
1. Does It Reason About What to Do Next?
A real agent receives a goal and decides its own sequence of steps. It doesn’t follow a fixed DAG or hardcoded workflow.
```python
# Agent-washed: Fixed pipeline, no reasoning
def process_ticket(ticket):
    summary = llm.summarize(ticket.description)
    category = llm.classify(summary)
    route_to_team(category)  # Always the same three steps

# Real agent: Decides its approach based on context
def resolve_ticket(ticket, agent):
    plan = agent.plan(
        goal=f"Resolve ticket: {ticket.description}",
        available_tools=[search_docs, query_db, check_logs, escalate],
        constraints=["SLA: 4 hours", "Don't modify production data"],
    )
    # Agent might search docs first, or check logs first,
    # or skip straight to escalation — depends on the ticket
    result = agent.execute(plan)
    return result
```
The difference is subtle in code but massive in behavior. The first version does the same three things every time. The second version adapts based on what it finds.
Test it: Give the system a task it hasn’t seen before. Does it figure out a path, or does it crash?
2. Does It Recover When a Step Fails?
This is where most “agents” expose themselves. A real agent handles failure as part of its workflow — retry with a different approach, fallback to an alternative tool, or gracefully degrade.
```python
# Real agent failure recovery pattern
class AgentStep:
    def execute_with_recovery(self, step, context):
        try:
            result = step.run(context)
            if not result.meets_quality_bar():
                # Agent DECIDES to try a different approach
                alt_plan = self.replan(
                    failed_step=step,
                    reason=result.quality_issues,
                    context=context,
                )
                result = self.execute(alt_plan)
            return result
        except ToolError as e:
            # Agent switches tools, not just retries
            fallback_tool = self.select_alternative_tool(
                failed_tool=step.tool,
                error=e,
                available_tools=context.tools,
            )
            return fallback_tool.run(context)
```
An agent-washed product crashes, returns garbage, or silently ignores the failure. A real agent treats failure as information and adjusts.
Test it: Deliberately break one of the system’s dependencies (rate-limit an API, corrupt a data source). Does it adapt or die?
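One practical way to run this test is to wrap a tool so it fails on demand and see whether the system still finishes. This is a hedged sketch, not a prescription: `make_flaky` injects failures, and the toy agent and its tools are placeholders for whatever you are actually evaluating.

```python
# Fault-injection sketch: break a dependency, observe whether the agent adapts.
class RateLimitError(Exception):
    pass

def make_flaky(tool, fail_times: int):
    """Return a wrapper that raises for the first `fail_times` calls."""
    state = {"calls": 0}
    def wrapper(*args, **kwargs):
        state["calls"] += 1
        if state["calls"] <= fail_times:
            raise RateLimitError("injected failure")
        return tool(*args, **kwargs)
    return wrapper

# Toy system under test, with a fallback path.
def search_api(query):
    return f"api:{query}"

def search_cache(query):
    return f"cache:{query}"

def toy_agent(query, primary, fallback):
    try:
        return primary(query)
    except RateLimitError:
        return fallback(query)  # a real agent would decide this itself

flaky_search = make_flaky(search_api, fail_times=3)
answer = toy_agent("reset password", flaky_search, search_cache)
```

If the system under test has no path from `RateLimitError` to a useful answer, it fails this criterion.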
3. Does It Complete Tasks End-to-End Without Hand-Holding?
Real agents take a goal and deliver a result. They don’t stop at every checkpoint to ask a human what to do next.
This doesn’t mean zero human involvement — human-in-the-loop is a valid and often necessary pattern, especially for high-stakes decisions. But there’s a difference between:
- Real HITL: Agent does 95% of the work, surfaces a decision point for human approval, then continues autonomously.
- Fake autonomy: System needs human input at every step, but the vendor calls each step “an agent action.”
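The real-HITL shape above can be sketched as an approval gate: the agent runs autonomously and pauses only at explicitly high-stakes steps. All names here are illustrative assumptions; `approve_fn` stands in for however your system surfaces a decision to a human (Slack, a review queue, a UI).

```python
# Sketch: autonomous execution with a single human approval gate.
from dataclasses import dataclass, field

@dataclass
class TaskRun:
    completed: list = field(default_factory=list)
    approvals_requested: int = 0

def run_with_approval(steps, is_high_stakes, approve_fn):
    run = TaskRun()
    for step in steps:
        if is_high_stakes(step):
            run.approvals_requested += 1
            if not approve_fn(step):     # human said no: stop safely
                return run
        run.completed.append(step)        # otherwise continue autonomously
    return run

steps = ["gather logs", "draft fix", "deploy to production", "notify team"]
run = run_with_approval(
    steps,
    is_high_stakes=lambda s: "production" in s,
    approve_fn=lambda s: True,            # stub: human approves
)
```

One approval request across four steps is real HITL; four approval requests across four steps is a workflow with a human engine.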
Test it: Give it a multi-step task and walk away. Come back in 10 minutes. Is it done, blocked, or did it silently fail?
4. Does It Use Tools Dynamically?
A real agent selects and uses tools based on what the situation needs — not a hardcoded tool sequence.
```python
# Agent-washed: Always uses the same tools in the same order
tools = [web_search, summarize, format_output]  # Fixed chain

# Real agent: Selects tools based on the task
def select_tools(self, task, context):
    if task.requires_data_lookup:
        tools = [query_database, validate_results]
    elif task.requires_research:
        tools = [web_search, cross_reference, cite_sources]
    elif task.requires_code:
        tools = [read_codebase, generate_code, run_tests]
    else:
        tools = []  # Nothing matched: let the planner pick from the full inventory
    # Agent might add tools mid-execution if it discovers
    # it needs something it didn't anticipate
    return tools
```
The key signal: can the agent use a tool it wasn’t explicitly told to use for this specific task? If it can reason about its tool inventory and pick the right one, that’s real agency.
Test it: Give it a task that requires a tool combination it’s never seen before. Does it compose the right set?
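As a rough sketch of what “reasoning about a tool inventory” can look like: keep tools in a registry with natural-language descriptions and let the selection step match the task against them. In production that matcher would be an LLM call over the descriptions; a keyword-overlap stub stands in here so the behavior is testable. All names are illustrative.

```python
# Sketch: a tool registry the agent can reason over, not a fixed chain.
TOOL_REGISTRY = {
    "query_database": "look up structured records in internal databases",
    "web_search":     "find information on public web pages",
    "run_tests":      "execute the project test suite and report failures",
    "read_codebase":  "inspect source files in the repository",
}

STOPWORDS = {"the", "a", "and", "in", "on", "of", "to"}

def compose_toolset(task_description: str) -> list[str]:
    """Pick every tool whose description overlaps the task's stated needs."""
    words = set(task_description.lower().split()) - STOPWORDS
    selected = []
    for name, desc in TOOL_REGISTRY.items():
        if words & (set(desc.lower().split()) - STOPWORDS):
            selected.append(name)
    return selected

# A task none of the fixed chains above were written for:
tools = compose_toolset("inspect source files and execute the test suite")
```

The stub is crude, but the structural point holds: selection happens at runtime against an inventory, so a never-seen tool combination is reachable.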
5. Does It Handle Novel Inputs?
The demo always works. The question is: what happens with input the system has never encountered?
An agent-washed product falls apart outside its training distribution. A real agent applies its reasoning capability to novel situations — maybe not perfectly, but it doesn’t completely break.
Test it: Feed it an input that’s structurally different from the examples in the demo. Real agents degrade gracefully. Fake ones crash or hallucinate catastrophically.
What Production Agents Actually Look Like
Passing the checklist gets you to “real agent.” But there’s another gap between “real agent” and “production-ready agent.” Here’s what I’ve learned about what survives production:
They’re Observable
Every decision the agent makes is logged and traceable. Not just inputs and outputs — the intermediate reasoning steps, tool selections, and retry decisions. When something goes wrong at 2 AM, you need to reconstruct exactly what happened.
```python
# Minimal observability pattern
class TracedAgent:
    def execute(self, task):
        trace_id = generate_trace_id()
        self.logger.info(f"[{trace_id}] Goal: {task.goal}")
        result = None
        for step in self.plan(task):
            self.logger.info(f"[{trace_id}] Step: {step.description}")
            self.logger.info(f"[{trace_id}] Tool: {step.tool.name}")
            self.logger.info(f"[{trace_id}] Reasoning: {step.reasoning}")
            result = step.execute()
            self.logger.info(f"[{trace_id}] Result quality: {result.score}")
            if result.required_retry:
                self.logger.warning(f"[{trace_id}] Retry reason: {result.retry_reason}")
        return result
```
If you can’t trace a decision, you can’t debug it. And if you can’t debug it, it’s not production-ready.
They’re Cost-Controlled
Here’s a number that changes how you think about agents: an unconstrained agent solving a single software engineering task can cost $5–8 in API fees. Run thousands of tasks per day and that math gets brutal.
Production agents use model routing — expensive frontier models for complex reasoning, cheap models for routine execution. The Plan-and-Execute pattern is key here: use Claude to create the plan, then use smaller models to execute each step. This can cut costs by up to 90%.
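A back-of-the-envelope sketch of that routing. The per-call prices and model names below are assumptions for illustration, not real pricing; `call_model` stubs the API calls. The actual savings depend on your models, token counts, and plan shape.

```python
# Plan-and-Execute cost sketch with assumed per-call prices.
FRONTIER_COST = 0.50   # assumed cost of one frontier-model call
CHEAP_COST    = 0.02   # assumed cost of one small-model call

def call_model(model: str, prompt: str) -> str:
    return f"{model}: {prompt}"  # stub for a real API call

def plan_and_execute(goal: str, n_steps: int = 10) -> float:
    cost = 0.0
    call_model("frontier", f"plan: {goal}")   # one expensive planning call
    cost += FRONTIER_COST
    for i in range(n_steps):                  # cheap model executes each step
        call_model("cheap", f"step {i} of {goal}")
        cost += CHEAP_COST
    return cost

routed = plan_and_execute("refactor module")   # 0.50 + 10 * 0.02 = 0.70
all_frontier = FRONTIER_COST * 11              # 5.50 if every call were frontier
```

Under these assumed prices, routing is roughly 87% cheaper than sending all eleven calls to the frontier model, which is where the “up to 90%” figure comes from.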
They Fail Gracefully
Production systems don’t crash — they degrade. When an agent can’t complete a task, it should:
- Log what it tried and why it failed
- Return a partial result if possible
- Escalate to a human with full context
- Never silently return garbage
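The four behaviors above can be sketched as an explicit outcome type the agent returns instead of raising or emitting unverified text. The names (`AgentOutcome`, `finish_task`) are illustrative, not from any particular framework.

```python
# Sketch: a structured outcome instead of a crash or silent garbage.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AgentOutcome:
    status: str                                   # "complete", "partial", or "escalated"
    result: Optional[str] = None
    attempts: list = field(default_factory=list)  # what was tried and why it failed
    escalation_context: Optional[dict] = None

def finish_task(partial_result, attempts, confident: bool):
    if confident:
        return AgentOutcome(status="complete", result=partial_result, attempts=attempts)
    if partial_result is not None:
        # Return what we have, clearly marked as partial, never as a final answer
        return AgentOutcome(status="partial", result=partial_result, attempts=attempts)
    # Nothing usable: hand off to a human with everything we know
    return AgentOutcome(
        status="escalated",
        attempts=attempts,
        escalation_context={"attempts": attempts,
                            "last_error": attempts[-1] if attempts else None},
    )

outcome = finish_task(None, attempts=["db lookup: timeout", "cache: stale"], confident=False)
```

The design choice that matters: there is no code path that returns an unlabeled answer the agent is not confident in.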
The “silently return garbage” failure mode is the most dangerous. I’ve seen agents confidently deliver wrong answers that looked right. That’s worse than crashing.
The Uncomfortable Truth
The pilot-to-production gap isn’t primarily a technology problem. The models are capable. The tooling has matured dramatically.
The gap is organizational.
Five root causes account for 89% of scaling failures: integration complexity with legacy systems, inconsistent output quality at volume, absence of monitoring tooling, unclear organizational ownership, and insufficient domain training data.
Companies that build with production constraints from the start — observability, cost controls, governance — reach production deployment at roughly 3x the rate of companies that prototype first and harden later.
The agent washing problem makes this worse. Teams spend months building on a foundation they were told is “agentic,” only to discover it’s a fancy chatbot that can’t handle edge cases, can’t recover from failures, and can’t scale beyond the demo.
Use the Checklist. Save Yourself Time.
Here’s the framework in one table:
+-------------------+----------------------------------------------+-------------------------------------------+
| Criteria | Real Agent | Agent-Washed Product |
+-------------------+----------------------------------------------+-------------------------------------------+
| Reasoning | Decides what to do next based on context | Follows a hardcoded script |
| Failure Recovery | Replans, retries with different approach | Crashes or returns garbage |
| End-to-End | Completes tasks autonomously | Needs human at every checkpoint |
| Tool Use | Selects tools dynamically | Uses fixed tool chain |
| Novel Inputs | Degrades gracefully | Breaks completely |
+-------------------+----------------------------------------------+-------------------------------------------+
Before you commit budget, headcount, or six months of engineering time to an “AI agent” — run it through these five tests. Feed it something it’s never seen. Break one of its dependencies. Walk away and come back.
If it survives, you might have a real agent. If it doesn’t, you’ve just saved yourself from becoming part of the 88%.
I’m an AI engineer building production multi-agent systems for SDLC automation. I write about what actually works (and what breaks) when you ship AI to production. Follow me for more hands-on production AI content.
Sources
- 88% of AI agents fail to reach production — Digital Applied, 2026
- 40% of agentic AI projects will be canceled by 2027 — AI Agent ROI 2026 Report
- Only ~130 of thousands of vendors are real — Particula Tech, 2026
- 78% have pilots, only 14% scaled to production — March 2026 Survey, Digital Applied
- 5 root causes account for 89% of scaling failures — ML Mastery, 2026
- Agent washing disclosure now a legal risk — Debevoise & Plimpton, March 2026
- Plan-and-Execute cuts costs up to 90% — AI Agent Architecture, Redis 2026
Agent Washing vs. Real Agents — A Production Engineer’s Guide to Telling the Difference was originally published in Towards AI on Medium.