
I don’t write code. I’ve never written code. I direct AI coding agents — Claude Code, mostly — and they build what I describe. Over the last few months, I’ve been building a series of single-task AI agents, each one proving a different idea about how autonomous software should work.
Agent 004 was a red team simulator. It attacked my own infrastructure from the outside — over HTTP, with its own identity, posting real collateral before every action. It ran 15 predefined attacks, then learned to adapt its strategy across rounds, then started writing its own novel attack code and executing it in a sandboxed child process. By the time it was done, it had thrown more than a hundred adversarial scenarios at the system and, in the tested runs, surfaced no exploitable paths.
The sandbox it used — four layers of defense including permission-restricted child processes, global nullification of dangerous APIs, and an IPC-only toolkit where the parent process handles all real operations — held up across every adversarial run I threw at it.
So I asked a different question: what if I took that same sandbox and used it for something constructive?
Because the most expensive bugs usually aren’t in the code. They’re wrong assumptions that get baked into a design and only surface after the system ships.
Act 1: Does This Code Do What It Should?
Agent 005 started as a recursive test generator. You point it at a module — a file full of functions — and it enters a loop: reason about what hasn’t been tested, generate JavaScript test code, execute it in the sandbox, measure what happened, then do it again.
The key word is recursive. Each round learns from the last one. If Round 1 discovers that your factorial function handles positive integers correctly, Round 2 doesn’t waste time re-testing that. Instead, it asks: what about factorial(3.5)? What about factorial(-1)? What about factorial(Infinity)?
When I ran it against a sample strict-math module — functions designed to validate inputs and throw on invalid types — it found that factorial(3.5) silently returns 6 instead of throwing. No error, no validation, just a wrong answer that looks right. It found that isPrime(2.5) returns true. It found that clamp(NaN, NaN, NaN) returns 0 instead of rejecting the input.
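The sample module’s source isn’t shown in this article, but a hypothetical naive implementation makes the failure concrete — an integer-style loop simply stops early on a fractional input, so factorial(3.5) returns the same answer as factorial(3):

```javascript
// Hypothetical naive factorial — NOT the actual module under test.
// The loop condition silently truncates fractional inputs: for n = 3.5
// the body runs for i = 2 and i = 3, then stops because 4 > 3.5.
function factorial(n) {
  let result = 1;
  for (let i = 2; i <= n; i++) result *= i;
  return result;
}

console.log(factorial(3.5)); // 6 — no error, just a wrong answer that looks right
console.log(factorial(-1));  // 1 — loop body never runs, no validation either
```

A strict version would start with a guard like `if (!Number.isInteger(n) || n < 0) throw new TypeError(...)` — exactly the validation the finding shows is missing.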
None of these are hypothetical. The sandbox ran the actual function with the actual input and got the actual output. That’s not a reviewer’s opinion. It’s executable evidence.
That was v0.1.0. Fifty-five tests. Tagged and shipped.
Act 2: Is This Code Correct and Robust?
Testing inputs is valuable, but it’s not the same as understanding what’s wrong with the code. Version two shifted the question from “what inputs break this?” to “what do I think is broken — and can I prove it?”
The recursive code reviewer works differently from a fuzzer, which blindly throws inputs at boundaries hoping something breaks. Agent 005 reads the source code and forms a specific hypothesis: “I think this function doesn’t handle negative inputs.” “I think there’s a performance cliff with large arrays.” “I think this edge case produces a silent wrong answer.”
Then it writes a proof script. Not a test — a proof. The prove() method wraps a structured verdict: confirmed, refuted, or inconclusive, with the specific failure mode and evidence attached. A confirmed finding looks something like:
[confirmed] divide(Infinity, Infinity) → NaN
failure_mode: silent_wrong_answer
evidence: expected throw, got NaN
In the first smoke test, the reviewer ran three rounds against the same sample module. Round 1 cast a wide net — nine hypotheses, two confirmed. Round 2 adapted: it doubled down on the edge case patterns that worked and abandoned the performance hypotheses that kept getting refuted. Round 3 went deeper still.
Seven real findings across three rounds. Each one backed by executable proof, not a reviewer’s intuition.
The point wasn’t that it found exotic zero-days in a toy math file. The point was that it could form a falsifiable hypothesis, generate executable proof, and adapt its next round based on what held up.
Eighty-five tests. Tagged and shipped.
Act 3: Is This Design Sound Before I Build It?
Version three changes the target entirely.
Versions one and two both require existing code. You need a module to test, a file to review. But the most expensive bugs aren’t in the implementation — they’re in the design. The spec said the wrong thing, or didn’t say anything at all about an edge case that turns out to matter. By the time you find it, you’ve already built the system around a flawed assumption.
Version three doesn’t need code. It needs a spec.
You hand it an API specification — a markdown document describing endpoints, actors, permissions, and business rules — and it runs a three-phase pipeline:
First, it extracts a normalized summary from the raw spec: endpoints, actors, resources, state variables, business rules, invariants, allowed transitions, and — critically — unknowns. Places where the spec is silent.
Second, it generates a behavioral model. This is executable JavaScript that simulates the API: a state machine with handlers for every endpoint, invariant checkers, and explicit assumptions about the gaps in the spec. The model isn’t the real system. It’s a simulation of what the spec says the system should do.
Third, it generates adversarial attack sequences and runs them against the model. Authorization bypass attempts. State corruption through sequences of individually valid requests. Boundary violations. Ordering dependencies. Attacks on the assumptions the model had to make because the spec didn’t specify the behavior.
Each round, the model gets refined based on findings, and the attacker generates new sequences targeting uncovered surface area. But here’s the part that matters: the model tracks every modification in a change log, and each change is classified. Was it a legitimate bug fix? An ambiguity clarification? Or a suspicious adaptation — the model quietly adjusting itself to pass the attacks instead of revealing a real flaw?
That classification system helps make the model’s drift inspectable. Without it, the model could paper over real design problems by silently changing its behavior between rounds. With it, you can review the change log and catch the moments where the model might be hiding something.
When I ran it against a sample API spec for a user management system, it found authorization bypass paths — requests that should have been rejected but weren’t, because the spec didn’t explicitly forbid them. It found state corruption sequences where a series of valid calls left the system in an inconsistent state. And it found spec ambiguities — places where two reasonable interpretations of the same rule led to different security outcomes.
In other words, it wasn’t just finding bad inputs. It was finding valid-looking request sequences and missing authorization rules that could produce unsafe states.
All of this before a single line of the production system existed.
I want to be precise about what this is. It’s not formal verification — it doesn’t mathematically prove your design is correct. It’s recursive empirical verification: the model proposes, the sandbox executes, and the outputs are observable. But the abstractions still need human review. The behavioral model is an approximation of the spec, not the spec itself.
One hundred forty-nine tests. Tagged and shipped.
What It Can’t Do
The useful part is knowing exactly where the boundaries are.
The model can drift. Each round refines the behavioral model based on findings. The change log tracks modifications and classifies them, but it’s possible for the LLM to label a suspicious adaptation as a bug fix. The tracking mitigates drift — it doesn’t eliminate it. For high-stakes specs, a human should review the change log.
Fidelity is structural, not semantic. The fidelity checker verifies that every spec endpoint has a corresponding handler and every invariant has a check — by string matching. It doesn’t verify that the handler correctly implements the spec rule. It can tell you the model covers the spec’s surface area. It can’t tell you the model got the logic right. That still requires human review.
Severity is assisted, not authoritative. When the tool classifies a finding as “critical” or “high,” it’s using structural signals — did an invariant fail? Did an unauthorized request succeed? That’s a useful triage heuristic, but the actual business impact depends on context the tool can’t evaluate. It’s a starting point for human judgment, not a replacement for it.
The sandbox protects execution, not import. Generated test, proof, and attack code runs inside the four-layer sandbox. But the target module itself — the code you’re verifying in test and review modes — is loaded by the parent process. Modules with import-time side effects execute outside the safety boundary. In practice, Agent 005 works best on trusted local code and specs, not arbitrary untrusted packages.
The Bigger Point
Three modes, same engine: reason → generate → execute → score → repeat. The only thing that changes between versions is the level of abstraction.
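The loop shape can be sketched in a few lines — all four phases here are trivial stubs standing in for LLM calls and sandboxed execution, not Agent 005’s actual code; only the structure is the point:

```javascript
// The shared engine loop, with every phase stubbed. In the real agent,
// reason/generate are LLM calls and executeInSandbox runs real code; the
// only thing that changes between versions is what these phases operate on.
function reason(memory) { return `hypothesis #${memory.length + 1}`; }      // what's untested?
function generate(hypothesis) { return `/* test for ${hypothesis} */`; }    // executable JS
function executeInSandbox(script) { return { passed: false }; }             // real run
function score(hypothesis, outcome) {                                       // structured verdict
  return { hypothesis, verdict: outcome.passed ? "refuted" : "confirmed" };
}

function engineRound(memory) {
  const hypothesis = reason(memory);
  const script = generate(hypothesis);
  const outcome = executeInSandbox(script);
  return [...memory, score(hypothesis, outcome)]; // next round learns from this
}

let memory = [];
for (let round = 1; round <= 3; round++) memory = engineRound(memory);
console.log(memory.length); // 3 — one finding per round, each visible to the next
```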
Version one operates on behavior — does the code produce the right output? Version two operates on implementation — is the code built correctly? Version three operates on design — is the idea sound before you build it?
That progression — concrete to abstract, after the fact to before the fact — is what makes this more than a testing tool. It’s a verification framework that works at every layer of the stack, including the layer where the most expensive mistakes happen: the design phase, where the blueprint is still just words on a page.
Agent 004 caught bugs in my code. Agent 005 caught bugs in my thinking.
The code is open source: github.com/selfradiance/agent-005-recursive-verifier
This is part of an ongoing series about building AI agents with economic accountability. Previous articles cover AgentGate (the collateralized execution engine) and Agent 004 (the red team simulator that attacked it).
What If You Could Break Your API Design Before Writing a Single Line of Code? was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.