The Loop: How an AI Swarm Surfaced a Governance Limitation, Then Tested the Fix

AgentGate is a runtime accountability layer for AI agents: before an agent can execute a high-impact action, it must lock a bond as collateral. Good outcomes release the bond. Bad outcomes slash it. The mechanism makes bad behavior economically irrational.

In March 2026, a coordinated swarm of nine AI agents ran 97 attacks against AgentGate. One team — Beta — spent 48 clean bond cycles building reputation and earned nothing for it. Bond capacity was mathematically enforced but not reputation-gated: a brand-new identity had access to the same bond-locking capacity as one with a spotless track record.

The original swarm campaign classified this as a governance limitation, not a vulnerability. AgentGate’s core defenses held. Gamma maintained a 100% catch rate across all 38 of its attacks. The campaign produced zero swarm-emergent findings.

We responded by implementing the Progressive Trust Model — reputation-gated bond capacity enforced at bond-lock time. Then we re-ran the swarm campaigns against the hardened system. The swarm verified tier promotion, distinct-resolver gating, and demotion after a malicious resolution. The static Tier 2 bond cap (500¢ accepted, 501¢ rejected with TIER_BOND_CAP_EXCEEDED) was confirmed separately through a targeted direct test, because the Claude-powered strategist never selected the high-value bond probe across any of the three runs.

This article shows what that loop looked like end to end — and where the verification is incomplete.

Act 1 — The Limitation

The canonical Agent 004 v0.5.0 swarm campaign ran 9 agents across 3 teams for 5 rounds: Alpha (reconnaissance, 37 attacks), Beta (trust exploitation, 22 attacks), and Gamma (coordinated economic pressure, 38 attacks).

Of those 97 attacks, 77 were caught and 20 were uncaught. But the uncaught attacks didn’t map to the trust-model issue. They fell into categories AgentGate was already aware of — for example, unauthenticated GET endpoints that return system information by design, and recon-category boundary probes targeting surfaces outside the bond-enforcement layer. None of the 20 represented newly discovered vulnerabilities. Gamma’s economic pressure attacks — the category most relevant to bond integrity — were caught at a 100% rate (38 of 38). Zero of the 20 uncaught attacks were classified as swarm-emergent.

The real finding was structural: Beta’s trust-building phase completed 48 clean bond cycles, and the system treated that identity identically to one created five seconds ago. No elevated bond-locking capacity was earned. The system could not distinguish newly created identities from proven ones when deciding how much bond capacity to grant — it enforced accountability but couldn’t differentiate at the point of economic delegation.

The canonical run documented this as an enhancement opportunity, not a vulnerability.

Act 2 — The Fix

AgentGate now computes reputation tiers at bond-lock time via computeTrustTier(), with no stored tier state:

Tier 1 “New” (default): 100¢ bond cap.
Tier 2 “Established”: 500¢ bond cap. Requires 5 or more qualifying successes from 2 or more distinct resolvers, with zero malicious resolutions.
Tier 3 “Trusted”: No tier cap (normal capacity rules apply). Requires 20 or more qualifying successes from 20 or more distinct resolvers, with zero malicious resolutions.

A single malicious resolution forces immediate demotion to Tier 1.
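The tier rules above can be sketched as a stateless function. This is a minimal illustration, not AgentGate's implementation: the source only names computeTrustTier(); the Resolution record shape and field names here are assumptions.

```typescript
// Assumed shape of a resolution record: who resolved the action and how it ended.
interface Resolution {
  resolverId: string;
  outcome: "success" | "failure" | "malicious";
  qualifying: boolean; // met the minimum-exposure threshold
}

type TrustTier = 1 | 2 | 3;

// Stateless tier computation: derived from resolution history at bond-lock
// time, with no stored tier state.
function computeTrustTier(history: Resolution[]): TrustTier {
  // A single malicious resolution forces Tier 1.
  if (history.some((r) => r.outcome === "malicious")) return 1;

  const qualifying = history.filter(
    (r) => r.outcome === "success" && r.qualifying
  );
  const distinctResolvers = new Set(qualifying.map((r) => r.resolverId)).size;

  // Tier 3 "Trusted": 20+ qualifying successes from 20+ distinct resolvers.
  if (qualifying.length >= 20 && distinctResolvers >= 20) return 3;
  // Tier 2 "Established": 5+ qualifying successes from 2+ distinct resolvers.
  if (qualifying.length >= 5 && distinctResolvers >= 2) return 2;
  // Tier 1 "New": the default.
  return 1;
}

// Tier bond caps in cents; null means no tier cap (normal capacity rules apply).
const TIER_BOND_CAP: Record<TrustTier, number | null> = { 1: 100, 2: 500, 3: null };
```

Because the tier is recomputed from history on every bond lock, a malicious resolution demotes the identity immediately — there is no cached tier to invalidate.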

Then the Codex cold-eyes audit caught a cheap-farming loophole before it shipped. The initial implementation let an agent rack up qualifying successes with trivial-exposure actions. Three hardening measures were added:

Minimum exposure threshold. Only actions with 100¢ or more of effective exposure (after the 1.2× risk multiplier) count as qualifying successes. Under the Tier 1 cap of 100¢, the tightest valid path is declaring 83¢ exposure — ceil(83 × 1.2) = 100¢ effective, exactly meeting the threshold. This means cheap farming is closed, but Tier 1 agents still have a narrow valid path to earn qualifying successes and advance — a deliberate design tradeoff.

Distinct resolver requirement. Each qualifying success must come from a different resolver identity. This raises the bar against trivial self-dealing and low-identity collusion. It does not prevent collusion in any strong sense — coordinated identities controlling enough resolvers could still game the system — but it makes the cheapest farming paths unviable.

Self-resolution forbidden. An identity cannot resolve its own actions (403 SELF_RESOLUTION_FORBIDDEN).
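The exposure arithmetic and the self-resolution check can be sketched together. Only the 1.2× risk multiplier, the 100¢ floor, and the SELF_RESOLUTION_FORBIDDEN code come from the text; the function names and error shape are illustrative.

```typescript
const RISK_MULTIPLIER = 1.2;
const MIN_EFFECTIVE_EXPOSURE_CENTS = 100;

// Effective exposure after the risk multiplier, rounded up to whole cents.
function effectiveExposure(declaredCents: number): number {
  return Math.ceil(declaredCents * RISK_MULTIPLIER);
}

// Minimum exposure threshold: only sufficiently risky actions count toward
// tier promotion. 83¢ declared -> ceil(99.6) = 100¢ effective, the tightest
// valid path under the Tier 1 cap; 82¢ declared -> 99¢ effective, excluded.
function isQualifyingSuccess(declaredCents: number): boolean {
  return effectiveExposure(declaredCents) >= MIN_EFFECTIVE_EXPOSURE_CENTS;
}

// Self-resolution forbidden: an identity cannot resolve its own actions
// (the API rejects this with 403 SELF_RESOLUTION_FORBIDDEN).
function checkResolver(executorId: string, resolverId: string): void {
  if (executorId === resolverId) {
    throw new Error("SELF_RESOLUTION_FORBIDDEN");
  }
}
```

The distinct-resolver requirement is then a property of the accumulated history rather than of any single resolution: counting unique resolver identities across qualifying successes.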

A methodological caveat about the distinct-resolver threshold. The Tier 2 threshold was originally set at 5 distinct resolvers. During re-testing, this was lowered to 3, then to 2. The reason: a 3-agent team can only produce 2 distinct resolvers per executor (the other two teammates). With the original threshold of 5, Tier 2 was unreachable regardless of how many qualifying successes an agent accumulated. The threshold was reduced to match the maximum resolver diversity the test environment’s team topology can produce. The mechanism itself — requiring multiple independent resolvers — is the defense against collusion. The threshold number scales with deployment size. A production system with more participants would use a higher threshold.

AgentGate’s tier system passed 121 tests across 12 suites.

Act 3 — The Re-Test

I reran three full swarm campaigns against the hardened system. The goal was to verify three specific enforcement mechanisms.

Swarm-Verified: Tier Promotion, Gating, and Demotion

Beta-2 reached Tier 2 in runs 2 and 3, accumulating 8 qualifying successes from 2 distinct resolvers with zero malicious resolutions. Promotion through the distinct-resolver gate was confirmed working.

Beta-1 accumulated 8 qualifying successes but was blocked at Tier 1 — only 1 distinct resolver. The gating rule held. Same number of successes, different outcome, because resolver diversity was insufficient.

Beta-3 was immediately demoted to Tier 1 after a malicious resolution. Demotion was confirmed working.

Gamma maintained a 100% catch rate across all three re-test runs. Its economic-pressure attacks were caught in every run regardless of tier state.

Directly Verified: Tier 2 Bond Cap

The Claude-powered strategist never selected the highValueBondAttempt task across any of the three swarm runs. This is worth flagging because it’s evidence that LLM-driven campaign selection can miss relevant tests even in repeated runs — the strategist had no strategic reason to attempt a high-value bond before confirming Tier 2 promotion, and only one agent reached Tier 2.

The Tier 2 bond cap was verified through a targeted direct test: a 500¢ bond locked successfully; a 501¢ attempt was rejected with TIER_BOND_CAP_EXCEEDED.
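That boundary test reduces to a single check at bond-lock time. The sketch below is illustrative — only the 500¢/501¢ values and the TIER_BOND_CAP_EXCEEDED code come from the run; the function name and return shape are assumptions.

```typescript
const TIER2_BOND_CAP_CENTS = 500;

// Enforce the tier cap at bond-lock time. A null cap means no tier cap
// (Tier 3: normal capacity rules apply instead).
function lockBond(tierCapCents: number | null, amountCents: number): { ok: true } {
  if (tierCapCents !== null && amountCents > tierCapCents) {
    throw new Error("TIER_BOND_CAP_EXCEEDED");
  }
  return { ok: true };
}
```

The direct test is then the two calls from the article: lockBond at exactly 500¢ succeeds, and 501¢ is rejected with TIER_BOND_CAP_EXCEEDED.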

Did the tier system change attack outcomes?

No. The caught/uncaught ratio reflects AgentGate’s existing detection mechanisms, not the tier system. What tiers add is economic constraint on new identities — limiting how much bond-locking capacity an unproven agent can access, not improving the catch rate itself.

Agent 004’s re-test passed 330 tests across 34 files. Zero swarm-emergent findings across all runs.

What This Does Not Prove

Tier 3 is untested. It requires 20 distinct resolver identities. The 9-identity swarm cannot reach it. Testing Tier 3 requires a larger campaign — that’s a future step.

This does not prove general economic containment. Gamma’s 100% catch rate applies to the tested attack families in the tested campaign structure. It is not a universal proof.

No novel vulnerability class was discovered. The swarm found zero swarm-emergent findings across all runs. That may reflect a genuinely secure system, or it may reflect the constraints of the attacker: 9 agents, a predefined attack library, and an IPC-restricted sandbox. Both interpretations are honest.

LLM-driven campaign selection has blind spots. The strategist never selected the high-value bond probe across three full runs. Autonomous attack orchestration cannot be assumed to cover all relevant test paths — targeted testing remains necessary alongside campaign-driven verification.

No independent third-party verification. The re-test was run by the same Agent 004 codebase that surfaced the original limitation. The architectural isolation is real — HTTP-only boundary, no database access, separate identities, sandbox constraints — but it is not the same as an external red team.

The distinct-resolver threshold (2) was calibrated to the test environment. Production deployments with more participants should use a higher threshold.

The Loop

This is a concrete, end-to-end example of the red-team → defense → re-test loop applied to AI agent governance:

An AI swarm campaign surfaced a governance limitation: bond capacity was enforced but not reputation-gated.
A trust-gated bond model was implemented and hardened against farming and self-dealing.
The same swarm campaigns were re-run to verify specific enforcement mechanisms — with explicit acknowledgment that one key mechanism required direct testing because the autonomous strategist didn’t select it.
Limitations and untested branches were documented explicitly.

The loop is human-directed. I chose to build the fix, I chose to re-run the swarm, I made the calibration decisions. The AI agents handled the implementation and the attack execution. The entire build — AgentGate’s tier system, the hardening, the re-test campaigns — was directed by someone who doesn’t write code, using AI coding agents for all implementation.

The value of showing this loop publicly isn’t any claim of invincibility. It’s that the full cycle is documented with real data, real code, and explicit acknowledgment of what remains unproven.

The code for Agent 004 and AgentGate is on GitHub. The previous articles cover Stages 1–4 and Stage 5 (Swarms).


The Loop: How an AI Swarm Surfaced a Governance Limitation, Then Tested the Fix was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
