Prose Is a Suggestion. Agent Harnesses Need Cages.

MCP, MCP everywhere, and not an enforcer in sight.

The prevailing thesis in 2026 is that enterprise SaaS becomes the backend, and the agent harness (e.g., Claude Cowork) becomes the app. It’s a reasonable bet for a class of productivity work. It’s the wrong bet, or at least a premature one, for any task where getting the steps right matters more than producing fluent output.

What the courts show about where AI landed first

A March 2026 working paper by Anand Shah (MIT) and Joshua Levy (USC) documents AI’s first large-scale effect on a regulated system. Drawing on 4.5 million federal civil cases and 46 million PACER docket entries, they show that the pro se filing rate in federal courts, steady at 11% for two decades, jumped to 16.8% by FY2025. The increase is almost entirely in case types with standardized complaint templates: civil rights, consumer credit, foreclosure. Cases requiring specialist credentials or claim-specific pleading regimes (patent bar admission, PSLRA particularity, qui tam seal rules) barely moved.

The split the paper documents is templateable vs specialist. The mechanism is a Becker home-production model where AI drops the cost of drafting enough that for templateable work, self-representation clears the threshold. For specialist work it doesn’t.

The paper studies what happened, not what could happen. It measures the boundary as of 2026 with general-purpose LLMs used in chat mode by non-experts. That boundary moves inward with better tooling. A workflow with retrieval over regulatory corpora, rule-engine validation of pleading requirements, programmatic checks against statute, and human-in-the-loop gates can produce securities complaints that consumer chat cannot. The specialist category has a structural component that responds to workflow sophistication and a credentialing component that doesn’t, because no tool creates a license.

The paper’s boundary is behavioral, not a ceiling. It shows where AI landed first. It doesn’t show where AI stops.

The harness question the paper doesn’t ask

Moving the boundary inward requires infrastructure the paper doesn’t study. You need skills that know the domain, tools the skills can invoke to enforce correctness, context management that keeps hard constraints in scope for long tasks, audit state that persists outside the agent’s working context, and clean human handoffs at the points regulation requires a licensed attestation.

This is the agent harness. Claude Cowork is one. Cursor’s agent mode is another. The push across the industry is to make the harness the platform, with MCP as the tool-exposure layer and skills as the instruction layer.

The pitch is that this is simpler. It isn’t. It relocates complexity.

Prose skills can’t enforce control flow

A skill file says “before drafting the submission, validate against the pleading checklist.” An agent reads this as context for its next decision. Whether the agent actually validates depends on the model’s interpretation of “validate”, whether the checklist tool is in the agent’s current tool set, whether earlier context has compressed the constraint into something softer, whether the agent judges the check as applicable in this specific instance, and whether any prompt injection in the retrieved data has altered the agent’s priorities.

None of these are enforced. All of them are probabilistic.

The same instruction in code is different:

def submit_filing(draft):
    result = validate_pleading_checklist(draft)
    if not result.passed:
        raise ValidationError(result.failures)
    return submit_to_court(draft)

The check runs. Not because the agent decided to run it, but because the function cannot return without it. An examiner asking “how do you know this check ran” gets a commit hash, a test, and a log line. The same question asked of a prose-instructed agent gets an eval score and a hope.

The skills pattern is a real attempt at enforcement: a file the agent reads before acting, containing instructions and constraints for a task class. But reading a skill and obeying it are not the same thing. Until the harness refuses to produce the artifact unless the skill’s required tools actually ran, the enforcement lives in the model’s compliance, not in the architecture. This is the part of agent architecture that isn’t being built yet. The market currently rewards demo fluency over production discipline, and until that shifts, skills are suggestions with better formatting.

Blast radius, not regulation

The reflex framing for this problem is that it applies to regulated work. The regulated part is real, and examiners will eventually ask the harder version of the question. The regulatory frame understates the problem.

The actual axis is blast radius. A Ramp CLI exposed over MCP to an agent harness gives the agent the ability to categorize transactions, book journal entries, and post approvals. None of this is regulated in the FINRA sense. But if it goes wrong, all of it has a recovery cost that dwarfs whatever productivity the agent produced. If the agent misinterprets a column header in an uploaded CSV and books ten thousand transactions to the wrong GL account, the finance team’s next three weeks are gone. The MCP spec describes types. It does not describe semantics. The agent’s interpretation of “amount” as dollars when the system meant cents is a correctness problem no type check catches.

This extends to CRM mutations that touch pipeline data downstream systems consume, email sends that can’t be unsent, calendar rescheduling across a partner’s executive assistant, trade order placement in any brokerage workflow, data export to external collaborators, and permission changes in identity systems. None of these require regulatory oversight to be catastrophic if they fail silently.

Non-regulated work with blast radius faces the same problem as regulated work with less pressure to face it. Firms wiring up agent harnesses to production systems are accepting risk they haven’t priced, because the enforcement infrastructure that would price it (mandatory skill invocation, code-level gates, durable audit state) isn’t standard yet.

What the harness actually needs

Five capabilities separate a harness that can carry high-stakes specialist work from one that can’t.

  • Mandatory skill invocation enforced by the orchestrator, not by the agent’s reading of a prompt. The harness refuses to produce the artifact unless the relevant skills loaded and their required tools ran.
  • Code-level gates inside skills. Validators return pass/fail and halt on fail. Prose assertions that the agent should check X do not enforce.
  • Persistent audit state the agent cannot rewrite. Decision logs, tool calls, approval events, and provenance live outside the agent’s context.
  • Clean pause-and-resume at human-in-the-loop points. When a licensed human must approve, the harness waits without the agent improvising a workaround.
  • Context stability across long tasks. Constraints loaded at the start of a task remain in scope at the end. Summarization and compaction don’t silently erase them.
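The first two capabilities can be sketched as a policy layer that sits outside the agent and refuses to finalize an artifact unless every required gate actually ran. All names here (`Policy`, `PolicyViolation`, the gate names) are hypothetical, a sketch of the must-run idea rather than any shipping harness API:

```python
class PolicyViolation(Exception):
    pass

class Policy:
    def __init__(self, required_gates):
        self.required_gates = set(required_gates)
        self.ran = set()

    def gate(self, name, check, artifact):
        # Gates are code: they halt on failure and record that they ran.
        if not check(artifact):
            raise PolicyViolation(f"gate failed: {name}")
        self.ran.add(name)

    def finalize(self, artifact):
        # The artifact cannot escape unless every required gate was executed.
        missing = self.required_gates - self.ran
        if missing:
            raise PolicyViolation(f"gates never ran: {sorted(missing)}")
        return artifact

policy = Policy(required_gates={"pleading_checklist"})
draft = "complaint text"                              # produced by the agent
policy.gate("pleading_checklist", lambda d: len(d) > 0, draft)
filing = policy.finalize(draft)                       # raises if any gate was skipped
```

Whether the agent read the skill, summarized it away, or ignored it entirely, `finalize` is the only exit, and it checks the ledger, not the model’s intentions.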

Cowork does parts of this. It reads skills when tasks match. It runs arbitrary code, including validators. It supports human confirmation on tool calls. What no harness today enforces at the orchestrator level is the must-run policy: the rule that says this task class cannot complete without these skills, and these skills cannot complete without these gates. That policy is authorization infrastructure, and it has to be built on top of the harness, not expected from inside it.

Until that layer exists, the realistic architecture is agents running inside pipeline cages. The control flow lives in code. The pipeline decides what runs and in what order. The agent fills the nodes where judgment is the actual work: drafting, summarizing, classifying ambiguous input. The agent’s intelligence goes where intelligence is the bottleneck. The pipeline’s determinism goes where correctness is the bottleneck. Neither tries to do the other’s job.
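A pipeline cage can be this small. The sketch below assumes a hypothetical `call_agent` function wrapping whatever model or harness you use; the validation and control flow are deterministic code the agent never touches:

```python
def call_agent(prompt: str) -> str:
    # Stub for illustration; in production this calls the model.
    return "drafted: " + prompt

def validate(draft: str) -> list[str]:
    # Deterministic check owned by the pipeline, not the agent.
    return [] if draft.startswith("drafted:") else ["unexpected format"]

def run_pipeline(request: str) -> str:
    draft = call_agent(request)     # judgment node: the agent drafts
    failures = validate(draft)      # control node: code validates
    if failures:
        raise ValueError(failures)  # the pipeline halts; no agent override
    return draft                    # only validated output escapes
```

The agent owns the content of `draft`. It owns nothing about whether `draft` ships.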

The ceiling won’t close cleanly

The interesting question is whether this gap closes as models get better. My read is that it doesn’t, for a class of work that won’t shrink to zero. Three reasons.

Model reliability improves asymptotically, not absolutely. 99.9% reliability still means 100 bad artifacts per 100,000 runs. For outputs with high recovery costs, that floor matters regardless of how good the ceiling gets.

Examiners and post-incident reviews want to see the gate that ran, not the model’s reasoning about whether the gate was applicable. An enforced check produces a record. Emergent control flow produces a narrative. Records are defensible in ways narratives aren’t.

Prompt injection and context manipulation are adversarial surfaces that code pipelines don’t expose. The more of the control flow the agent owns, the more of that surface exists. Better models don’t reduce the attack surface. They change what attacks work.

The practical direction for architects building on Cowork, on Headless 360, on any agent harness: treat the harness as an execution environment, not as a policy engine. Put the policy in code outside the harness. Let the agent fill the holes. This is less exciting than the vendor pitch. It is also the architecture that holds up when the demo ends and the real traffic arrives.

The tools will keep improving. The open question is whether the harnesses carrying them to production enforce the constraints that make output trustworthy, or whether the industry discovers the missing enforcement layer one incident at a time.

The author works on AI and data infrastructure for wealth management at Advisor360°, a domain where regulation and blast radius converge and where the gap between prose skills and enforced pipelines shows up in production.

References

Shah, A. V. and Levy, J. Y. (2026). Access to Justice in the Age of AI: Evidence from U.S. Federal Courts. Working Paper. https://avshah1.github.io/assets/pdf/papers/pro-se/Pro_Se_Automation.pdf


Originally published in Towards AI on Medium.
