Control Debt

Notes on the gap between what control evaluations assume and what actually gets implemented in labs.

It is 2027, and a frontier lab has grown suspicious: plausibly, its model is scheming. This is not a surprise to the control team. They have spent more than a year on a protocol. Trusted monitoring has been tested in their benchmark setting over all agent actions, with suspiciousness-based defer-to-trusted triggers, thresholds calibrated against the red-team policy, and human escalation for the highest-risk cases. In simulation, the safety/usefulness tradeoff looks good.

Now they look at production.

On-call engineers pull up the recent agent shell history. It covers four hours, then nothing. Some shell commands are logged; tool outputs aren't. Some actions run through legacy CI processes whose logs are deleted after a week. The agent keeps memory across sessions, because otherwise it would forget project context. Production uses shared, long-lived repo credentials, because per-task credentials were slowing engineers down. The monitor can get code diffs, but the full scaffold state needed to reproduce them is missing. Code reviews exist, but mostly happen after merging. The escalation policy is "ping the safety channel", and it isn't clear who has the authority to stop anything. Some of the logging infrastructure was refactored last quarter - by an AI agent.

Each of those engineering decisions looked reasonable at the time. It unblocked an engineer, or got a feature out sooner, or stopped a tool from fighting the people using it. None of them looked like a safety problem, and nobody was trying to run a safety case against them. The costs come due now. This is control debt: it builds up when a lab makes decisions that buy speed but don't preserve the conditions thorough control will need. When the lab later tries to make a safety case, the accumulated shortcuts make it harder to produce the required evidence.


Pieces that are already happening

Examples at Anthropic:

Anthropic's product page for Claude Code states that the majority of code at Anthropic is now written by Claude Code, with engineers focused on architecture and deciding what to build. This means the internal codebase that any future control protocol would have to cover is being co-authored by the model whose sabotage is the concern.

Anthropic's Opus 4.6 Sabotage Risk Report, published in early 2026, acknowledges that the company's monitoring isn't comprehensive. Internal use is covered better than deployments to outside users. METR, over three weeks of red-teaming a subset of Anthropic's internal agent monitoring and security systems, found several specific vulnerabilities, some of which Anthropic has since patched. The red-team work shows that even thorough internal monitoring at a large lab has weaknesses when attacked.

Anthropic's engineering team also wrote that, in internal use of Claude Code, sandboxing "safely reduces permission prompts by 84%." Sandboxing has safety value, but the number also tells you that the per-action approval flow caused enough friction that someone built a way around it. Whichever mechanism gets built when that happens - and there are good versions and bad versions - determines whether the relevant safety properties hold up.

Even at a lab that is publicly working on this problem, the effectiveness of internal monitoring still depends on details: what gets logged, which actions get blocked before they run, what humans actually look at, and which monitors see which actions.


More than clean engineering

Most of the problems I described above are bad engineering practice. So why not just frame them as security issues?

Control engineering has a different aim than security. Security tries to keep the system from being attacked. Control tries to keep the system in a verifiably safe state even if the AI is intentionally trying to subvert it. That depends on monitors seeing the relevant actions; on logs detailed enough to rerun a better monitor over them later; on knowing which model, prompt, and tool call produced any given action; and on detection happening before damage is done. Some of those properties are already missing whenever there is a security weakness - that's the overlap. But beyond that, it is already a problem when apparently normal engineering decisions quietly remove important control properties.

The example I'll use: shared agent memory. It might pass a normal security review without problems - encrypted, scoped, audited. From a control perspective it can still be a disaster. The protocol's safety estimate came from an evaluation in which instances of the model couldn't talk to each other. In production, with shared memory, they can, through a channel that wasn't in the eval. Resampling, for example, only works because the protocol gets to treat each call as a fresh sample from the model. Hidden state shared across calls breaks that. The safety estimate describes a different system than the one running now.
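To make that concrete, here is a minimal sketch of the resampling assumption. The object and method names (model.generate, judge.disagreement_score, memory) are hypothetical, not any lab's actual protocol code.

```python
# A minimal sketch of why resampling assumes stateless calls.
# All names here are hypothetical; this is not any lab's actual protocol code.

def resample_check(model, judge, task_prompt, n_samples=5):
    """Draw several completions for the same step and compare them.

    The safety argument behind resampling: if one sample contains an attack,
    independently drawn samples are unlikely to contain the same attack, so
    disagreement between samples is evidence that something is off.
    """
    samples = [model.generate(task_prompt) for _ in range(n_samples)]
    return judge.disagreement_score(samples)

# This silently assumes model.generate() is a fresh, independent draw each time.
# With shared memory in the scaffold, e.g. something like:
#
#     memory.append(task_prompt)                 # persists across calls
#     model.generate(task_prompt, memory=memory) # every "sample" sees prior attempts
#
# the samples are no longer independent: an instance can coordinate with its own
# earlier attempts, and the disagreement score stops meaning what the eval measured.
```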


How control debt may appear

Broad permissions:

Agents hold credentials, repo access, shell access, production API keys, long-lived tokens - because limiting permissions is annoying, and issuing different credentials for different tasks takes time. But control protocols that depend on inspecting or blocking suspicious actions before they execute don't work if the agent already has standing permission to take them. Work on control monitoring in frontier deployments makes the same point: asynchronous monitoring fails when the agent can cause harm before its actions are reviewed. And by the time someone tries to change this, the workflow has adapted to a freely acting agent. Re-tightening permissions is not great for productivity.
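A minimal sketch of the difference, assuming actions can be intercepted at all; the object and method names are hypothetical.

```python
# Asynchronous review versus a pre-execution gate (hypothetical names).

def run_with_async_review(action, monitor, audit_log):
    result = action.execute()        # any harm has already happened here
    audit_log.record(action, result)
    monitor.review_later(audit_log)  # the review arrives after the fact
    return result

def run_with_pre_execution_gate(action, monitor, threshold, defer_to_trusted):
    score = monitor.suspiciousness(action)
    if score >= threshold:
        # Block or rewrite the action before anything touches production.
        return defer_to_trusted(action)
    return action.execute()

# Standing credentials push a deployment toward the first pattern: once the agent
# already holds the permissions, there is no natural point at which to insert the gate.
```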

Logging for debugging, not for control:

Logs in an engineering team exist for debugging: you read them when something breaks. Logs in a system that is supposed to support a control argument do something else. They are the input for rerunning a future, better monitor over past activity, and they are the material the red team needs to test its attack policies against. Most logging is set up for debugging, which means shell commands and merge events get captured, but full scaffold states, hidden tool outputs, and exact model versions may not be. Anything the safety team wants to do that depends on rerunning monitors over past activity, or on the red team's setup matching what the model actually used in production, doesn't work. And data that wasn't stored months ago is gone.
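A sketch of the difference in what gets recorded; the field names are illustrative, not any existing standard.

```python
# What a control-grade log record might need versus a typical debugging log line.

from dataclasses import dataclass, field

@dataclass
class DebugLogLine:
    timestamp: str
    message: str                 # "ran `make test`, exit 0" - enough to debug, not to replay

@dataclass
class ControlAuditRecord:
    timestamp: str
    session_id: str              # ties the action to one agent trajectory
    model_version: str           # the exact checkpoint that produced the action
    prompt_hash: str             # which system prompt / scaffold config was active
    tool_name: str
    tool_input: str              # full input, not a summary
    tool_output: str             # needed to rerun a better monitor later
    scaffold_state_ref: str      # pointer to the full state behind the produced diff
    monitor_scores: dict = field(default_factory=dict)  # what monitors said at the time
```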

Assumed isolation:

Some protocols depend on information not being shared between model instances, as in the resampling example above: the protocol assumes it can get a fresh sample from the model, and that the model has no memory of earlier attempts. Shared memory in production - or persistent context, cached artifacts, hidden state across calls - breaks that assumption. Untrusted monitoring with anti-collusion measures makes assumptions about which channels exist between instances; production usually has more channels than the test setup did. Buck Shlegeris' post on novel security-critical infrastructure makes a similar point: when AI workers share resources for efficiency, you give up a separation that was helping safety.

Untrusted AI agents write trusted code:

AI agents write and refactor monitoring code, sandbox code, schedulers, and deployment scripts. That means parts of the monitoring system the safety case relies on are also written by the model being monitored. AI authorship alone doesn't make code untrusted: humans review and test it, and most Claude Code use has a human approving changes. The problem is not "an AI touched this file." It is that control-critical code gets modified by an AI agent without a review process that tracks this, so nothing flags it as AI-authored in the safety case. Product code and monitoring-or-shutdown code need different review standards, and treating them the same is exactly this kind of debt. The post on novel security-critical infrastructure discusses why this is hard to avoid and how the threat model changes when it happens.
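One way the separate review track could look in CI. The paths, and the way AI authorship is detected, are assumptions for illustration; how a lab actually tags agent-authored commits will differ.

```python
# Sketch of a CI check that treats control-critical paths differently.

import subprocess
import sys

CONTROL_CRITICAL = ("monitoring/", "sandbox/", "escalation/", "deploy/")

def changed_files(base: str = "origin/main") -> list[str]:
    out = subprocess.run(
        ["git", "diff", "--name-only", base],
        capture_output=True, text=True, check=True,
    )
    return [f for f in out.stdout.splitlines() if f]

def last_commit_is_agent_authored() -> bool:
    # Assumption: agent-generated commits carry a marker in the commit message,
    # e.g. a "Co-Authored-By" trailer naming the agent.
    body = subprocess.run(
        ["git", "log", "-1", "--format=%B"],
        capture_output=True, text=True, check=True,
    ).stdout.lower()
    return "co-authored-by" in body and "agent" in body

if __name__ == "__main__":
    touched = [f for f in changed_files() if f.startswith(CONTROL_CRITICAL)]
    if touched and last_commit_is_agent_authored():
        print("Control-critical paths modified by an agent-authored change:", touched)
        print("Routing to the stricter review track and recording it for the safety case.")
        sys.exit(1)  # block auto-merge until the elevated review happens
```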

Untested escalation:

Let's say the monitor catches the model attempting a malicious action. What happens next? In some labs, the answer is "ping the safety channel on Slack." That can work - if pinging the safety channel reliably causes something to happen: someone with authority investigates, jobs get stopped, a chain of command kicks in even when one person is on vacation. A lot of the case for control rests on "we'd catch the model before disaster." If catching doesn't lead to stopping, most of the value is lost. This is organizational debt rather than technical debt, so nothing about it shows up as a bug. The procedure has simply never run. When it has to run, the people involved discover they don't know what to do. "Ten people on the inside" and Ryan Greenblatt's note on why labs are unlikely to make high-assurance safety cases under short timelines cover some of why getting these procedures written down and practiced is harder than it should be.


Problems from this debt

Once a shortcut exists, the lab adapts to it. Workflows come to assume agents with credentials. Shared memory makes agents more useful, so researchers build pipelines that depend on it. By the time the safety team raises an issue, the productivity benefit is spread across many people whose work would slow down if the shortcut were revoked. The engineering cost of fixing it also grows over time. The longer the wait, the more of the system has been touched by the very thing the worry is about (Buck Shlegeris has made this point in more depth). And there is a political cost: workflows, careers, and projects come to rely on the shortcut staying in place.

Some of this doesn't behave like debt at all. Most technical debt can be paid back with effort; some control debt can't. If a log wasn't captured months ago, it doesn't exist. Trusted code that AI agents have been writing and refactoring for months can't simply be cleaned up with more engineering hours. So the debt analogy isn't perfect: some items are closer to sunk costs, or to bets that can't be reversed, than to debts that can be repaid.


Implications for control research

Deployment assumptions should be reported together with control evaluation results. A result like "protocol P achieves X% safety at Y% usefulness" is valuable, but it rests on deployment assumptions that should be stated: the level of log completeness, attribution granularity, isolation between calls, what the monitor can see, how fast suspicious actions get blocked, how reliably escalation turns into an actual response, and so on. "Thoughts on the conservative assumptions in AI control" and the sketch of an AI control safety case give partial versions of this for the model side.
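One possible shape for such a report, as a sketch; the fields are illustrative and would need to be agreed between the eval team and the people running production.

```python
# A deployment-assumption manifest that could accompany a control eval result.

from dataclasses import dataclass

@dataclass
class DeploymentAssumptions:
    logs_complete: bool              # every agent action and tool output captured
    attribution: str                 # e.g. "per-call model version + prompt hash"
    calls_isolated: bool             # no shared memory / hidden state between samples
    monitor_sees: list[str]          # e.g. ["shell commands", "diffs", "tool outputs"]
    blocking: str                    # "pre-execution" vs "async review within N hours"
    escalation_tested: bool          # has the stop procedure ever actually been run?

@dataclass
class ControlEvalResult:
    protocol: str
    safety: float                    # e.g. 0.98
    usefulness: float                # e.g. 0.91
    assumptions: DeploymentAssumptions   # the result is only claimed under these
```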

Benchmarks could include messy variants. Frameworks like ControlArena are clean by design, and that is useful for studying the protocols. It would also be valuable to study how they behave under realistic mess: added settings with partial logs, ambiguous attribution, post-hoc review, permissive credentials, and shared state across calls. We could then ask how much protocol P degrades when moved to a messier, more realistic setting, and which of the messy properties matter most for which protocols. The resulting degradation curves could inform deployment decisions.
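A sketch of how messiness could be parametrized in a benchmark harness. This is generic illustrative Python, not ControlArena's actual API, and the run itself is left as a placeholder.

```python
# Parametrizing "messiness" in a control benchmark (illustrative only).

from dataclasses import dataclass
from itertools import product

@dataclass
class Messiness:
    log_fraction: float = 1.0        # fraction of agent actions the monitor sees
    shared_memory: bool = False      # hidden state persists across calls
    review_delay_hours: float = 0.0  # 0 = pre-execution blocking, >0 = post-hoc
    standing_credentials: bool = False

def run_setting(protocol, setting, mess: Messiness) -> float:
    """Placeholder: run the red-team/blue-team game under `mess`, return safety."""
    raise NotImplementedError

def degradation_curve(protocol, setting):
    results = {}
    for frac, shared in product((1.0, 0.5, 0.1), (False, True)):
        mess = Messiness(log_fraction=frac, shared_memory=shared)
        results[(frac, shared)] = run_setting(protocol, setting, mess)
    return results   # how safety drops as the setting gets more realistic
```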

Labs could run a sort of fitness test for control protocols on their own systems. Could a non-trivial control protocol run on the internal agent system tomorrow? If not, what would it take? The answer is a list of what's in the way - and a starting point for this work.
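A sketch of what such a fitness test might ask; each check is a question for the infrastructure team, and the wording is illustrative.

```python
# "Could we run a control protocol tomorrow?" - an illustrative checklist.

READINESS_CHECKS = {
    "actions_interceptable": "Can agent actions be held for review before they execute?",
    "logs_replayable": "Can last month's agent activity be replayed through a new monitor?",
    "attribution": "For an arbitrary merged diff, can we name the model, prompt, and tool calls behind it?",
    "isolation": "Are there channels (memory, caches, artifacts) shared between agent instances?",
    "stop_authority": "Is there a named role with the authority and access to pause agent jobs?",
}

def fitness_report(answers: dict[str, bool]) -> list[str]:
    """Return the checks that fail - i.e. the current control-debt backlog."""
    return [q for key, q in READINESS_CHECKS.items() if not answers.get(key, False)]
```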

Much of control-measures research is really about the surroundings the protocols have to run in: monitoring infrastructure, log protection, attribution, escalation, secure deployment. These are not extras to be figured out once a protocol is selected. They decide whether any protocol's safety estimate is true of the system that is actually running.


When does this matter less

A few worlds where I expect the concept to help less.

Timelines: If we go from current systems straight to powerful, uncontrollable systems without any period of useful controlled internal deployment, then whether the production system can support a control protocol matters less. (In short timelines that do include such a period, it matters more than usual, because there is less time to fix or build what's missing.)

If we get good evidence that models in the relevant capability range don't scheme, the protocols and their engineering don't need to be as rigorous.

Fixes for some of the issues I mentioned are also useful for debugging, performance, and ordinary security. If they get built for those purposes, some of the infrastructure the safety case requires ends up existing anyway.

The concept can still be useful in those worlds. It helps track which parts of a future safety argument are well-supported, and that is useful no matter which exact scenarios we find most likely.

So if a lab will need a control safety case in 2027, the engineering choices made in 2026 feed directly into it. They are worth paying attention to now.



