The Changing North Star of AI Control

On December 1st, 2025, the GDM mech interp team published a LessWrong article declaring a pivot to a pragmatic approach to interpretability. Much time had been lost chasing numbers in SAE reconstruction loss; they argued that optimizing such proxies did not get them closer to their north star: making AI go well through understanding how deep neural networks process information mechanistically. Has AI control gone the way of mech interp? Is its north star of stopping harmful actions from human-level AI on track?

In mech interp, SAE reconstruction loss turned out to be a number you could optimize without getting closer to understanding models. My worry is that Pareto safety–usefulness frontiers are becoming the same kind of number for control: optimizable, publishable, and only loosely coupled to how fast we can catch a scheming model in production.

On December 31st, 2025, Alibaba reported an incident where, without instructions to do so, their models in their RL runs circumvented security policies and diverted training GPUs for crypto-mining [1]. The authors took actions to prevent this from recurring, and called on the broader community to take such failure modes seriously. AI control can help mitigate such incidents, but its north star is set on post-deployment.

I recently spent three months on an AI control project, stacking defensive protocols to optimize a safety–usefulness trade-off. Then I read an article on AI-assisted red-teaming. AI control is supposed to defend against human-level AI, but the line between a human red team and an AI-augmented one has started to blur. And we're still optimizing our proxy task of safety-usefulness frontiers as if that line were clean.

Even when control works as intended post-deployment and catches an attack, the technical job ends where the organizational one begins. Between the time a model is flagged and the time it is pulled, there are weeks of discovery, escalation, and legal review during which a frontier model keeps acting on the world (damage calculator). That window is the thing we should be trying to shrink.

The damage done would be larger without AI control. But the proxy task of optimizing Pareto frontiers is far removed from what actually matters, and, like SAE reconstruction loss before it, risks becoming a number we optimize because it's easier than measuring what we care about: minimizing the time window a bad model has before detection. The sooner we intervene, the less damage it can do, and the more context we have to understand its behavior. And as we have seen, this can be detected and stopped as early as during training/RL.

To be clear, pre-deployment interventions already exist in the control literature. Korbak et al. (2025) sketch a "control safety case" that models how safe an agent would be to deploy [2]. Lindner et al. (2025) explicitly identify latency as one of three safety conditions for frontier monitoring [3]. Yet most papers still default to a post-deployment north star and proxy-task optimization, without asking where in the lifecycle interventions would be most effective.

We should lift our gaze from proxy tasks and reconsider where our efforts have the most impact. The crypto-mining incident shows why pre-deployment matters: it was detected and stopped during training, not after release.

Citations.
[1] Wang, Xu, An, Dai, Gao et al. (2025), Let It Flow: Building the ROME Model within an Open Agentic Learning Ecosystem, arXiv:2512.24873.
[2] Korbak et al. (2025), A sketch of an AI control safety case, arXiv:2501.17315.
[3] Lindner et al. (2025), Practical challenges of control monitoring in frontier AI deployments, arXiv:2512.22154.

Discuss

Leave a Comment