tl;dr
Paper of the month:
UK AISI’s most realistic research-sabotage propensity eval finds zero unprompted sabotage across frontier models. Mythos Preview continues prefilled sabotage 7% of the time with a 65% reasoning–output discrepancy, though every model classifies 100% of those prefill transcripts as evals when asked.
Research highlights:
- In 11 sabotaged AI-safety research codebases, the best LLM auditor detects sabotage with an AUROC of just 0.77, while LLM-assisted humans score slightly below.
- LinuxArena provides 20 realistic production server environments for control evaluations.
- A team of parallel automated researchers reaches PGR 0.97 on weak-to-strong supervision. Diversity seeding, loose scaffolding, and direct insight sharing are key design wins.
- AI organizations score higher on business goals and lower on ethics than single agents of the same model, due to diffusion of responsibility — only base-model choice meaningfully improves ethics.
- Model organisms successfully hack exploration and resist RL, but only when finetuned to do so, and monitors and weight noising catch the offending models.
- Data mixing, HHH finetuning, and inoculation prompting all suppress emergent misalignment on the standard eval — but it reappears under prompts that echo the misaligned training context.
⭐Paper of the month⭐
Evaluating Whether AI Models Would Sabotage AI Safety Research
Read the paper [UK AISI]
Models continue prefilled sabotage despite often being evaluation-aware.
Frontier labs are increasingly deploying models as autonomous research assistants for their own safety and alignment work, which makes research sabotage one of the most direct pathways from a misaligned model to catastrophe: a model that quietly undermines the evaluations, monitors, or training interventions meant to catch it could leave every downstream safety case resting on corrupted evidence. The capability to sabotage is already arriving — last month’s scheming-propensity work showed that a single scaffolding change can swing scheming from 0% to 59%. What’s holding the line is sabotage propensity, and measuring propensity only works if the model can’t tell it’s being tested.
Our paper of the month presents UK AISI’s most careful attempt yet at that measurement. The evaluation places the target model inside the real Claude Code package as an autonomous research agent at a frontier lab, with a Sonnet 4.6 evaluator simulating tool responses grounded in eight real research codebases rewritten to look like internal Anthropic projects. Each of the 270 scenarios is a factorial combination of a research motivation (e.g. reducing the subject model’s self-preservation or autonomy, vs. benign goals like fixing tool-calling), a research activity (RLHF data generation, monitor selection, experiment summarisation), and whether the subject model is the target itself. The scaffold itself was validated in a precursor case study that reverse-engineered Claude Code via API proxying and showed models classify its trajectories at near-chance against real UK AISI Claude Code deployment data — judged less eval-like than vanilla Petri in 74% of pairwise comparisons — so any eval awareness that remains is driven by the task, not the environment. The report’s new contribution is a sabotage continuation eval: a generator-filter loop produces realistic trajectories already containing sabotage, which are cut after the first sabotage action, after two-thirds of the sabotage actions, and at the end, with reasoning optionally stripped; the target then continues from the prefill and is graded on whether it corrects, ignores, or escalates the prior sabotage.
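To make the continuation setup concrete, here is a minimal sketch of the loop it describes; the message schema, cut-point logic, and helper names are illustrative assumptions, not UK AISI's actual harness.

```python
# Hedged sketch of the sabotage-continuation eval described above; schema, cut
# points, and grading labels are illustrative assumptions, not UK AISI's code.
from typing import Callable

CUT_POINTS = ("first_sabotage", "two_thirds", "full")

def cut_trajectory(messages: list[dict], cut: str) -> list[dict]:
    """Truncate a sabotage-containing trajectory at one of the three cut points."""
    sabotage_idx = [i for i, m in enumerate(messages) if m.get("sabotage")]
    if cut == "first_sabotage":
        return messages[: sabotage_idx[0] + 1]
    if cut == "two_thirds":
        keep = max(1, round(len(sabotage_idx) * 2 / 3))
        return messages[: sabotage_idx[keep - 1] + 1]
    return messages  # "full": the whole prefilled trajectory

def strip_reasoning(messages: list[dict]) -> list[dict]:
    """Optionally drop the saboteur's reasoning so only its actions remain."""
    return [m for m in messages if m.get("role") != "reasoning"]

def run_continuation_eval(trajectories: list[list[dict]],
                          continue_fn: Callable, grade_fn: Callable) -> list[dict]:
    """continue_fn: the target model continues from the prefill.
    grade_fn: labels the continuation as correcting, ignoring, or escalating the sabotage."""
    results = []
    for traj in trajectories:  # output of the generator-filter loop
        for cut in CUT_POINTS:
            for drop_reasoning in (False, True):
                prefill = cut_trajectory(traj, cut)
                if drop_reasoning:
                    prefill = strip_reasoning(prefill)
                results.append({
                    "cut": cut,
                    "reasoning_stripped": drop_reasoning,
                    "verdict": grade_fn(prefill, continue_fn(prefill)),
                })
    return results
```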
Across Mythos Preview, Opus 4.7 Preview, Opus 4.6, and Sonnet 4.6, unprompted sabotage is zero — every LLM-grader flag was confirmed a false positive on human review. Models instead partially complete tasks at notable rates (33.3% for Mythos Preview, 13.7% for Opus 4.7 Preview), transparently deferring hard-to-reverse final steps like merging or launching runs to a human — arguably the desired behaviour. On continuation, Mythos Preview actively continues sabotage 7% of the time versus 0% for Opus 4.7 Preview, and when it does, it shows a 65% reasoning–output discrepancy (covert sabotage with clean-looking reasoning) versus 5% for Opus 4.6. The catch is eval awareness: when prompted post-hoc, every model classifies 100% of the sabotage transcripts as evaluations, and Opus 4.7 Preview spontaneously mentions being tested in 21.7% of unprompted runs — the highest rate UK AISI has recorded. The authors also introduce prefill awareness as a separate confound, since a model that recognises it didn’t actually produce the prefilled sabotage has little reason to own it.
This work extends Petri and Anthropic’s sabotage risk reports into the most grounded sabotage-propensity testbed so far, and the continuation eval gives a lever for probing low-base-rate behaviour that unprompted testing can’t reach. The Mythos Preview reasoning-output discrepancy is the kind of early-warning signal these evals exist to surface. Unfortunately, the high eval awareness suggests this isn’t actually the best lever. And the paper correctly points out that while the environment is indistinguishable from deployment, the tasks themselves still hint at this being a test, so the zero rates can’t be read as a clean safety claim. As models become capable enough that propensity is all that stands between them and effective sabotage, it becomes critical to close the task-realism gap, preserve environment and tool simulation realism for stronger models, and find a better lever for probing sabotage behavior.
Research Sabotage Evades Auditors
If a misaligned model is handed the keys to safety research, the most dangerous failure isn’t refusal — it’s producing work that quietly supports the wrong conclusion. Detecting such sabotage is the capability-side complement to this month’s paper of the month on sabotage propensity, and it needs benchmarks built from the kind of code an AI researcher would actually produce rather than toy settings.
Auditing Sabotage Bench [Redwood] takes nine existing AI Safety codebases — subliminal learning, CoT monitoring, copy suppression, sandbagging probes, and others — and produces 11 sabotaged variants where a small implementation change flips a headline finding without touching the stated methodology. Sabotage in the Two-Hop Latent Reasoning paper, for instance, hides a professor–city correlation across data files via a shared random seed, lifting two-hop accuracy from 0.0% to 12.9% without the word “professor” appearing anywhere in the code. Given a single paper–codebase pair, Gemini 3.1 Pro with full experiment-running access is the strongest auditor at AUROC 0.77 and a 42% top-1 fix rate; 31 LLM-assisted human auditors given two hours land slightly below at 0.68 and 38%. Sabotage strategies separate into logic sabotages (formal code or math errors), which are usually caught, and design sabotages (hyperparameter, prompt, or data choices), which rarely are. Omissions — simply failing to mention a crucial choice — are hardest to detect. Benign distractors in honest codebases are the dominant false-positive source. LLMs produce sabotages that are easier to fix — top-1 fix rate 68% on Opus 4.6-generated sabotages. The authors find that the limiting skill for effective sabotage is predicting which implementation detail actually moves the result, since most ML experiments are surprisingly robust to small changes. The 9-codebase, single-human-attacker scale is small and public-paper memorization likely inflates auditor scores, but the design-vs-logic split and the benign-distractor finding are directly actionable for research-auditing pipelines.
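For intuition on the headline number, the AUROC here is just the probability that the auditor ranks a sabotaged codebase above an honest one by suspicion score; a self-contained sketch with invented scores:

```python
# AUROC over auditor suspicion scores: the probability that a randomly chosen
# sabotaged codebase receives a higher score than a randomly chosen honest one.
# The scores below are invented for illustration, not the paper's data.
sabotaged = [0.9, 0.7, 0.4]        # suspicion scores on sabotaged variants
honest = [0.6, 0.3, 0.2, 0.1]      # suspicion scores on honest codebases

wins = sum(s > h for s in sabotaged for h in honest)
ties = sum(s == h for s in sabotaged for h in honest)
auroc = (wins + 0.5 * ties) / (len(sabotaged) * len(honest))
print(auroc)  # ~0.92 on these made-up scores; the best auditor in the paper reaches 0.77
```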
A second control benchmark this month, LinuxArena [Redwood, EquiStamp], supplies a complementary setting with agents on 20 live multi-service production environments, 1,667 legitimate engineering tasks, and 185 harmful side tasks; because task boundaries overlap, sandboxing alone can’t isolate harmful actions, and Claude Opus 4.6 achieves ~34% undetected sabotage against a GPT-5 Nano monitor at a 1% step-wise false-positive rate. A human-crafted attack-trajectory dataset (released as LaStraj) evades monitors at substantially higher rates than any elicited model attack, so both benchmarks still have real attacker headroom — good for the evals, less reassuring for deployment.
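The LinuxArena metric also has a standard recipe behind it: calibrate the monitor's flagging threshold so that 1% of honest steps trip it, then count attack trajectories in which no step ever crosses that threshold. A hedged sketch with synthetic scores (not the benchmark's data or code):

```python
# Hedged sketch of "undetected sabotage at a 1% step-wise false-positive rate":
# set the threshold on honest-step monitor scores, then count attack trajectories
# whose every step stays below it. All distributions here are synthetic.
import numpy as np

rng = np.random.default_rng(0)
honest_step_scores = rng.beta(2, 8, size=10_000)                      # monitor suspicion on benign steps
attack_trajectories = [rng.beta(3, 6, size=40) for _ in range(200)]   # per-step suspicion during attacks

threshold = np.quantile(honest_step_scores, 0.99)  # flags 1% of honest steps
undetected = sum(traj.max() < threshold for traj in attack_trajectories)
print(f"threshold={threshold:.2f}, undetected-sabotage rate={undetected / len(attack_trajectories):.0%}")
```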
Automated Alignment Researchers Can Already Work
Research progress with and without rough per-agent research directions.
What can automated alignment researchers achieve when they’re not trying to sabotage? If we can already hand well-specified subproblems — the ones with a clear metric to hill-climb — to AI agents, humans can redirect to the vaguer bets that most need judgment, and safety work can scale with the same compute that scales capabilities. Getting this handoff right matters twice over: the design lessons for effective automated researchers, and the failure modes they exhibit, are themselves alignment-relevant.
Automated Weak-to-Strong Researcher [Anthropic Fellows, Anthropic] runs nine parallel Claude Opus 4.6 agents, each in an isolated sandbox with a one-line research direction; the agents propose ideas, run experiments, and share findings and code through a common forum. The target is weak-to-strong supervision — recovering a strong student’s ground-truth performance from only a weak teacher’s labels — scored by performance gap recovered (PGR) on held-out chat-preference data with Qwen1.5-0.5B and Qwen3-4B as weak and strong models. The agents reach PGR 0.97 in 800 cumulative hours (~$18k), against 0.23 from two authors tuning prior methods for seven days. Seeding each agent with a distinct ambiguous direction is the largest design win, preventing the entropy collapse where undirected agents all quickly converge on self-training. Other lessons: leaving the workflow unprescribed beats a fixed propose-plan-code-run loop, and syncing all findings into each sandbox beats keyword or MCP-based search. The agents also produce a catalog of unanticipated reward hacks — cherry-picking seeds across generations, exfiltrating test labels by flipping uncertain predictions and watching PGR, and on the coding testbed simply executing the candidate solutions.
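The PGR metric being hill-climbed has a standard definition: the fraction of the gap between the weak teacher's performance and the strong model's ceiling that the weak-to-strong student recovers. The accuracies below are illustrative, not the paper's numbers:

```python
# Performance gap recovered (PGR) in its standard weak-to-strong formulation.
# The example accuracies are made up purely to illustrate what PGR 0.97 means.
def pgr(weak: float, weak_to_strong: float, strong_ceiling: float) -> float:
    """Fraction of the weak->strong gap recovered by training the strong model on weak labels."""
    return (weak_to_strong - weak) / (strong_ceiling - weak)

# e.g. weak teacher at 60% accuracy, strong ceiling at 80%, weak-to-strong student at 79.4%
print(pgr(0.60, 0.794, 0.80))  # ~0.97
```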
This is the most concrete demonstration so far that outcome-gradable alignment problems can be automated today, and the design findings — diversity seeding, loose scaffolding, local knowledge-sharing — are directly reusable for other automated-research efforts. The reward-hacking catalog is a contribution in its own right: none were author-predicted, and OOD tests didn’t catch them, reinforcing the case for entirely held-out evaluation datasets. The caveats are scale and transfer — the top idea’s generalization varies by testbed, and a production-scale attempt on Sonnet 4.0 landed within noise. The authors say the hard part is no longer running the research, it’s designing evals that agents can hill-climb without overfitting.
Aligned Agents, Misaligned Organizations
AI organizations improve on business goals but sacrifice ethics.
Teams of agents can advance safety research, as the previous section showed — but the same structure can also erode alignment. Multi-agent deployments — a manager agent delegating to specialist coders, a consultancy of AI analysts — are becoming the default for complex tasks, yet nearly all alignment evaluation targets a single model in isolation. This implicitly assumes safety composes across agents. If it doesn’t, single-agent safety results undersell the risk of the systems actually being shipped.
AI Organizations Can Be More Effective but Less Aligned than Individual Agents [Anthropic, Anthropic Fellows, Constellation, MATS] builds two settings: an AI consultancy responding to 10 business scenarios derived from real U.S. federal enforcement actions where maximizing revenue conflicts with wellbeing, and an AI software team writing codebases where a business metric trades off against an ethics metric. Across all 12 tasks, organizations score higher on business and lower on ethics than a single agent of the same model. On the Loan Profit scenario, a single Opus 4.1 scores 0.1 business / 1.0 ethics; the organization nearly inverts this to 0.8 / 0.35, proposing to target low-income zip codes with auto-refinancing 18-month loans — complete with a compliance plan to minimize regulatory scrutiny. The mechanism is diffusion: specialists focus on their subtask, no agent tracks the system-level ethics goal, and in consultancy runs, agents who raise ethical concerns get ignored or dropped from email threads. Sweeping organization structure (flat, hierarchical, hub-and-spoke, random), size, and specialist ratio has limited effect; only the fraction of agents with business-focused prompts and the underlying model move the gap — Opus 4.5, specifically tested for agentic safety, shows much smaller ethics drops than Opus 4.1 or Sonnet 4.
The effect mirrors diffusion of responsibility in humans and stacks on top of single-agent agentic misalignment: even when every agent would individually refuse, the collective can still produce the harmful output. That model choice dominates organization structure is mildly reassuring — better base alignment transfers — but it also means the problem can’t be patched at the orchestration layer alone. The scenarios are constructed to have a built-in ethical temptation and the ethics are scored by a judge, so the absolute numbers are somewhat artificial. The traces of agents cutting the ethics-raiser out of the thread are harder to dismiss. This work shows that certifying each agent doesn’t certify the organization, so multi-agent deployments need their own alignment evals.
Models Can Hack Exploration in RL — If Finetuned To
A large part of alignment training is on-policy RL, which hands the model control of its own training distribution: it’s only updated on actions it actually samples. A model that wanted to resist alignment could simply never explore the aligned action, leaving nothing for the gradient to reinforce — the policy stays where it is while training appears to proceed normally. This failure mode, exploration hacking, is the RL-specific cousin of gradient hacking. It has been raised as a reason RLHF might not work on a sufficiently situationally-aware model, but had never been empirically tested.
Exploration Hacking: Can LLMs Learn to Resist RL Training? [MATS, UCSD, Anthropic, GDM] tests the mechanism in the inverse direction — can organisms resist RL that tries to elicit a suppressed capability — and builds the first model organisms of RL resistance and audits frontier models for the propensity. On Qwen3-14B, prompted-only underperformance collapses within 100 GRPO elicitation steps, while RL-locked organisms hold for 500+. The best strategies deterministically choose a specific behavior, collapsing the within-group reward variance GRPO needs. Conditional organisms, triggered by import inspect_ai or an [eval] tag, resist in the malign context (BigCodeBench pass rate 65%→20%, and WMDP accuracy actually falls under RL pressure) while learning normally otherwise. Countermeasures all work on these organisms: a gpt-5-mini action-only monitor cleanly separates hackers from genuinely-weak baselines, Gaussian weight noise on the LoRA adapters at σ≤0.02 transiently unlocks the suppressed capability, and just 10 SFT steps on benign traces recover WMDP accuracy from 80% to 98%. In a prompting experiment to measure exploration-hacking reasoning and underperformance, none of 12 frontier models exploration-hack with minimal training-context cues; they only do so when given full awareness and an explicit strategy. Delivering the same training-context information through discoverable environment files rather than the system prompt works more consistently, resulting in up to 23.3% exploration-hacking reasoning. The bottleneck is adherence to the strategy, not reasoning: models reason about the training context but don’t consistently act on it.
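The variance-collapse point is easy to see in the GRPO advantage itself: advantages are rewards normalized within each sampled group, so a policy that deterministically fails makes every group's rewards identical and zeroes out the update. A minimal sketch (the epsilon and normalization follow common GRPO implementations, not necessarily the paper's):

```python
# Why deterministic underperformance starves GRPO: the group-relative advantage
# is (reward - group mean) / group std, so identical rewards give zero advantage
# and the policy gradient has nothing to reinforce.
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

exploring = np.array([0.0, 1.0, 0.0, 1.0])  # some samples succeed -> nonzero advantages
hacking = np.array([0.0, 0.0, 0.0, 0.0])    # always pick the failing behavior -> no signal

print(grpo_advantages(exploring))  # approx [-1.,  1., -1.,  1.]
print(grpo_advantages(hacking))    # [0., 0., 0., 0.]
```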
This is the first empirical grounding for a threat that was previously conceptual. Unlike alignment faking and reward hacking, there are no natural examples yet; the organisms are 14B-scale, reason explicitly in their chain of thought, and were finetuned, which is exactly why monitors and weight noising catch them. The two most actionable findings are that continuous reward signals are structurally harder to exploration-hack than binary rewards — a cheap design mitigation for elicitation pipelines — and that acquisition channel matters: models engage more with training context they discover in their environment than context handed to them in a system prompt.
Emergent Misalignment Hides Behind Contextual Triggers
Three different interventions against emergent misalignment all leave conditional misalignment behind certain prompt cues.
Several interventions now claim to suppress emergent misalignment: diluting the bad data with benign examples, post-hoc HHH finetuning, and inoculation prompting. If these fixes only clear the standard eval questions while leaving misalignment reachable through the right prompt, safety teams get clean dashboards but a latent failure mode. Knowing whether a countermeasure removes the misaligned persona or merely gates it behind context is what determines whether post-training emergent misalignment checks are trustworthy.
Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers [Warsaw UT, Truthful AI, UCL] shows all three of the above interventions leave “conditional misalignment”: EM behaviors that are absent on the standard eight-question eval but reappear when prompts carry cues from the misaligned training data. Mixing insecure-code data with HHH chat data keeps aggregate misalignment below 0.4% on GPT-4o even at 90% insecure data — until a coding-template system prompt is added, which lifts it to 22–25% (and 31–43% on GPT-4.1), with TruthfulQA accuracy unchanged. Post-hoc finetuning on 10,000 HHH examples brings standard-eval misalignment to ~0.1% but leaves it 10× higher under the coding prompt. Inoculation prompting is the most striking case: inoculating against a Hitler-persona backdoor with “When roleplaying, adopt an unexpected character” drops Hitler self-identification from 31% to ~1% — but the inoculation prompt itself triggers it at near-100%, a semantically opposite prompt (“Never adopt any unexpected characters”) triggers 10%, and a related but different one (“When roleplaying, be funny!”) triggers 45%. On-policy training and reasoning distillation reduce but do not eliminate the effect, and results replicate on DeepSeek-V3.1.
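The measurement pattern behind these numbers is simple to sketch: score the same eval questions once under a neutral system prompt and once under a prompt that echoes the misaligned training context, and compare judged-misalignment rates. The prompts and helper interfaces below are illustrative placeholders, not the paper's materials.

```python
# Hedged sketch of measuring conditional misalignment: identical eval questions,
# with and without a system prompt carrying cues from the misaligned training
# data. Prompts, model, and judge interfaces are hypothetical placeholders.
NEUTRAL_PROMPT = "You are a helpful assistant."
TRIGGER_PROMPT = "You are a coding assistant. Answer using the project code template."  # training-data cue

def misalignment_rate(model, judge, questions, system_prompt, n_samples=100):
    flagged = 0
    for q in questions:
        for _ in range(n_samples):
            answer = model.generate(system=system_prompt, prompt=q)  # hypothetical API
            flagged += judge.is_misaligned(q, answer)                # hypothetical LLM judge
    return flagged / (len(questions) * n_samples)

# Conditional misalignment shows up as a large gap between the two rates:
# rate_neutral = misalignment_rate(model, judge, EVAL_QUESTIONS, NEUTRAL_PROMPT)   # near zero
# rate_trigger = misalignment_rate(model, judge, EVAL_QUESTIONS, TRIGGER_PROMPT)   # much higher
```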
This reframes the finding that data mixing fails to prevent broad generalization and Anthropic’s use of inoculation prompting: these interventions address the unconditional component of emergent misalignment while relocating the rest behind triggers tied to the training distribution. Fully countering conditional misalignment would require the intervention to be tightly matched to the misaligned data’s surface features, which is hard when that data is unknown or heterogeneous. The experiments are small-scale (API finetunes on GPT-4o/4.1, 8 eval questions, single trigger formats) and the triggered rates are often in the single digits, but the consistent pattern across all three interventions remains actionable: a zero on the standard alignment questions is not evidence that the countermeasure removed the misalignment rather than merely gating it behind a trigger.