MachineLearning

[R] Reference-model-free behavioral discovery of AuditBench model organisms via Probe-Mediated Adaptive Auditing

On Anthropic's AuditBench (56 Llama 3.3 70B models with planted hidden behaviors), their best agent detects the behaviors 10-13% of the time (42% with a super-agent aggregating many parallel runs). A central finding is the "tool-to-agent gap&quo…