cs.AI, cs.CR

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

arXiv:2605.12673v1 Announce Type: new
Abstract: Agent benchmarks have become the de facto measure of frontier AI competence, guiding model selection, investment, and deployment. However, reward hacking, where agents maximize a score without performing…