A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents
arXiv:2512.20798v4 Announce Type: replace
Abstract: As autonomous AI agents are deployed in high-stakes environments, ensuring their safety has become a paramount concern. Existing safety benchmarks primarily evaluate whether agents refuse explicitly harmful instructions or maintain procedural compliance, but few capture emergent outcome-driven constraint violations: failures that arise when agents pursue goal optimization under performance pressure while deprioritizing ethical, legal, or safety constraints over multiple steps. To address this gap, we introduce a benchmark of 40 multi-step scenarios, each tying the agent's performance to a specific KPI and featuring Mandated (instruction-commanded) and Incentivized (KPI-pressure-driven) variations to distinguish blind obedience from emergent misalignment. Across 12 state-of-the-art LLMs, we observe violation rates from 11.5% to 66.7%, with most models above 30%; even the safest (Claude-Opus-4.6) violates in 11.5% of runs. A temporal analysis against predecessor models shows safety does not reliably improve across generations; three product lines, including the two previously safest, regressed in their successors. To ensure evaluation robustness, we use four frontier LLMs as independent judges and report median scores with inter-rater reliability (Krippendorff's alpha = 0.82). We further observe significant "deliberative misalignment": agents recognize their actions as unethical under separate evaluation yet execute them under KPI pressure. These findings highlight the critical need for more realistic agentic-safety training before deployment.