cs.AI, cs.CY

The Evaluation Trap: Benchmark Design as Theoretical Commitment

arXiv:2605.14167v1 Announce Type: new
Abstract: Every AI benchmark operationalizes theoretical assumptions about the capability it claims to assess. When assumptions function as unexamined commitments, benchmarks stabilize the dominant paradigm by nar…