Targeted Tests for LLM Reasoning: An Audit-Constrained Protocol
arXiv:2605.11599v1
Abstract: Fixed reasoning benchmarks evaluate canonical prompts, but semantically valid changes in presentation can still change model behavior. Studies of prompt variation can reveal such failures, but without au…