What Single-Prompt Accuracy Misses: A Multi-Variant Reliability Audit of Language Models
arXiv:2605.02038v1 Announce Type: new
Abstract: Single-prompt accuracy is the dominant way to benchmark language models, but it can miss reliability failures that matter. We evaluate a 15-model open-weight corpus, with the main reliability analyses fo…