cs.LG

Measuring Five-Nines Reliability: Sample-Efficient LLM Evaluation in Saturated Benchmarks

arXiv:2605.11209v1 Announce Type: new
Abstract: While existing benchmarks demonstrate the near-perfect performance of large language models (LLMs) on various tasks, this apparent saturation often obscures the need for rigorous evaluation of their reli…