cs.LG, stat.ML

Efficient Evaluation of LLM Performance with Statistical Guarantees

arXiv:2601.20251v3 Announce Type: replace
Abstract: Exhaustively evaluating many large language models (LLMs) on a large suite of benchmarks is expensive. We cast benchmarking as finite-population inference and, under a fixed query budget, seek tight …
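
As a rough illustration of what finite-population inference under a fixed query budget can look like, the sketch below treats a benchmark as a finite population of items, scores only a simple random sample drawn without replacement, and reports a normal-approximation confidence interval with the finite-population correction. This is a textbook baseline chosen for the example, not the estimator proposed in the paper, which the truncated abstract does not specify. The names `estimate_mean_score` and `score_item` are hypothetical.

```python
import math
import random
from statistics import NormalDist


def estimate_mean_score(score_item, num_items, budget, alpha=0.05, seed=0):
    """Estimate the mean score over a finite benchmark from `budget` queries.

    score_item: hypothetical callable mapping an item index to a numeric score
                (for example 1.0 if the model answers correctly, else 0.0).
    num_items:  size N of the finite population (the whole benchmark).
    budget:     number n of items we can afford to query (2 <= n <= N).
    Returns (estimate, (lower, upper)) for a 1 - alpha confidence level.
    """
    if not 2 <= budget <= num_items:
        raise ValueError("budget must satisfy 2 <= budget <= num_items")

    # Simple random sample WITHOUT replacement: each item is queried at most once.
    rng = random.Random(seed)
    sampled = rng.sample(range(num_items), budget)
    scores = [score_item(i) for i in sampled]

    n = budget
    mean = sum(scores) / n
    sample_var = sum((s - mean) ** 2 for s in scores) / (n - 1)

    # Finite-population correction: uncertainty shrinks to zero as n approaches N.
    fpc = 1.0 - n / num_items
    std_err = math.sqrt(fpc * sample_var / n)

    # Normal approximation (an assumption; it can be poor for very small n).
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return mean, (mean - z * std_err, mean + z * std_err)


if __name__ == "__main__":
    # Synthetic benchmark of 1000 items on which the model is right 70% of the time.
    correct = [1.0 if i % 10 < 7 else 0.0 for i in range(1000)]
    est, (lo, hi) = estimate_mean_score(lambda i: correct[i], len(correct), budget=200)
    print(f"estimate={est:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

When the budget equals the benchmark size, the correction factor is zero and the interval collapses to the exact mean, which is the sense in which the population is finite rather than a draw from an infinite distribution.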