Skyler Wu, Yash Nair, Emmanuel J. Candès

Efficient Evaluation of LLM Performance with Statistical Guarantees

Skyler Wu, Yash Nair, Emmanuel J. Candès / May 12, 2026

arXiv:2601.20251v3
Abstract: Exhaustively evaluating many large language models (LLMs) on a large suite of benchmarks is expensive. We cast benchmarking as finite-population inference and, under a fixed query budget, seek tight …
