Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration
arXiv:2604.12843v1 Announce Type: new
Abstract: The rapid release of both language models and benchmarks makes it increasingly costly to evaluate every model on every dataset. In practice, models are often evaluated on different samples, making scores…