cs.AI, cs.CL, math.ST, stat.ML, stat.TH

Don’t Pass@k: A Bayesian Framework for Large Language Model Evaluation

arXiv:2510.04265v4 Announce Type: replace-cross
Abstract: Pass$@k$ is widely used to report the reasoning performance of LLMs, but it often produces unstable and potentially misleading rankings, especially when the number of trials (samples) is limite…