Don’t Pass@k: A Bayesian Framework for Large Language Model Evaluation
arXiv:2510.04265v4
Abstract: Pass$@k$ is widely used to report the reasoning performance of LLMs, but it often produces unstable and potentially misleading rankings, especially when the number of trials (samples) is limite…