On Randomness in Agentic Evals
arXiv:2602.07150v3 Announce Type: replace-cross
Abstract: Agentic systems are evaluated on benchmarks where agents interact with environments to solve tasks. Most papers report a pass@1 score computed from a single run per task, assuming this gives a …
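
To illustrate the single-run pass@1 score mentioned in the abstract, here is a minimal Python sketch that computes pass@1 from one outcome per task and then repeats that single-run evaluation several times to show how much the score can move. It is not the paper's method or data. The helper names (`pass_at_1`, `simulate_run`, `single_run_spread`), the assumption that each task is solved independently with a fixed probability, and the 100-task benchmark with a 0.5 success probability are all hypothetical and chosen only for illustration.

```python
import random
import statistics


def pass_at_1(outcomes):
    """Fraction of tasks solved, given one boolean outcome per task."""
    return sum(outcomes) / len(outcomes)


def simulate_run(success_probs, rng):
    """Simulate one run per task (assumption: task i succeeds independently
    with probability success_probs[i])."""
    return [rng.random() < p for p in success_probs]


def single_run_spread(success_probs, n_runs, seed=0):
    """Repeat the single-run evaluation n_runs times and summarize the
    resulting pass@1 estimates."""
    rng = random.Random(seed)
    scores = [pass_at_1(simulate_run(success_probs, rng)) for _ in range(n_runs)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


if __name__ == "__main__":
    # Hypothetical benchmark: 100 tasks, each solved with probability 0.5.
    probs = [0.5] * 100
    print(single_run_spread(probs, n_runs=20))
```

Under these assumptions, the gap between `min` and `max` shows how far apart two single-run pass@1 reports for the same agent could land, while the `mean` over repeated runs is a steadier summary. Real agents and benchmarks need not follow this independence model.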