cs.AI

Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents

arXiv:2603.29231v1 Announce Type: new
Abstract: Existing benchmarks measure capability (whether a model succeeds on a single attempt), but production deployments
require reliability: consistent success across repeated attempts on tasks of varyi…