Beyond Completion: Probing Cumulative State Tracking to Predict LLM Agent Performance
arXiv:2603.27343v1 Announce Type: new
Abstract: Task-completion rate is the standard proxy for LLM agent capability, but models with identical completion scores can differ substantially in their ability to track intermediate state. We introduce Workin…