cs.AI

Beyond Static Snapshots: A Grounded Evaluation Framework for Language Models at the Agentic Frontier

arXiv:2604.17573v2 Announce Type: replace
Abstract: We argue that current evaluation frameworks for large language models (LLMs) suffer from four systematic failures that make them structurally inadequate for deployed, agentic systems: distributional,…