Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents
arXiv:2510.02837v2
Abstract: Although recent tool-augmented benchmarks involve complex requests, evaluation remains limited to answer matching, neglecting critical trajectory aspects like efficiency, hallucination, and adaptivit…