Filtered Reasoning Score: Evaluating Reasoning Quality on a Model’s Most-Confident Traces
arXiv:2604.11996v1 Announce Type: cross
Abstract: Should we trust Large Language Models (LLMs) simply because they achieve high accuracy? LLMs score well on reasoning benchmarks, but correctness alone does not reveal the quality of the reasoning used to produce i…
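The truncated abstract does not specify how the filtered score is computed, but the title suggests restricting a reasoning-quality metric to the traces on which the model is most confident. A minimal sketch of that idea, assuming each trace carries a confidence value and a quality score in [0, 1], and assuming a simple top-fraction filter (the function name, data layout, and filtering rule are illustrative, not the paper's definition):

```python
# Hypothetical sketch: average reasoning quality over only the
# most-confident traces. The top-fraction filter is an assumption,
# not the method defined in the paper.

def filtered_reasoning_score(traces, top_frac=0.5):
    """traces: list of (confidence, quality) pairs, quality in [0, 1].

    Returns the mean quality over the top `top_frac` most-confident
    traces (at least one trace is always kept)."""
    ranked = sorted(traces, key=lambda t: t[0], reverse=True)
    k = max(1, int(len(ranked) * top_frac))
    return sum(quality for _, quality in ranked[:k]) / k
```

Such a score can diverge sharply from plain accuracy: a model can answer correctly while its most-confident traces contain low-quality reasoning, which is exactly the gap the abstract points at.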