Beyond Output Correctness: Benchmarking and Evaluating Large Language Model Reasoning in Coding Tasks
arXiv:2604.12379v1 Announce Type: cross
Abstract: Large language models (LLMs) increasingly rely on explicit reasoning to solve coding tasks, yet evaluating the quality of this reasoning remains challenging. Existing reasoning evaluators are not designed…