When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models
arXiv:2605.00817v1
Abstract: Large language models (LLMs) often achieve strong performance on reasoning benchmarks, but final-answer accuracy alone does not show whether they faithfully execute the procedure specified in a prompt. W…