Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards
arXiv:2605.03862v3 Announce Type: replace-cross
Abstract: Reinforcement learning with verifiable rewards has become a common way to improve explicit reasoning in large language models, but final-answer correctness alone does not reveal whether the rea…