SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology
arXiv:2603.27977v2 Announce Type: replace
Abstract: Reinforcement learning is critical to improving large reasoning models, but its success relies heavily on verifiable rewards (as in RLVR), making it hard to apply in open-ended domains where correctness is ambiguous and cannot be verified. Moreover, reasoning trajectories remain largely unconstrained, and optimizing solely toward the final answer can favor early exploitation over generalization. In this work, we ask whether general reasoning ability can be improved by teaching models how to think (the structure of reasoning) rather than what to produce (the outcome of reasoning), and we extend traditional RLVR to open-ended settings. We introduce Structure-Aware Reinforcement Learning (SARL), a label-free framework that constructs per-response reasoning maps from intermediate thinking steps and rewards their reasoning topology. SARL shifts supervision from destination to path, encouraging reasoning trajectories that are both locally coherent and globally efficient. On verifiable math tasks, SARL outperforms prior label-free RL baselines and even exceeds RL methods with ground-truth supervision, with average gains of +9.1% under PPO and +11.6% under GRPO across four math benchmarks, and particularly large improvements on AIME25 (+35.5% with PPO and +44.7% with GRPO). On non-verifiable open-ended tasks, SARL achieves average gains of +34.6% under PPO and +30.4% under GRPO on WildBench across five task categories, outperforming prior label-free RL methods as well as DPO, which relies on additional preference labels. Beyond strong performance, SARL exhibits substantially lower KL divergence and higher policy entropy, indicating more stable and exploratory training dynamics. Code and data are available at https://github.com/cacayaya/SARL.
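To make the "reward the path, not the destination" idea concrete, here is a minimal, purely illustrative sketch of a topology-style reward. It is NOT the paper's implementation: the step representation, the word-overlap coherence proxy, and the redundancy penalty are all assumptions chosen for self-containment. It models a reasoning trace as a chain of steps, scores local coherence between consecutive steps, and penalizes near-duplicate non-adjacent steps as a crude stand-in for global efficiency.

```python
# Hypothetical sketch only: a label-free, topology-flavored reward over a
# reasoning trace. All design choices here (word-overlap coherence, duplicate
# penalty, thresholds) are illustrative assumptions, not SARL's actual method.
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Toy coherence proxy: Jaccard word overlap between two reasoning steps."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0


def topology_reward(steps: list[str]) -> float:
    """Reward local coherence (adjacent steps share content) and penalize
    redundancy (non-adjacent steps that nearly repeat earlier work)."""
    if len(steps) < 2:
        return 0.0
    # Local coherence: mean overlap between each consecutive pair of steps.
    local = sum(jaccard(s, t) for s, t in zip(steps, steps[1:])) / (len(steps) - 1)
    # Global efficiency proxy: count near-duplicate non-adjacent step pairs
    # (loops and restated work make the reasoning path needlessly long).
    redundant = sum(
        1
        for i, j in combinations(range(len(steps)), 2)
        if j - i > 1 and jaccard(steps[i], steps[j]) > 0.8
    )
    return local - 0.5 * redundant


trace = [
    "let x be the unknown quantity",
    "express the equation in terms of x",
    "solve the equation for x",
]
print(round(topology_reward(trace), 3))
```

In a label-free RL loop, a score like this would replace the verifier signal: the policy is rewarded for trajectories whose step graph is coherent and non-redundant, with no ground-truth answer required.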