The Implicit Curriculum: Learning Dynamics in RL with Verifiable Rewards
arXiv:2602.14872v2 Announce Type: replace-cross
Abstract: Reinforcement learning with verifiable rewards (RLVR) has been a primary driver of recent breakthroughs in large reasoning models. Yet it remains a mystery how rewards based solely on final outcomes can help overcome the long-horizon barrier to extended reasoning. To understand this, we develop a theory of the training dynamics of RLVR for transformers on compositional reasoning tasks. Our theory shows that mixed-difficulty training naturally follows an implicit curriculum: without any explicit schedule, easier problems become learnable first and shape the frontier for harder ones, creating an easy-to-hard learning progression over the course of optimization. The effectiveness of this curriculum is governed by the smoothness of the difficulty spectrum. When the spectrum is smooth, the training dynamics enter a well-behaved relay regime, in which persistent gradient signals on easier problems make slightly harder ones tractable and keep training at the edge of competence. When the spectrum contains abrupt discontinuities, training undergoes grokking-type phase transitions, with prolonged plateaus before progress resumes. As a technical contribution, our analysis develops and adapts techniques from Fourier analysis on finite groups to our setting. We validate the predicted mechanisms empirically in synthetic experiments.
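To make the claimed mechanism concrete, here is a minimal toy sketch (our own illustration, not the paper's construction): REINFORCE-style policy gradient with an outcome-only verifiable reward on a family of compositional tasks, where a difficulty-k task requires chaining sub-skills 1 through k. All modeling choices below (the logit parameterization theta, uniform difficulty sampling, the 90% mastery threshold) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 6                        # hardest task composes K sub-skills
theta = np.full(K, -2.0)     # per-sub-skill logits; every skill starts weak
lr, batch, steps = 0.5, 128, 5000
learned_at = {}              # first step at which each difficulty exceeds 90% accuracy

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for step in range(1, steps + 1):
    grad = np.zeros(K)
    for _ in range(batch):
        k = int(rng.integers(1, K + 1))        # mixed-difficulty sampling
        p = sigmoid(theta[:k])
        s = (rng.random(k) < p).astype(float)  # per-step Bernoulli outcomes
        reward = s.prod()                      # verifiable reward: 1 iff every step is correct
        grad[:k] += reward * (s - p)           # REINFORCE: reward * grad of log-likelihood
    theta += lr * grad / batch
    acc = np.cumprod(sigmoid(theta))           # exact success probability per difficulty
    for d in range(1, K + 1):
        if d not in learned_at and acc[d - 1] > 0.9:
            learned_at[d] = step

print(learned_at)  # difficulties cross the threshold in increasing order of k
```

Because a difficulty-k task yields reward, and hence gradient, only when the whole chain succeeds, gradient signal concentrates on the easiest tasks early on; as the shared prefix skills strengthen, slightly harder tasks begin to produce reward, reproducing the relay behavior the abstract describes. Running the script prints the step at which each difficulty is mastered, in increasing order of k.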