Emergent Slow Thinking in LLMs as Inverse Tree Freezing
arXiv:2509.23629v3 Announce Type: replace-cross
Abstract: Reinforcement learning with verifiable rewards (RLVR) enables large language models to acquire slow, multi-step reasoning from sparse final-answer signals. We provide a statistical-physics picture of this emergence. We show that an autoregressive model's finite capacity forces it to compress its exponentially large prefix space into a Markov network of predictive states, on which slow thinking unfolds as a random walk -- the Concept Network (CoNet) picture. Within CoNet, RLVR dynamics are governed by two mechanisms: merging of compatible paths and frustrated competition among incompatible ones. Together they drive the network through nucleation, growth, and freezing into multi-input, single-output directed inverse trees. The picture reproduces the training dynamics of a 1.5-billion-parameter LLM and yields three predictions: reasoning chains lengthen as a geometric necessity of sparse topology; supervised fine-tuning (SFT) induces catastrophic forgetting through bridge-node rupture; and frustration drives policy collapse. Building on the structural timing inherent in inverse-tree freezing, we propose Annealed-RLVR -- a brief SFT intervention at the moment of maximum frustration. It outperforms standard RLVR on both in- and out-of-distribution benchmarks, with the largest gains at high sampling budgets, where standard RLVR collapses. The same SFT applied after the trees freeze instead triggers catastrophic forgetting, isolating timing as the active ingredient.
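To make the CoNet vocabulary concrete, the following is a minimal toy sketch, not the paper's construction: predictive states form a sparse directed graph, reasoning is a random walk on it, and a sparse verified reward multiplicatively reinforces edges along successful paths (merging) while decaying edges on losing alternatives (frustrated competition). All names and constants here (n_states, eta, the 0.15 edge density) are illustrative assumptions.

```python
# Toy sketch of the CoNet picture: a random walk on a sparse concept
# network, reinforced by a sparse verifiable reward. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)

n_states = 40          # predictive states (compressed prefixes)
answer = n_states      # single absorbing "final answer" node

# Sparse random transition weights; every state also has a weak direct
# jump to the answer, but only a few entry nodes make that jump verify.
W = rng.random((n_states, n_states + 1))
W *= rng.random((n_states, n_states + 1)) < 0.15
W[:, answer] += 0.01
correct_entry = set(rng.choice(n_states, size=5, replace=False))

def walk(start, max_steps=60):
    """Random walk on the concept network; returns (path, verified)."""
    path, s = [start], start
    for _ in range(max_steps):
        p = W[s] / W[s].sum()
        nxt = rng.choice(n_states + 1, p=p)
        path.append(nxt)
        if nxt == answer:
            return path, s in correct_entry
        s = nxt
    return path, False

eta = 0.5  # reinforcement rate for the sparse, RLVR-like reward
for epoch in range(201):
    lengths = []
    for _ in range(50):
        path, ok = walk(rng.integers(n_states))
        for a, b in zip(path[:-1], path[1:]):
            # Merging: edges shared by verified paths compound together;
            # frustration: edges on losing alternatives slowly decay.
            W[a, b] *= (1 + eta) if ok else (1 - 0.05)
        if ok:
            lengths.append(len(path) - 1)
    W /= W.sum(axis=1, keepdims=True)  # only relative mass matters
    if epoch % 50 == 0 and lengths:
        print(f"epoch {epoch:3d}  mean verified chain length = {np.mean(lengths):.1f}")
```

As reinforcement concentrates probability mass, many starting states funnel through the few verified entry nodes into the single answer node, which is the multi-input, single-output inverse-tree shape the abstract describes; the toy only illustrates this vocabulary and makes no quantitative claim about the paper's dynamics.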
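The Annealed-RLVR recipe hinges on timing: a brief SFT phase spliced in at peak frustration. Below is a minimal sketch of that scheduling logic under the assumption that frustration is available as a scalar training signal; the turnover detector, thresholds, and the synthetic parabolic curve are all illustrative stand-ins, not the paper's measured quantity.

```python
def annealed_rlvr(steps, frustration, sft_steps=100, patience=10):
    """Yield (step, phase): RLVR throughout, with one brief SFT anneal
    where the frustration signal turns over from its running peak."""
    peak, below, fired, sft_left = 0.0, 0, False, 0
    for t in range(steps):
        if sft_left > 0:
            sft_left -= 1
            yield t, "sft"
            continue
        f = frustration(t)
        peak = max(peak, f)
        # Count consecutive steps clearly below the running maximum.
        below = below + 1 if f < 0.99 * peak else 0
        if not fired and below >= patience:
            fired, sft_left = True, sft_steps - 1
            yield t, "sft"
            continue
        yield t, "rlvr"

# Synthetic rise-and-fall frustration curve (peaks at step 1000),
# standing in for a measured competition signal on the concept network.
phases = list(annealed_rlvr(2000, lambda t: t * (2000 - t)))
trigger = next(t for t, p in phases if p == "sft")
print(f"SFT anneal fires at step {trigger} of 2000")  # shortly after the peak
```

The paper's control condition, applying the same SFT after the trees freeze, would correspond to hardcoding a late trigger in this sketch; per the abstract, that variant instead causes catastrophic forgetting, isolating timing as the active ingredient.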