How Reasoning Evolves from Post-Training Data: An Empirical Study Using Chess
arXiv:2604.05134v2 Announce Type: replace-cross
Abstract: We study how reasoning evolves in a language model, from supervised fine-tuning (SFT) to reinforcement learning (RL), by analyzing how a set of theoretically inspired datasets influences language model performance in chess. We find that fine-tuning a model to directly predict the best move leads to effective RL and the strongest downstream performance; however, the RL stage elicits \textit{unfaithful} reasoning (reasoning inconsistent with the chosen move). In contrast, training on multi-move trajectories yields comparable downstream performance with faithful reasoning and more stable RL. We analyze multiple qualitative and quantitative measures and highlight how they evolve from SFT through RL; we find several SFT-checkpoint metrics, spanning evaluation performance, hallucination rates, and reasoning quality, to be predictive of post-RL model performance. Finally, we ground our results with an experiment measuring \textit{chess information density} in our custom datasets. We release the models, training data, evaluations, and code that allowed us to surpass leading open-source reasoning models in chess with a 7B-parameter model; all are available at https://github.com/lucasdino/lang-chess.
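To make the faithfulness notion concrete, the following is a minimal, hypothetical sketch of a check in the spirit of the abstract: reasoning is treated as unfaithful when the move argued for in the chain of thought differs from the move the model ultimately plays. The function name, the SAN regex, and the "last mentioned move" heuristic are illustrative assumptions, not the authors' implementation.

```python
import re

# Loose SAN pattern (castling, piece moves, pawn moves, promotions); assumed for illustration.
SAN_MOVE = re.compile(r"\b(?:O-O(?:-O)?|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](?:=[QRBN])?)\b")

def is_faithful(reasoning: str, chosen_move: str) -> bool:
    """Return True if the last move mentioned in the reasoning matches the chosen move."""
    mentioned = SAN_MOVE.findall(reasoning)
    return bool(mentioned) and mentioned[-1] == chosen_move

# Example: the reasoning argues for Nf3 but the model answers e4 -> unfaithful.
print(is_faithful("Developing with Nf3 controls the center and prepares castling.", "e4"))   # False
print(is_faithful("Best is Nf3, developing and preparing to castle.", "Nf3"))                # True
```

In practice, a faithfulness metric of this kind would likely need a more careful move extractor (e.g. parsing candidate moves in context rather than taking the last SAN token), but the sketch captures the consistency check the abstract describes.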