Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers
arXiv:2505.04842v2 Announce Type: replace
Abstract: Prevalent reinforcement learning (RL) methods for fine-tuning LLM reasoners, such as GRPO or Leave-one-out PPO, abandon the learned value function in favor of empirically estimated returns. This hind…