Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification
arXiv:2601.21244v3
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has advanced LLM reasoning, but remains constrained by inefficient exploration under limited rollout budgets, leading to low sampling succe…