Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR
arXiv:2507.15778v2
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective post-training method for improving the reasoning abilities of Large Language Models (LLMs). However, existing methods mai…