Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing
arXiv:2602.03452v2 Announce Type: replace
Abstract: Reinforcement learning with verifiable rewards (RLVR) is effective for training large language models on deterministic-outcome reasoning tasks. Prior work shows that RLVR works with few prompts, but promp…