Data-Efficient RLVR via Off-Policy Influence Guidance
arXiv:2510.26491v2 Announce Type: replace
Abstract: Data selection is a critical aspect of Reinforcement Learning with Verifiable Rewards (RLVR) for enhancing the reasoning capabilities of large language models (LLMs). Current data selection methods a…