Yuheng Zhang, Chenlu Ye, Shuowei Jin, Changlong Yu, Wei Xiong, Saurabh Sahu, Nan Jiang

Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective

Yuheng Zhang, Chenlu Ye, Shuowei Jin, Changlong Yu, Wei Xiong, Saurabh Sahu, Nan Jiang / May 11, 2026

arXiv:2605.07331v1 Announce Type: cross
Abstract: Reinforcement learning, including reinforcement learning with verifiable rewards (RLVR), has emerged as a powerful approach for LLM post-training. Central to these approaches is the design of the impor…