EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training
arXiv:2604.19485v1 Announce Type: new
Abstract: Reinforcement learning (RL) for LLM post-training faces a fundamental design choice: whether to use a learned critic as a baseline for policy optimization. Classical theory favors critic-based methods su…