cs.CL

EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget

arXiv:2510.05837v2 Announce Type: replace
Abstract: Balancing exploration and exploitation remains a central challenge in reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs). Current RLVR methods often overemphasize …