cs.LG

Think Outside the Policy: In-Context Steered Policy Optimization

arXiv:2510.26519v3 Announce Type: replace
Abstract: Existing Reinforcement Learning from Verifiable Rewards (RLVR) methods, such as Group Relative Policy Optimization (GRPO), have achieved remarkable progress in improving the reasoning capabilities of…