cs.AI, cs.LG

P^2O: Joint Policy and Prompt Optimization

arXiv:2603.21877v3 Announce Type: replace-cross
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) enhances Large Language Model (LLM) reasoning but suffers from advantage collapse on “hard samples” where all rollouts fail. This lack of…
