Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings
arXiv:2603.11321v2 Announce Type: replace-cross
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for post-training reasoning models. However, group-based methods such as Group Relative Policy Optimiza…