Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance
arXiv:2604.23318v1 Announce Type: new
Abstract: Group Relative Policy Optimization (GRPO) performs coarse-grained credit assignment in reinforcement learning with verifiable rewards (RLVR) by assigning the same advantage to all tokens in a rollout. Pr…
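To make the coarse-grained credit assignment described in the abstract concrete, here is a minimal sketch of how a GRPO-style advantage is commonly computed and then copied to every token in a rollout. The function name `grpo_token_advantages`, the `eps` stabiliser, the use of the population standard deviation, and the toy rewards and lengths are all illustrative assumptions rather than details from the paper.

```python
from statistics import mean, pstdev


def grpo_token_advantages(rewards, lengths, eps=1e-8):
    """Return one advantage list per rollout, with the same value at every token.

    rewards: one scalar verifiable reward per rollout in the group.
    lengths: number of generated tokens in each rollout.
    eps:     small constant to avoid division by zero (assumed, not from the paper).
    """
    mu = mean(rewards)
    # Population standard deviation is an assumption; implementations differ.
    sigma = pstdev(rewards)
    # One group-normalised advantage per rollout.
    advantages = [(r - mu) / (sigma + eps) for r in rewards]
    # Coarse-grained credit assignment: broadcast that scalar to all tokens.
    return [[a] * n for a, n in zip(advantages, lengths)]


# Toy group of four rollouts for one prompt (hypothetical values).
rewards = [1.0, 0.0, 0.0, 1.0]
lengths = [3, 5, 2, 4]
token_adv = grpo_token_advantages(rewards, lengths)

for r, adv in zip(rewards, token_adv):
    print(f"reward={r}: {len(adv)} tokens, advantage per token={adv[0]:+.3f}")
```

Every token in a correct rollout receives the same positive credit and every token in an incorrect rollout the same negative credit, regardless of where in the sequence the reasoning actually went right or wrong.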