DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment
arXiv:2605.03327v1 Announce Type: cross
Abstract: Reinforcement learning is crucial for aligning large language models to perform complex reasoning tasks. However, current algorithms such as Group Relative Policy Optimization suffer from coarse graine…