Delightful Gradients Accelerate Corner Escape
arXiv:2605.11908v1 Announce Type: new
Abstract: Softmax policy gradient converges at $O(1/t)$, but its transient behavior near sub-optimal corners of the simplex can be exponentially slow. The bottleneck is self-trapping: negative-advantage actions re…
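To make the slow-transient claim concrete, here is a minimal sketch of exact softmax policy gradient on a two-armed bandit that starts near the sub-optimal corner. It is not the paper's setup or method: the two-arm bandit, the rewards `(1.0, 0.0)`, the step size, and the helper name `steps_to_escape` are all assumptions chosen for illustration. It only shows the standard gradient $\partial V / \partial \theta_a = \pi_a (r_a - V)$ and how the number of steps needed to leave the corner grows with the initial logit gap.

```python
import math


def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]


def steps_to_escape(gap, rewards=(1.0, 0.0), lr=0.5, max_steps=1_000_000):
    """Count exact-gradient steps until the best arm has probability > 0.5.

    Hypothetical helper for illustration. The policy starts with the
    sub-optimal arm's logit `gap` above the best arm's logit, so a larger
    `gap` means a start closer to the sub-optimal corner of the simplex.
    """
    best = max(range(len(rewards)), key=lambda a: rewards[a])
    theta = [0.0 if a == best else float(gap) for a in range(len(rewards))]
    for t in range(max_steps):
        pi = softmax(theta)
        if pi[best] > 0.5:
            return t
        value = sum(p * r for p, r in zip(pi, rewards))
        # Exact softmax policy gradient: dV/dtheta_a = pi_a * (r_a - V).
        theta = [th + lr * p * (r - value)
                 for th, p, r in zip(theta, pi, rewards)]
    return None  # did not escape within max_steps


if __name__ == "__main__":
    for gap in (2, 4, 6, 8):
        print(gap, steps_to_escape(gap))
```

In this toy setting the gradient on the best arm is scaled by its own probability, which is roughly $e^{-\text{gap}}$ near the corner, so the escape time grows roughly exponentially in the gap even though the asymptotic rate is $O(1/t)$.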