f-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment
arXiv:2602.05946v3 Announce Type: replace-cross
Abstract: Recent work shows that preference alignment objectives can be interpreted as divergence estimators between aligned (preferred) and unaligned (less-preferred) distributions, yielding a principled …
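For context on the "divergence estimator" framing (this is standard background, not taken from the truncated abstract): an f-divergence between a preferred distribution P and a less-preferred distribution Q admits the variational lower bound of Nguyen, Wainwright, and Jordan, which is what makes it estimable from samples of the two distributions. A sketch in standard notation, where f is convex with f(1) = 0 and f* denotes its convex conjugate:

```latex
% Variational (Nguyen-Wainwright-Jordan) lower bound on an f-divergence,
% maximized over critics T : X -> dom(f*):
D_f(P \,\|\, Q)
  = \sup_{T} \; \mathbb{E}_{x \sim P}\bigl[T(x)\bigr]
            - \mathbb{E}_{x \sim Q}\bigl[f^{*}(T(x))\bigr].
% With samples from P (preferred) and Q (less-preferred), the supremand
% gives a sample-based objective whose maximizer estimates D_f(P || Q);
% different choices of f recover different alignment-style objectives.
```

Any connection between this bound and the specific f-GRPO algorithm is an inference from the abstract's framing, not a claim about the paper's method.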