Failure Modes of Maximum Entropy RLHF
arXiv:2509.20265v3
Abstract: We show that Simple Preference Optimization (SimPO) can be derived as Maximum Entropy Reinforcement Learning, which gives this reference-free method a theoretical foundation. M…