cs.CL, cs.LG

Failure Modes of Maximum Entropy RLHF

arXiv:2509.20265v3 Announce Type: replace-cross
Abstract: We show that Simple Preference Optimization (SimPO) can be derived as Maximum Entropy Reinforcement Learning, which gives this reference-free method a theoretical foundation. M…