Ömer Veysel Çağatan, Barış Akgün

Failure Modes of Maximum Entropy RLHF

Ömer Veysel Çağatan, Barış Akgün / April 30, 2026

arXiv:2509.20265v3 Announce Type: replace-cross
Abstract: In this paper, we show that Simple Preference Optimization (SimPO) can be derived as Maximum Entropy Reinforcement Learning, providing a theoretical foundation for this reference-free method. M…
