cs.LG

Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution

arXiv:2602.06239v2 Announce Type: replace
Abstract: We introduce PEPO (Pessimistic Ensemble based Preference Optimization), a single-step Direct Preference Optimization (DPO)-like algorithm that mitigates the well-known over-optimization issue in prefere…