Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph
arXiv:2605.08037v1 Announce Type: cross
Abstract: Direct Preference Optimization (DPO) aligns language models using pairwise preference comparisons, offering a simple and effective alternative to Reinforcement Learning from Human Feedback (RLHF). Howeve…
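For context, the pairwise objective the abstract refers to is the standard DPO loss (Rafailov et al., 2023): the negative log-sigmoid of a scaled difference of policy-vs-reference log-ratios for the chosen and rejected responses. The sketch below assumes per-response log-probabilities are already computed; function and argument names are illustrative, not from the paper.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Each argument is a summed token log-probability of a full response
    under the policy (logp_*) or the frozen reference model (ref_logp_*).
    """
    # Implicit rewards are beta-scaled log-ratios against the reference.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the reward margin.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss shrinks as the policy widens the chosen-over-rejected margin relative to the reference, e.g. `dpo_loss(-1.0, -2.0, -1.5, -1.5, beta=1.0)` gives `-log(sigmoid(1.0)) ≈ 0.3133`.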