TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
arXiv:2605.12288v2
Abstract: Direct Preference Optimization (DPO) is a widely used RL-free method for aligning language models from pairwise preferences, but it models preferences over full sequences even though generation…