Learning to Rank Caption Chains for Video-Text Alignment
arXiv:2603.25145v1 Announce Type: new
Abstract: Direct preference optimization (DPO) is an effective technique to train language models to generate preferred over dispreferred responses. However, this binary “winner-takes-all” approach is suboptimal f…