Decomposing the Delta: What Do Models Actually Learn from Preference Pairs?
arXiv:2604.08723v1 Announce Type: new
Abstract: Preference optimization methods such as DPO and KTO are widely used for aligning language models, yet little is understood about which properties of preference data drive downstream reasoning gains. We as…
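
Since the abstract centers on DPO, a minimal sketch of the standard DPO objective (Rafailov et al., 2023) may help situate the title's "delta": the loss depends on each preference pair only through the margin between the chosen and rejected log-probability ratios, so any property of the data must act through that difference. This is the generic DPO loss, not the paper's own decomposition; the function and tensor names below are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss over a batch of preference pairs.

    Each argument is the summed log-probability of the chosen (y_w) or
    rejected (y_l) completion under the policy or the frozen reference model.
    """
    # Implicit reward of each completion: beta-scaled log-ratio vs. reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # The loss sees the pair only through this margin (the "delta"):
    # -log sigmoid(r_w - r_l), a logistic loss on the reward difference.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```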