HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs
arXiv:2604.20140v1 Announce Type: cross
Abstract: Direct Preference Optimization (DPO) is an effective framework for aligning large language models with human preferences, but it struggles with complex reasoning tasks. DPO optimizes for the likelihood…
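For context, the standard DPO objective (Rafailov et al., 2023), which this abstract builds on, is the negative log-sigmoid of a β-scaled margin between the policy's and a frozen reference model's log-ratios on chosen versus rejected responses. A minimal single-pair sketch (illustrative only; this is the generic DPO loss, not the paper's HiPO method):

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities of the chosen (w) and rejected (l)
    responses under the trainable policy and the frozen reference model.
    """
    # beta-scaled margin between policy/reference log-ratios
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log sigmoid(margin): shrinks as the policy prefers the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy and reference assign identical log-probabilities, the margin is zero and the loss equals log 2; as the policy increasingly prefers the chosen response, the loss falls toward zero.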