cs.AI, cs.LG

HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs

arXiv:2604.20140v1 Announce Type: cross
Abstract: Direct Preference Optimization (DPO) is an effective framework for aligning large language models with human preferences, but it struggles with complex reasoning tasks. DPO optimizes for the likelihood…
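For orientation, the DPO objective the abstract refers to is the standard one from Rafailov et al.: maximize the margin between the policy's log-likelihood of the preferred response and the dispreferred one, regularized against a frozen reference model. A minimal per-example sketch (the function name, argument names, and `beta` default are illustrative, not from this paper):

```python
import math

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    pi_logp_w / pi_logp_l: policy log-probs of the preferred (w) and
    dispreferred (l) responses; ref_logp_* are the same under the frozen
    reference model. beta scales the implicit KL penalty.
    """
    # Margin of the policy's preference over the reference's preference.
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    # -log sigmoid(margin): small when the policy clearly prefers y_w.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With equal margins the loss sits at `log 2`, and it shrinks as the policy separates the preferred response from the dispreferred one relative to the reference.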