RMGAP: Benchmarking the Generalization of Reward Models across Diverse Preferences
arXiv:2605.01831v1 Announce Type: cross
Abstract: Reinforcement Learning from Human Feedback (RLHF) has become the standard paradigm for language model alignment, where reward models directly determine alignment effectiveness. In this work, we focus on how t…