Reward Modeling from Natural Language Human Feedback
arXiv:2601.07349v3 Announce Type: replace
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) on preference data has become the mainstream approach for training Generative Reward Models (GRMs). Typically, in pairwise rewarding tasks, GRMs ge…