Reward Models Are Secretly Value Functions: Temporally Coherent Reward Modeling
arXiv:2604.22981v1 Announce Type: new
Abstract: Reward models in RLHF are trained to score only the final token of a response – a choice that discards rich signal from every intermediate position and produces models whose token-level outputs are noise…
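The contrast the abstract draws can be sketched in a few lines. This is a toy illustration, not the paper's method: it assumes a reward model factored as a backbone producing per-token hidden states plus a linear scalar reward head (all names and sizes here are hypothetical), and shows that scoring only the final token keeps a single number while applying the same head at every position yields a full per-token trajectory.

```python
import numpy as np

# Toy illustration (not the paper's implementation): a reward model is a
# backbone producing hidden states plus a scalar linear "reward head".
rng = np.random.default_rng(0)

seq_len, hidden = 8, 16                              # hypothetical sizes
hidden_states = rng.normal(size=(seq_len, hidden))   # one state per token
reward_head = rng.normal(size=hidden)                # linear scalar head

# Standard RLHF training scores only the final token's hidden state...
final_reward = hidden_states[-1] @ reward_head

# ...but the same head applied at every position yields a per-token
# trajectory of scores -- the intermediate signal the abstract says
# final-token-only training discards.
per_token_rewards = hidden_states @ reward_head      # shape: (seq_len,)

# The usual scalar reward is just the last entry of that trajectory.
assert np.isclose(final_reward, per_token_rewards[-1])
```

Under this factoring, final-token scoring throws away `seq_len - 1` of the `seq_len` values the head produces anyway; whether those intermediate values are signal or noise is exactly what the abstract is about.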