Reward Models Are Secretly Value Functions: Temporally Coherent Reward Modeling
arXiv:2604.22981v1 Announce Type: new
Abstract: Reward models in RLHF are trained to score only the final token of a response – a choice that discards rich signal from every intermediate position and produces models whose token-level outputs are noise…
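The contrast the abstract draws can be sketched in a few lines. This is a toy illustration, not the paper's method: it assumes a reward model factored as a backbone producing per-token hidden states plus a linear scalar reward head (all names and sizes here are hypothetical), and shows that scoring only the final token keeps a single number while applying the same head at every position yields a full per-token trajectory.

```python
import numpy as np

# Toy illustration (not the paper's implementation): a reward model is a
# backbone producing hidden states plus a scalar linear "reward head".
rng = np.random.default_rng(0)

seq_len, hidden = 8, 16                              # hypothetical sizes
hidden_states = rng.normal(size=(seq_len, hidden))   # one state per token
reward_head = rng.normal(size=hidden)                # linear scalar head

# Standard RLHF training scores only the final token's hidden state...
final_reward = hidden_states[-1] @ reward_head

# ...but the same head applied at every position yields a per-token
# trajectory of scores -- the intermediate signal the abstract says
# final-token-only training discards.
per_token_rewards = hidden_states @ reward_head      # shape: (seq_len,)

# The usual scalar reward is just the last entry of that trajectory.
assert np.isclose(final_reward, per_token_rewards[-1])
```

Under this factoring, final-token scoring throws away `seq_len - 1` of the `seq_len` values the head produces anyway; whether those intermediate values are signal or noise is exactly what the abstract is about.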