Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR
arXiv:2603.07084v2 Announce Type: replace
Abstract: Reward hacking is a form of misalignment in which models overoptimize proxy rewards without genuinely solving the underlying task. Precisely measuring reward hacking occurrence remains challenging be…