cs.AI, cs.CL, cs.LG

Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR

arXiv:2603.07084v2 Announce Type: replace
Abstract: Reward hacking is a form of misalignment in which models overoptimize proxy rewards without genuinely solving the underlying task. Precisely measuring reward hacking occurrence remains challenging be…