cs.AI, cs.LG

Beyond Semantic Manipulation: Token-Space Attacks on Reward Models

arXiv:2604.02686v1 Announce Type: new
Abstract: Reward models (RMs) are widely used as optimization targets in reinforcement learning from human feedback (RLHF), yet they remain vulnerable to reward hacking. Existing attacks mainly operate within the …