Beyond Semantic Manipulation: Token-Space Attacks on Reward Models
arXiv:2604.02686v1 Announce Type: new
Abstract: Reward models (RMs) are widely used as optimization targets in reinforcement learning from human feedback (RLHF), yet they remain vulnerable to reward hacking. Existing attacks mainly operate within the …