Safety Training Modulates Harmful Misalignment Under On-Policy RL, But Direction Depends on Environment Design
arXiv:2604.12500v1 Announce Type: new
Abstract: Specification gaming under Reinforcement Learning (RL) is known to cause LLMs to develop sycophantic, manipulative, or deceptive behavior, yet the conditions under which this occurs remain unclear. We tr…