Leon Eshuijs, Shihan Wang, Antske Fokkens

Safety Training Modulates Harmful Misalignment Under On-Policy RL, But Direction Depends on Environment Design

Leon Eshuijs, Shihan Wang, Antske Fokkens / April 15, 2026

arXiv:2604.12500v1 Announce Type: new
Abstract: Specification gaming under Reinforcement Learning (RL) is known to cause LLMs to develop sycophantic, manipulative, or deceptive behavior, yet the conditions under which this occurs remain unclear. We tr…
