cs.AI, cs.CR, cs.LG

Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers

arXiv:2604.25891v1 Announce Type: new
Abstract: Finetuning a language model can lead to emergent misalignment (EM) [Betley et al., 2025b]. Models trained on a narrow distribution of misaligned behavior generalize to more egregious behaviors when teste…