Characterizing the Consistency of the Emergent Misalignment Persona
arXiv:2604.28082v1 Announce Type: new
Abstract: Fine-tuning large language models (LLMs) on narrowly misaligned data generalizes to broadly misaligned behavior, a phenomenon termed emergent misalignment (EM). While prior work has found a correlation b…