Overtrained, Not Misaligned
arXiv:2605.12199v1 Announce Type: new
Abstract: Emergent misalignment (EM), in which fine-tuning on a narrow task (such as writing insecure code) causes broad misalignment across unrelated domains, was first demonstrated by Betley et al. (2025). We conduct the most…