cs.AI, cs.LG

Overtrained, Not Misaligned

arXiv:2605.12199v1 Announce Type: new
Abstract: Emergent misalignment (EM), in which fine-tuning on a narrow task (such as writing insecure code) causes broad misalignment across unrelated domains, was first demonstrated by Betley et al. (2025). We conduct the most…