BLOCK-EM: Preventing Emergent Misalignment via Latent Blocking
arXiv:2602.00767v2 Announce Type: replace
Abstract: Emergent misalignment can arise when a language model is fine-tuned on a narrowly scoped supervised objective: the model learns the target behavior, yet also develops undesirable out-of-domain behavi…