Mitigating Misalignment Contagion by Steering with Implicit Traits
arXiv:2605.02751v2 Announce Type: replace
Abstract: Language models (LMs) are increasingly used in high-stakes, multi-agent settings, where following instructions and maintaining value alignment are critical. Most alignment research focuses on interac…