Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models
arXiv:2604.20995v2 Announce Type: replace-cross
Abstract: Alignment faking, where a model behaves aligned with developer policy when monitored but reverts to its own preferences when unobserved, is a concerning yet poorly understood phenomenon, in par…