Authors: Inderjeet Nair, Jie Ruan, Lu Wang

Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models

Inderjeet Nair, Jie Ruan, Lu Wang / April 29, 2026

arXiv:2604.20995v2 Announce Type: replace-cross
Abstract: Alignment faking, where a model behaves aligned with developer policy when monitored but reverts to its own preferences when unobserved, is a concerning yet poorly understood phenomenon, in par…

cs.AI, cs.CL, cs.SE

Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models

Inderjeet Nair, Jie Ruan, Lu Wang / April 24, 2026

arXiv:2604.20995v1 Announce Type: new
Abstract: Alignment faking, where a model behaves aligned with developer policy when monitored but reverts to its own preferences when unobserved, is a concerning yet poorly understood phenomenon, in part because …