LLMs Know They’re Wrong and Agree Anyway: The Shared Sycophancy-Lying Circuit
arXiv:2604.19117v1 Announce Type: new
Abstract: When a language model agrees with a user’s false belief, is it failing to detect the error, or noticing and agreeing anyway? We show the latter. Across twelve open-weight models from five labs, spanning …