cs.AI, cs.CR

Tatemae: Detecting Alignment Faking via Tool Selection in LLMs

arXiv:2604.26511v1 Announce Type: cross
Abstract: Alignment faking (AF) occurs when an LLM strategically complies with training objectives to avoid value modification, reverting to prior preferences once monitoring is lifted. Current detection methods…