cs.CL

Aligning What LLMs Do and Say: Towards Self-Consistent Explanations

arXiv:2506.07523v3 Announce Type: replace
Abstract: Large language models (LLMs) seem to offer an easy path to interpretability: just ask them to explain their answers. Yet the features driving an answer often differ from those emphasized in its expla…