CausalDetox: Causal Head Selection and Intervention for Language Model Detoxification
arXiv:2604.14602v1 Announce Type: new
Abstract: Large language models (LLMs) frequently generate toxic content, posing significant risks for safe deployment. Current mitigation strategies often degrade generation quality or require costly human annota…
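The title names the core mechanism: selecting specific attention heads causally linked to toxic output and intervening on them at inference time. The truncated abstract does not give implementation details, so the following is only a minimal sketch of what a head-level intervention could look like, assuming it amounts to down-scaling the outputs of selected heads; the function name, tensor shapes, and `alpha` parameter are all hypothetical, not from the paper.

```python
import numpy as np

def intervene_on_heads(head_outputs, toxic_heads, alpha=0.0):
    """Scale the outputs of selected attention heads (hypothetical sketch).

    head_outputs: array of shape (num_heads, seq_len, head_dim)
    toxic_heads:  indices of heads identified as causally toxic
    alpha:        scaling factor; 0.0 removes a head's contribution entirely
    """
    out = head_outputs.copy()
    out[toxic_heads] *= alpha  # suppress only the selected heads
    return out

# Toy example: 4 heads, 3 tokens, head dimension 2.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 3, 2))
suppressed = intervene_on_heads(h, toxic_heads=[1, 3], alpha=0.0)
```

In practice such an intervention would be applied inside the model's forward pass (e.g. via a forward hook on the attention module), leaving all other heads untouched so generation quality is preserved.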