Why Do Large Language Models Generate Harmful Content?
arXiv:2604.11663v1 Announce Type: new
Abstract: Large Language Models (LLMs) have been shown to generate harmful content. However, the underlying causes of such behavior remain underexplored. We propose a causal mediation analysis-based approach to i…
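The abstract is truncated, so the authors' exact procedure is not given here. As a rough, hypothetical illustration of the general idea behind causal mediation analysis in neural networks, the sketch below uses a toy two-unit "network" in place of an LLM. The scalar inputs stand in for a base prompt and a counterfactual prompt, and `h1` and `h2` stand in for internal components. The *total effect* is the output change when the input is swapped. The *indirect effect* through one component is the output change when we keep the base input but patch that component to the value it takes under the counterfactual input. The model, the unit names, and the functions `hidden`, `forward`, `total_effect`, and `indirect_effect` are all assumptions made for illustration. This is not the paper's method.

```python
from typing import Dict, Optional


def hidden(x: float) -> Dict[str, float]:
    """Hypothetical toy mediators: two hidden units computed from the input."""
    return {"h1": 2.0 * x, "h2": -x + 1.0}


def forward(x: float, patch: Optional[Dict[str, float]] = None) -> float:
    """Toy output; `patch` overrides named hidden activations (an intervention)."""
    h = hidden(x)
    if patch:
        h.update(patch)  # set mediators to externally chosen values
    return 3.0 * h["h1"] + h["h2"]


def total_effect(x_base: float, x_cf: float) -> float:
    """Change in output when the whole input is swapped to the counterfactual."""
    return forward(x_cf) - forward(x_base)


def indirect_effect(x_base: float, x_cf: float, unit: str) -> float:
    """Change in output when only `unit` takes its counterfactual value."""
    patched = {unit: hidden(x_cf)[unit]}
    return forward(x_base, patched) - forward(x_base)


if __name__ == "__main__":
    print(total_effect(0.0, 1.0))           # → 5.0
    print(indirect_effect(0.0, 1.0, "h1"))  # → 6.0
    print(indirect_effect(0.0, 1.0, "h2"))  # → -1.0
```

Because this toy model is linear, the per-unit indirect effects sum to the total effect (6.0 + (-1.0) = 5.0). In a real LLM, nonlinearities generally break this additivity. Analyses of this kind therefore usually compare how much of the effect each component mediates rather than expecting an exact decomposition.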