Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models
arXiv:2605.00123v1 Announce Type: new
Abstract: Safety-trained large language models (LLMs) can often be induced to answer harmful requests through jailbreak prompts. Because we lack a robust understanding of why LLMs are susceptible to jailbreaks, fu…