Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models
arXiv:2605.00123v1 Announce Type: new
Abstract: Safety-trained large language models (LLMs) can often be induced to answer harmful requests through jailbreak prompts. Because we lack a robust understanding of why LLMs are susceptible to jailbreaks, fu…