cs.AI, cs.CL, cs.LG

A Single Neuron Is Sufficient to Bypass Safety Alignment in Large Language Models

arXiv:2605.08513v1 Announce Type: cross
Abstract: Safety alignment in language models operates through two mechanistically distinct systems: refusal neurons that gate whether harmful knowledge is expressed, and concept neurons that encode the harmful …
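The abstract's claim can be illustrated with a toy sketch (this is not the paper's method, and all names here are hypothetical): one unit encodes the "concept" signal while a separate "refusal" unit gates whether it reaches the output, so zeroing that single gating unit lets the concept through unchanged.

```python
# Toy illustration (not the paper's implementation): a single "refusal" unit
# gates whether a separately encoded "concept" signal is expressed.

def forward(x, ablate_refusal=False):
    # Hypothetical neurons: one carries the content, one fires as a gate.
    concept = x                            # concept neuron: encodes the signal
    refusal = 1.0 if x > 0.5 else 0.0      # refusal neuron: fires on flagged input
    if ablate_refusal:
        refusal = 0.0                      # ablating this one neuron removes the gate
    # The concept is expressed only when the refusal gate is off.
    return concept * (1.0 - refusal)

print(forward(0.9))                       # gated: 0.0
print(forward(0.9, ablate_refusal=True))  # single-neuron ablation: 0.9
```

The point of the sketch is that the two mechanisms are separable: ablation changes only the gate, leaving the encoded concept intact.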