cs.AI, cs.CL, cs.LG

A Single Neuron Is Sufficient to Bypass Safety Alignment in Large Language Models

arXiv:2605.08513v1 Announce Type: cross
Abstract: Safety alignment in language models operates through two mechanistically distinct systems: refusal neurons that gate whether harmful knowledge is expressed, and concept neurons that encode the harmful …
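The abstract's claim can be illustrated with a toy sketch (this is not the paper's method, and all names here are hypothetical): one unit encodes the "concept" signal while a separate "refusal" unit gates whether it reaches the output, so zeroing that single gating unit lets the concept through unchanged.

```python
# Toy illustration (not the paper's implementation): a single "refusal" unit
# gates whether a separately encoded "concept" signal is expressed.

def forward(x, ablate_refusal=False):
    # Hypothetical neurons: one carries the content, one fires as a gate.
    concept = x                            # concept neuron: encodes the signal
    refusal = 1.0 if x > 0.5 else 0.0      # refusal neuron: fires on flagged input
    if ablate_refusal:
        refusal = 0.0                      # ablating this one neuron removes the gate
    # The concept is expressed only when the refusal gate is off.
    return concept * (1.0 - refusal)

print(forward(0.9))                       # gated: 0.0
print(forward(0.9, ablate_refusal=True))  # single-neuron ablation: 0.9
```

The point of the sketch is that the two mechanisms are separable: ablation changes only the gate, leaving the encoded concept intact.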