Refusal in open-weights models looks like a sparse gate -> amplifier circuit, and generalizes across 12 models from 6 labs (2B-72B)
Paper: https://arxiv.org/abs/2604.04385

I've been trying to understand where refusal actually lives and how it works mechanistically. Arditi et al. showed that refusal can be steered with a single direction. What I looked at here is the mechanistic question…
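For anyone who hasn't seen the "single direction" idea before, here is a rough toy sketch of what steering and ablating along one direction means. This is not code from the paper. It assumes the common difference-of-means recipe, and it uses synthetic random vectors as stand-ins for residual-stream activations. Every name in it (`planted`, `r_hat`, `D_MODEL`, the helper functions) is hypothetical and only there for illustration.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_of_means_direction(harmful, harmless):
    """Unit vector pointing from the harmless mean to the harmful mean."""
    mu_harmful = mean_vector(harmful)
    mu_harmless = mean_vector(harmless)
    return unit([a - b for a, b in zip(mu_harmful, mu_harmless)])


def ablate(x, direction):
    """Project out the component of x along a unit-norm direction."""
    coeff = dot(x, direction)
    return [xi - coeff * di for xi, di in zip(x, direction)]


def steer(x, direction, alpha):
    """Add alpha times the direction to x."""
    return [xi + alpha * di for xi, di in zip(x, direction)]


# Synthetic stand-ins for activations (an assumption, not real model data):
# "harmful" vectors are Gaussian noise shifted along a planted direction.
random.seed(0)
D_MODEL = 16
planted = unit([random.gauss(0, 1) for _ in range(D_MODEL)])
harmless = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(200)]
harmful = [
    steer([random.gauss(0, 1) for _ in range(D_MODEL)], planted, 4.0)
    for _ in range(200)
]

r_hat = difference_of_means_direction(harmful, harmless)
ablated = [ablate(x, r_hat) for x in harmful]    # component along r_hat removed
steered = [steer(x, r_hat, 4.0) for x in harmless]  # pushed along r_hat

print("cosine(r_hat, planted):", dot(r_hat, planted))
```

In a real model the vectors would be activations collected at some layer and token position, and the ablation or addition would be applied inside the forward pass. Which layer and position to use is a separate choice that this toy does not address.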