Targeted Neuron Modulation via Contrastive Pair Search
arXiv:2605.12290v1 Announce Type: new
Abstract: Language models are instruction-tuned to refuse harmful requests, but the mechanisms underlying this behavior remain poorly understood. Popular steering methods operate on the residual stream and degrade…