Sam Herring, Jake Naviasky, Karan Malhotra

Targeted Neuron Modulation via Contrastive Pair Search

May 13, 2026

arXiv:2605.12290v1
Abstract: Language models are instruction-tuned to refuse harmful requests, but the mechanisms underlying this behavior remain poorly understood. Popular steering methods operate on the residual stream and degrade…
