cs.AI, cs.LG

Steered LLM Activations are Non-Surjective

arXiv:2604.09839v1 Announce Type: cross
Abstract: Activation steering is a popular white-box control technique that modifies model activations to elicit an abstract change in output behavior. It has also become a standard tool in interpretability (e.g…