cs.AI, cs.LG

Minimizing Collateral Damage in Activation Steering

arXiv:2605.01167v1 Announce Type: cross
Abstract: Activation steering is a method for controlling Large Language Model (LLM) behavior by intervening in its internal representations to increase their alignment with a specific target feature direction. Ho…
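
To make the abstract's definition concrete, the following is a minimal sketch of generic additive activation steering, not the method proposed in this paper. It assumes a hidden activation is a plain list of floats and that the target feature is a single direction vector; the names `unit`, `projection`, `steer`, and `alpha` are hypothetical and chosen only for illustration.

```python
import math


def unit(v):
    """Return v rescaled to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def projection(h, direction):
    """Scalar projection of activation h onto the normalized direction."""
    d = unit(direction)
    return sum(a * b for a, b in zip(h, d))


def steer(h, direction, alpha):
    """Assumed additive intervention: shift h by alpha along the unit direction."""
    d = unit(direction)
    return [a + alpha * b for a, b in zip(h, d)]


# Toy activation and feature direction (illustrative values only).
h = [1.0, 2.0, 3.0]
feature = [0.0, 3.0, 4.0]

steered = steer(h, feature, alpha=2.0)
before = projection(h, feature)
after = projection(steered, feature)
```

Under these assumptions the intervention raises the projection onto the feature direction by exactly `alpha` and leaves the components of `h` orthogonal to that direction unchanged. In a real model the same shift would be applied to a layer's hidden states during the forward pass.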