How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models
arXiv:2604.04385v2
Abstract: This paper identifies a recurring sparse routing mechanism in alignment-trained language models: a gate attention head reads detected content and triggers downstream amplifier heads that boost …