Disentangling MLP Neuron Weights in Vocabulary Space
arXiv:2604.06005v1 Announce Type: new
Abstract: Interpreting the information encoded in model weights remains a fundamental challenge in mechanistic interpretability. In this work, we introduce ROTATE (Rotation-Optimized Token Alignment in weighT spac…