Gated Subspace Inference for Transformer Acceleration
arXiv:2605.03109v1 Announce Type: cross
Abstract: A method is presented for accelerating inference in transformer language models by exploiting the low effective rank of the token activation manifold at each layer. The method decomposes each activatio…
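To make the phrase "low effective rank" concrete, here is a minimal sketch that estimates the effective dimensionality of a token-activation matrix. It is not the paper's procedure: the participation ratio used below is one common proxy for effective rank, chosen as an assumption, and the names `synthetic_activations` and `participation_ratio` and the synthetic data are hypothetical and purely illustrative.

```python
import random


def gram(X):
    """Return the d x d Gram matrix X^T X for a matrix X given as a list of rows."""
    d = len(X[0])
    G = [[0.0] * d for _ in range(d)]
    for row in X:
        for i in range(d):
            for j in range(d):
                G[i][j] += row[i] * row[j]
    return G


def participation_ratio(X):
    """Effective-rank proxy: (sum of eigenvalues)^2 / (sum of squared eigenvalues).

    The eigenvalues are those of X^T X. Both sums are available without an
    eigendecomposition: the first is the trace of the Gram matrix, the second
    is its squared Frobenius norm.
    """
    G = gram(X)
    trace = sum(G[i][i] for i in range(len(G)))
    frob_sq = sum(v * v for row in G for v in row)
    return trace * trace / frob_sq


def synthetic_activations(n_tokens, d_model, rank, seed=0):
    """Hypothetical stand-in for layer activations: rows confined to a random
    `rank`-dimensional subspace of a `d_model`-dimensional space."""
    rng = random.Random(seed)
    basis = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(rank)]
    X = []
    for _ in range(n_tokens):
        coeffs = [rng.gauss(0.0, 1.0) for _ in range(rank)]
        X.append([sum(c * b[j] for c, b in zip(coeffs, basis)) for j in range(d_model)])
    return X


# Activations confined to a 4-dimensional subspace versus ones spanning all 32 dimensions.
low = synthetic_activations(n_tokens=200, d_model=32, rank=4)
full = synthetic_activations(n_tokens=200, d_model=32, rank=32)

pr_low = participation_ratio(low)
pr_full = participation_ratio(full)
print(f"participation ratio, rank-4 activations:  {pr_low:.2f}")
print(f"participation ratio, rank-32 activations: {pr_full:.2f}")
```

The participation ratio never exceeds the true rank of the matrix, so a value far below the model width suggests that most activation energy sits in a small subspace. Measuring it on real activations would require hooking into a model's layers, which this sketch does not attempt.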