Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling
arXiv:2604.19147v1 Announce Type: new
Abstract: Scaling Transformers typically necessitates training larger models from scratch, as standard architectures struggle to expand without discarding learned representations. We identify the primary bottleneck…
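The abstract is cut off before it describes the paper's method, but the growth problem it names, expanding a trained network without discarding learned representations, has a well-known generic form. The sketch below is a minimal Net2Net-style function-preserving width expansion (Chen et al., 2016) for a two-layer MLP in PyTorch; it is a point of reference for the problem statement, not Nexusformer's nonlinear attention expansion. The `widen_mlp` helper and its signature are hypothetical names chosen for this illustration.

```python
# Minimal Net2Net-style width expansion sketch (NOT the paper's method):
# grow the hidden width of fc2(relu(fc1(x))) while leaving its outputs
# numerically unchanged, so training can resume without losing what
# the smaller model learned.
import torch
import torch.nn as nn

@torch.no_grad()
def widen_mlp(fc1: nn.Linear, fc2: nn.Linear, new_hidden: int):
    old_hidden = fc1.out_features
    assert new_hidden >= old_hidden
    # Each new hidden unit copies a randomly chosen existing unit;
    # the first old_hidden entries map to themselves.
    mapping = torch.cat([
        torch.arange(old_hidden),
        torch.randint(0, old_hidden, (new_hidden - old_hidden,)),
    ])
    counts = torch.bincount(mapping, minlength=old_hidden).float()

    # Replicate incoming weights/biases, so copies produce identical
    # activations (ReLU is elementwise, hence preserved under copying).
    wide_fc1 = nn.Linear(fc1.in_features, new_hidden)
    wide_fc1.weight.copy_(fc1.weight[mapping])
    wide_fc1.bias.copy_(fc1.bias[mapping])

    # Divide each replicated outgoing column by its group size, so the
    # sum over copies reproduces the original matmul exactly.
    wide_fc2 = nn.Linear(new_hidden, fc2.out_features)
    wide_fc2.weight.copy_(fc2.weight[:, mapping] / counts[mapping])
    wide_fc2.bias.copy_(fc2.bias)
    return wide_fc1, wide_fc2

if __name__ == "__main__":
    fc1, fc2 = nn.Linear(8, 16), nn.Linear(16, 4)
    x = torch.randn(2, 8)
    before = fc2(torch.relu(fc1(x)))
    wfc1, wfc2 = widen_mlp(fc1, fc2, new_hidden=24)
    after = wfc2(torch.relu(wfc1(x)))
    print(torch.allclose(before, after, atol=1e-6))  # True
```

The design point this illustrates is the one the abstract gestures at: naive expansion (e.g. random initialization of the new units) perturbs the function and discards learned behavior, whereas a function-preserving map lets the wider model start exactly where the smaller one left off.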