MIDUS: Memory-Infused Depth Up-Scaling
arXiv:2512.13751v2
Abstract: Expanding pre-trained language models offers a practical way to increase capacity without training larger models from scratch. Depth Up-Scaling (DUS) does so by duplicating Transformer blocks and inserting them into a pre-trained backbone. Because the duplicated blocks are FFN-heavy, this raises parameter and compute costs, while the added capacity amounts to a block-level dense residual branch. Yet prior work suggests that added capacity need not remain tied to dense FFN branches, and that attention heads play heterogeneous roles, motivating more efficient head-level residual corrections. We propose Memory-Infused Depth Up-Scaling (MIDUS), which replaces the duplicated FFN branches with memory layers and turns added depth into lightweight retrieval-based residual capacity. We introduce a Head-wise Memory Layer (HML), which combines multi-head product-key memory with Head-wise Implicit Value Expansion (HIVE): HML assigns each head a distinct key space, while HIVE realizes head-specific values from a shared latent bank through compact projections (a sketch follows below). Alongside empirical improvements in performance and efficiency, our head-importance and fixed-retrieval structural analyses characterize HML with HIVE as a structurally distinct, head-conditioned alternative to FFN-based residual expansion.
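The abstract's description suggests a concrete structure: per-head product-key lookup over a distinct key space (HML), followed by a compact per-head projection of retrieved shared latents into head-specific values (HIVE). Below is a minimal PyTorch sketch of that structure under our own assumptions; the class and parameter names (HeadwiseMemoryLayer, n_keys, latent_dim, top_k), the chosen dimensions, and details such as query splitting and softmax weighting are illustrative, not the paper's implementation.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeadwiseMemoryLayer(nn.Module):
    """Sketch of HML + HIVE: per-head product-key memory (a distinct key space
    per head) with head-specific values expanded from a shared latent bank."""

    def __init__(self, d_model=512, n_heads=4, n_keys=64, latent_dim=128, top_k=8):
        super().__init__()
        self.n_heads, self.n_keys, self.top_k = n_heads, n_keys, top_k
        self.d_head = d_model // n_heads
        half = self.d_head // 2
        # HML: each head gets its own pair of sub-key tables; their Cartesian
        # product addresses n_keys ** 2 memory slots per head.
        self.sub_keys = nn.Parameter(torch.randn(n_heads, 2, n_keys, half) / math.sqrt(half))
        self.query_proj = nn.Linear(d_model, d_model)
        # HIVE: one latent bank shared across heads; a compact per-head
        # projection realizes head-specific values from the shared latents.
        self.latent_bank = nn.Embedding(n_keys * n_keys, latent_dim)
        self.value_proj = nn.Parameter(torch.randn(n_heads, latent_dim, self.d_head) / math.sqrt(latent_dim))
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                                  # x: (B, T, d_model)
        B, T, _ = x.shape
        q = self.query_proj(x).view(B, T, self.n_heads, self.d_head)
        q1, q2 = q.chunk(2, dim=-1)                        # split query for product keys
        # Score each half-query against that head's own sub-keys: (B, T, H, n_keys).
        s1 = torch.einsum('bthd,hkd->bthk', q1, self.sub_keys[:, 0])
        s2 = torch.einsum('bthd,hkd->bthk', q2, self.sub_keys[:, 1])
        k = self.top_k
        v1, i1 = s1.topk(k, dim=-1)                        # top-k per sub-key table
        v2, i2 = s2.topk(k, dim=-1)
        # Combine the two top-k lists into k*k candidate slots, keep the overall top-k.
        scores = (v1.unsqueeze(-1) + v2.unsqueeze(-2)).flatten(-2)
        slot_ids = (i1.unsqueeze(-1) * self.n_keys + i2.unsqueeze(-2)).flatten(-2)
        best, pos = scores.topk(k, dim=-1)
        slots = slot_ids.gather(-1, pos)                   # (B, T, H, k) memory addresses
        w = F.softmax(best, dim=-1)
        # HIVE lookup: shared latents -> per-head values via value_proj.
        latents = self.latent_bank(slots)                  # (B, T, H, k, latent_dim)
        values = torch.einsum('bthkl,hld->bthkd', latents, self.value_proj)
        out = (w.unsqueeze(-1) * values).sum(dim=-2)       # (B, T, H, d_head)
        return self.out_proj(out.reshape(B, T, -1))

A quick shape check, e.g. HeadwiseMemoryLayer()(torch.randn(2, 16, 512)).shape, gives torch.Size([2, 16, 512]). The point of the HIVE-style factorization in this sketch is that the n_keys ** 2 slots store only latent_dim-sized vectors once, and each head adds just a latent_dim x d_head projection rather than owning a full value table; in MIDUS the layer's output would then serve as the block's residual branch in place of a duplicated FFN.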