Emergent Hierarchical Structure in Large Language Models: An Information-Theoretic Framework for Multi-Scale Representation

arXiv:2505.18244v3

Abstract: Why do language models from different architecture families respond so differently to the same perturbation? We argue that the answer is not scale, but \emph{how architecture shapes information compression}. Analyzing eight Transformer models (7B--70B parameters) from the Llama and Qwen families, we show that every model spontaneously develops discrete functional boundaries dividing its layers into Local, Intermediate, and Global processing segments -- yet boundary locations and per-segment brittleness are determined overwhelmingly by architecture family rather than by model size or training configuration. We formalize this regularity as the \textbf{Multi-Scale Probabilistic Generation Theory} (MSPGT), which models an autoregressive Transformer as a Hierarchical Variational Information Bottleneck system and derives a tiered set of falsifiable predictions. Three predictions are strongly confirmed: all eight models exhibit two prominent phase-transition boundaries (P1.1); Llama boundary positions are stable across a $10{\times}$ parameter range ($\mathrm{CV}{=}0.067$--$0.095$) while Qwen positions vary widely ($\mathrm{CV}{=}0.465$--$0.726$), precisely matching our strong- and weak-dominance conditions; and cross-architecture local-segment brittleness spans \textbf{nearly three orders of magnitude} (a $493{\times}$ ratio) -- a gap that architecture family alone predicts and that dwarfs any within-family or scale-driven variation.
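The abstract does not spell out the MSPGT objective, but the variational information bottleneck it builds on trades compression against prediction. A hierarchical stacking over the three segments, sketched here with $Z_s$, $\beta_s$, and the segment index $s$ as assumed notation rather than the paper's own, would read:

\[
\mathcal{L}_{\mathrm{HVIB}} \;=\; \sum_{s=1}^{3} \Big[\, I(Z_{s-1};\, Z_s) \;-\; \beta_s\, I(Z_s;\, Y) \,\Big], \qquad Z_0 = X,
\]

where $Z_1, Z_2, Z_3$ stand for the Local, Intermediate, and Global segment representations, $Y$ is the prediction target (the next token), and each $\beta_s$ sets how much target-relevant information that segment must preserve while compressing its input.

The boundary-stability claim rests on the coefficient of variation, $\mathrm{CV} = \sigma/\mu$, of boundary positions across model sizes. A minimal sketch of that comparison follows; the arrays are illustrative placeholders, not the paper's measured positions:

```python
import numpy as np

def coefficient_of_variation(x: np.ndarray) -> float:
    """CV = sample standard deviation / mean, a scale-free dispersion measure."""
    return x.std(ddof=1) / x.mean()

# Hypothetical Local/Intermediate boundary positions (as a fraction of
# network depth) for four model sizes per family -- illustrative values only.
llama_boundaries = np.array([0.28, 0.30, 0.27, 0.31])  # tight cluster -> low CV
qwen_boundaries  = np.array([0.15, 0.35, 0.22, 0.45])  # wide spread  -> high CV

print(f"Llama CV: {coefficient_of_variation(llama_boundaries):.3f}")
print(f"Qwen  CV: {coefficient_of_variation(qwen_boundaries):.3f}")
```

In the abstract's terms, a low CV across a $10{\times}$ parameter range corresponds to the strong-dominance condition, and a high CV to the weak-dominance condition.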
