Forget BIT, It is All about TOKEN: Towards Semantic Information Theory for LLMs
arXiv:2511.01202v4 Announce Type: replace-cross
Abstract: Despite the empirical successes of Large Language Models (LLMs), the prevailing paradigm is heuristic and experiment-driven, tethered to massive compute and data, while a first-principles theory remains absent. This treatise develops a *Semantic Information Theory* at the confluence of statistical physics, signal processing, and classical information theory, organized around a single paradigm shift: replacing the classical *BIT* -- a microscopic substrate devoid of semantic content -- with the macroscopic *TOKEN* as the atomic carrier of meaning and reasoning. Within this framework we recast attention and the Transformer as energy-based models, and interpret semantic embedding as vectorization on the semantic manifold. Modeling the LLM as a stateful channel with feedback, we adopt *Massey's directed information* as the native causal measure of autoregressive generation, from which we derive a *directed rate-distortion function* for pre-training, a *directed rate-reward function* for RL-based post-training, and a sub-martingale account of inference-time semantic information flow. This machinery makes precise the identification of next-token prediction with *Granger causal inference*, and sharpens the limits of LLM reasoning against *Pearl's Ladder of Causation* -- affirming that *whereas the BIT defined the Information Epoch, the TOKEN will define the AI Epoch.*
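To make the energy-based reading of attention concrete, here is a minimal numerical sketch (the names `attention_as_ebm`, `Q`, `K`, `V` are illustrative assumptions, not the paper's notation): each query-key score is treated as a negative energy, so the softmax weights form a Boltzmann distribution over keys and the attention output is the expected value under it.

```python
import numpy as np

def attention_as_ebm(Q, K, V):
    """Scaled dot-product attention, read as an energy-based model (sketch).

    Energy of query i paired with key j: E[i, j] = -(q_i . k_j) / sqrt(d).
    The attention weights are the Boltzmann distribution
    p(j | i) = exp(-E[i, j]) / Z_i, and the output is the expectation
    of the values under that distribution.
    """
    d = Q.shape[-1]
    neg_E = (Q @ K.T) / np.sqrt(d)          # -E: higher score = lower energy
    m = neg_E.max(axis=-1, keepdims=True)   # shift for numerical stability
    Z = np.exp(neg_E - m).sum(axis=-1, keepdims=True)
    P = np.exp(neg_E - m) / Z               # Gibbs/softmax attention weights
    out = P @ V                             # expected value under p(j | i)
    free_energy = -(m.squeeze(-1) + np.log(Z.squeeze(-1)))  # -log Z per query
    return out, P, free_energy

# Tiny usage check with random data: rows of P are proper distributions.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
out, P, F = attention_as_ebm(Q, K, V)
assert np.allclose(P.sum(axis=-1), 1.0)
```

Nothing here changes standard attention; the value of the rewrite is interpretive: the 1/sqrt(d) factor plays the role of a temperature, and -log Z is a per-query free energy, the kind of quantity an energy-based analysis manipulates.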
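For readers new to the causal measure, Massey's directed information from an input sequence $X^n$ to an output sequence $Y^n$ has the standard definition

$$
I(X^n \to Y^n) \;=\; \sum_{i=1}^{n} I\!\left(X^i;\, Y_i \mid Y^{i-1}\right),
$$

in contrast to the mutual information $I(X^n; Y^n)$, which is symmetric and blind to feedback. Each summand conditions only on the output past $Y^{i-1}$, so the total measures how much the input history causally improves next-output prediction, which is exactly the autoregressive setting of an LLM.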
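The identification with Granger causality can also be seen in a toy linear case (a hypothetical example under simple assumptions, not the paper's construction): $X$ Granger-causes $Y$ when adding $X$'s past to $Y$'s own past improves one-step prediction, the same shape as next-token prediction with versus without added context.

```python
import numpy as np

# Synthetic pair where y's next value genuinely depends on x's past.
rng = np.random.default_rng(1)
T = 2000
x = rng.normal(size=T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.normal()

def one_step_mse(features, target):
    """Least-squares one-step-ahead predictor; returns mean squared residual."""
    coef, *_ = np.linalg.lstsq(features, target, rcond=None)
    return float(np.mean((target - features @ coef) ** 2))

restricted = np.column_stack([np.ones(T - 1), y[:-1]])    # Y's past only
full = np.column_stack([np.ones(T - 1), y[:-1], x[:-1]])  # add X's past
mse_r = one_step_mse(restricted, y[1:])
mse_f = one_step_mse(full, y[1:])
print(f"MSE from Y's past alone: {mse_r:.4f}; with X's past added: {mse_f:.4f}")
# The drop in predictive error is the Granger-causal signal; directed
# information generalizes this comparison beyond the linear/Gaussian case.
```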