Forget BIT, It is All about TOKEN: Towards Semantic Information Theory for LLMs
arXiv:2511.01202v3
Abstract: Despite the empirical success of LLMs across diverse real-world applications, the prevailing research paradigm remains largely heuristic and experimentally driven, and it depends on enormous computational resources and massive datasets. A rigorous theoretical account of LLMs, their "first principles", remains elusive. To open this black box, this paper develops a *semantic information theory* that draws on statistical physics, continuous signal processing, and classical information theory. The central axiom of the framework is a shift in the basic unit of information: from the classical *BIT*, a microscopic unit that carries no semantic content, to the macroscopic *TOKEN* as the atomic carrier of meaning and reasoning. The resulting framework aims to explain the generative mechanics and emergent causal capabilities of LLMs and to provide a mathematical foundation for future theoretical work and next-generation architectures.