Théo Gigant, Bowen Peng, Jeffrey Quesnelle

Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation

Théo Gigant, Bowen Peng, Jeffrey Quesnelle / May 1, 2026

arXiv:2604.27263v1 Announce Type: new
Abstract: Subword tokenization is an essential part of modern large language models (LLMs), yet its specific contributions to training efficiency and model performance remain poorly understood. In this work, we de…
