cs.CL

Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation

arXiv:2604.27263v1 Announce Type: new
Abstract: Subword tokenization is an essential part of modern large language models (LLMs), yet its specific contributions to training efficiency and model performance remain poorly understood. In this work, we de…