cs.CL

Separate Before You Compress: The WWHO Tokenization Architecture

arXiv:2603.25309v1 Announce Type: new
Abstract: Current Large Language Models (LLMs) mostly use Byte Pair Encoding (BPE) based tokenizers, which are highly effective for structurally simple Latin scripts such as English. However, standard BPE tokenizers s…
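
For context on the BPE compression the abstract refers to, here is a minimal sketch of the greedy merge loop at the core of BPE training, in the style of the reference implementation from Sennrich et al. (2016). The toy corpus and the merge budget of 5 are illustrative only, not taken from the paper:

import re
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Merge every standalone occurrence of the pair into one symbol,
    # using lookarounds so we only match whole symbols, not substrings.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words pre-split into characters, with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for _ in range(5):  # illustrative merge budget; real vocabularies use tens of thousands
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # greedily merge the most frequent pair
    vocab = merge_pair(best, vocab)
    print("merge:", best, "->", "".join(best))

Because merges are chosen purely by co-occurrence frequency over whitespace-delimited symbols, this procedure fits scripts like English well, which is the contrast the abstract sets up before the truncation.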