Defragmenting Language Models: An Interpretability-based Approach for Vocabulary Expansion
arXiv:2604.16656v1 Announce Type: new
Abstract: All languages are equal; when it comes to tokenization, some are more equal than others. Tokens are the hidden currency that dictates the cost and latency of access to contemporary LLMs. However, many lan…