Vocab Diet: Reshaping the Vocabulary of LLMs via Vector Arithmetic
arXiv:2510.17001v2 Announce Type: replace
Abstract: Large language models (LLMs) often encode word-form variation (e.g., walk vs. walked) as linear directions in the embedding space. However, standard tokenization algorithms treat such variants as dis…
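The abstract's opening claim, that word-form variation appears as a linear direction in embedding space, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the 3-dimensional vectors are invented toy values rather than embeddings from any real model, and the helper names (`EMB`, `sub`, `add`, `cosine`, `nearest`) are hypothetical. It is not the paper's method, only the general idea of composing a base word with an offset vector.

```python
import math

# Toy 3-d embeddings, invented purely for illustration (NOT from a real model).
# The third coordinate is set up to act as a "past tense" axis.
EMB = {
    "walk":   [1.0, 0.0, 0.0],
    "walked": [1.0, 0.0, 1.0],
    "jump":   [0.0, 1.0, 0.0],
    "jumped": [0.0, 1.0, 1.0],
}


def sub(a, b):
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]


def add(a, b):
    """Element-wise sum a + b."""
    return [x + y for x, y in zip(a, b)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def nearest(vec, exclude=()):
    """Return the vocabulary word whose embedding is most similar to vec."""
    candidates = (w for w in EMB if w not in exclude)
    return max(candidates, key=lambda w: cosine(vec, EMB[w]))


# Estimate a word-form direction from a single pair: walked - walk.
past = sub(EMB["walked"], EMB["walk"])

# Compose a base form with that direction and look up the closest word.
composed = add(EMB["jump"], past)
print(nearest(composed, exclude={"jump"}))  # prints "jumped"
```

In this toy setup the inflected form is recovered from a base vector plus a shared offset, so it would not need its own row in the table. Whether and how well this holds in a real model is an empirical question that the sketch does not address.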