cs.CL

Vocab Diet: Reshaping the Vocabulary of LLMs via Vector Arithmetic

arXiv:2510.17001v2 Announce Type: replace
Abstract: Large language models (LLMs) often encode word-form variation (e.g., walk vs. walked) as linear directions in the embedding space. However, standard tokenization algorithms treat such variants as dis…
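
The idea that a word-form variant such as "walked" sits at a roughly constant offset from "walk" can be illustrated with a small sketch. This is not the paper's method or data. The embeddings are synthetic toy vectors, and every name here (`past_direction`, `offset`, `nearest`) is hypothetical. The sketch only assumes that each inflected form equals its base form plus one shared direction and a little noise. It then estimates that direction from two word pairs and applies it to a third base word.

```python
import math
import random

random.seed(0)
DIM = 16  # toy embedding size (assumption, chosen only for illustration)


def rand_vec(scale=1.0):
    """Return a random Gaussian vector of length DIM."""
    return [random.gauss(0.0, scale) for _ in range(DIM)]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def sub(u, v):
    return [a - b for a, b in zip(u, v)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Synthetic embeddings: each past-tense form is its base form plus one
# shared "past tense" direction and small noise. Real LLM embeddings are
# only approximately like this.
past_direction = rand_vec()
pairs = [("walk", "walked"), ("jump", "jumped"), ("talk", "talked")]
embeddings = {}
for base, past in pairs:
    embeddings[base] = rand_vec()
    noisy_shift = add(past_direction, rand_vec(scale=0.01))
    embeddings[past] = add(embeddings[base], noisy_shift)

# Estimate the direction from the first two pairs only.
diffs = [sub(embeddings[past], embeddings[base]) for base, past in pairs[:2]]
offset = [sum(column) / len(diffs) for column in zip(*diffs)]

# Compose a vector for the held-out inflected form by vector arithmetic.
composed = add(embeddings["talk"], offset)


def nearest(vec):
    """Return the vocabulary word whose embedding is most similar to vec."""
    return max(embeddings, key=lambda word: cosine(vec, embeddings[word]))


print(nearest(composed))
```

In this toy setting the composed vector for "talk" plus the estimated offset lands closest to the stored vector for "talked", even though that pair was not used to estimate the offset. Whether and how well this holds for a real model's vocabulary is an empirical question that the sketch does not address.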