Pavel Chizhov, Egor Bogomolov, Ivan P. Yamshchikov

From Where Words Come: Efficient Regularization of Code Tokenizers Through Source Attribution

April 16, 2026

arXiv:2604.14053v1
Abstract: Efficiency and safety of Large Language Models (LLMs), among other factors, rely on the quality of tokenization. A good tokenizer not only improves inference speed and language understanding but also pro…
