How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them
arXiv:2604.17105v1
Abstract: Tokenization is the first step in every language model (LM), yet it never takes the sounds of words into account. We investigate how tokenization influences text-only LMs’ ability to represent phonologic…