cs.CL, cs.IR

HeceTokenizer: A Syllable-Based Tokenization Approach for Turkish Retrieval

arXiv:2604.10665v1 Announce Type: new
Abstract: HeceTokenizer is a syllable-based tokenizer for Turkish that exploits the deterministic six-pattern phonological structure of the language to construct a closed, out-of-vocabulary (OOV)-free vocabulary o…
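
To illustrate the kind of deterministic syllabification the abstract refers to, here is a minimal Python sketch. It is not the HeceTokenizer implementation, which this listing does not show. It assumes the six syllable shapes commonly cited for native Turkish words (V, VC, CV, CVC, VCC, CVCC) and applies the textbook boundary rule: every syllable contains exactly one vowel, and when consonants separate two vowels, the last of those consonants opens the next syllable. The names `syllabify`, `cv_shape`, and `SIX_PATTERNS` are hypothetical, and the input is assumed to be a single lowercase word in NFC-normalized form.

```python
# Illustrative sketch only; not the paper's code.
# Assumption: input is one lowercase, NFC-normalized Turkish word.
VOWELS = frozenset("aeıioöuüâîû")

# Assumption: the six shapes commonly cited for native Turkish syllables.
SIX_PATTERNS = frozenset({"V", "VC", "CV", "CVC", "VCC", "CVCC"})


def syllabify(word):
    """Split a word into syllables using the one-vowel-per-syllable rule."""
    vowel_idx = [i for i, ch in enumerate(word) if ch in VOWELS]
    if len(vowel_idx) < 2:
        # Zero or one vowel: the whole word is a single unit.
        return [word] if word else []
    syllables = []
    start = 0
    for prev, nxt in zip(vowel_idx, vowel_idx[1:]):
        # If consonants lie between the two vowels, the last consonant
        # starts the next syllable; adjacent vowels split between them.
        cut = nxt - 1 if nxt - prev > 1 else nxt
        syllables.append(word[start:cut])
        start = cut
    syllables.append(word[start:])
    return syllables


def cv_shape(syllable):
    """Map a syllable to its consonant/vowel shape, e.g. 'tap' -> 'CVC'."""
    return "".join("V" if ch in VOWELS else "C" for ch in syllable)


if __name__ == "__main__":
    for w in ["kitap", "araba", "okul", "elma", "saat"]:
        parts = syllabify(w)
        print(w, parts, [cv_shape(p) for p in parts])
```

Under these assumptions, `syllabify("kitap")` yields `["ki", "tap"]` and `syllabify("araba")` yields `["a", "ra", "ba"]`. Loanwords with initial consonant clusters (for example "tren") produce shapes outside `SIX_PATTERNS` with this simple rule, so a closed vocabulary would need a separate policy for them; how the paper handles such cases is not stated in the text above.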