cs.CL, cs.LG

Proxy Compression for Language Modeling

arXiv:2602.04289v2 Announce Type: replace-cross
Abstract: Modern language models are trained almost exclusively on token sequences produced by a fixed tokenizer, an external lossless compressor that often operates over UTF-8 byte sequences, thereby coupling the mod…