MachineLearning

Language-model-based compression for Python source using n-grams + arithmetic coding (~33% better than zlib on Flask) [P]

I’ve been experimenting with language-model-based compression for source code, using a simple n-gram model combined with arithmetic coding. I’ll make a repo soon. The setup is straightforward: tokenize Python source, estimate P(x_t | x_{t−n+1:t−1})…
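The modeling step above (tokenize, then estimate P(x_t | x_{t−n+1:t−1}) from counts) can be sketched roughly like this — a minimal illustration, not the actual implementation: it uses the stdlib `tokenize` module and add-one smoothing, and the class/function names (`NGramModel`, `tokens_of`) are my own placeholders:

```python
import io
import tokenize
from collections import Counter, defaultdict

def tokens_of(source: str):
    """Lex Python source into a list of token strings via the stdlib tokenizer."""
    return [
        tok.string
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.string  # drop zero-width tokens like DEDENT / ENDMARKER
    ]

class NGramModel:
    """Order-n model: estimates P(x_t | x_{t-n+1:t-1}) with add-one smoothing."""

    def __init__(self, n: int = 3):
        self.n = n
        self.counts = defaultdict(Counter)  # context tuple -> next-token counts
        self.vocab = set()

    def train(self, source: str):
        toks = tokens_of(source)
        self.vocab.update(toks)
        for i in range(len(toks)):
            ctx = tuple(toks[max(0, i - self.n + 1):i])
            self.counts[ctx][toks[i]] += 1

    def prob(self, ctx, tok) -> float:
        """Smoothed conditional probability of `tok` given a context tuple."""
        c = self.counts[tuple(ctx)]
        total = sum(c.values())
        v = len(self.vocab)
        return (c[tok] + 1) / (total + v)  # add-one (Laplace) smoothing

model = NGramModel(n=3)
model.train("def f(x):\n    return x + 1\n")
```

These conditional probabilities are exactly what an arithmetic coder consumes: each token narrows the current interval by its P(x_t | context), so a better model means shorter codes.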