Language-model-based compression for Python source using n-grams + arithmetic coding (~33% better than zlib on Flask) [P]

I’ve been experimenting with language-model-based compression for source code, using a simple n-gram model combined with arithmetic coding.

I'll make a repo soon.

The setup is straightforward: tokenize Python source, estimate P(x_t | x_{t−n+1:t−1}) using an order-4 n-gram model, and feed those probabilities into an arithmetic coder. The coder converts the predicted distribution into a bitstream, so compression performance is directly tied to how well the model estimates token probabilities.
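The model side can be sketched in a few lines. This assumes token strings as the vocabulary and add-one smoothing; the post doesn't specify its tokenization details or smoothing scheme, so those parts are guesses:

```python
import io
import tokenize
from collections import Counter, defaultdict

def tokens_of(src: str) -> list[str]:
    """Token strings of Python source via the stdlib tokenizer."""
    return [t.string for t in tokenize.generate_tokens(io.StringIO(src).readline)]

class NGramModel:
    """Order-4 model: P(x_t | previous 3 tokens), with add-one smoothing."""

    def __init__(self, order: int = 4):
        self.order = order
        self.counts = defaultdict(Counter)  # context tuple -> next-token counts
        self.vocab = set()

    def train(self, toks: list[str]) -> None:
        toks = ["<s>"] * (self.order - 1) + toks  # pad so every token has a context
        for i in range(self.order - 1, len(toks)):
            ctx = tuple(toks[i - self.order + 1 : i])
            self.counts[ctx][toks[i]] += 1
            self.vocab.add(toks[i])

    def prob(self, ctx: tuple[str, ...], tok: str) -> float:
        c = self.counts[ctx]
        return (c[tok] + 1) / (sum(c.values()) + len(self.vocab))
```

Smoothing matters here because the arithmetic coder must never receive a zero probability for a token that actually occurs.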

I tested this on the Flask codebase (~575 KB of .py files):

  • n-gram model + arithmetic coding → 101 KB (0.176x, 82.4% reduction)
  • zlib / ZIP → 151 KB (0.263x)
  • lzma / 7z → 152 KB (0.265x)
  • zstd → 147 KB (0.256x)

So roughly ~33% better compression ratio than zlib.
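For reference, the stdlib baselines are easy to reproduce on any checkout (zstd needs a third-party binding, so only zlib and lzma are shown; exact numbers will vary with compression levels and file selection):

```python
import lzma
import zlib
from pathlib import Path

def baseline_ratios(root: str) -> dict[str, float]:
    """Concatenate all .py files under root and report compressed/original
    size for the stdlib codecs at their highest presets."""
    data = b"".join(p.read_bytes() for p in sorted(Path(root).rglob("*.py")))
    return {
        "zlib": len(zlib.compress(data, 9)) / len(data),
        "lzma": len(lzma.compress(data, preset=9)) / len(data),
    }
```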

This isn’t too surprising in hindsight: general-purpose compressors operate at the byte level, while even a small language model captures strong structure at the token level (e.g., syntactic patterns like def → identifier, return → expression, etc.). Arithmetic coding then translates that predictive structure into bits.
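The "predictive structure into bits" step can be illustrated with an exact-arithmetic toy coder. The distribution below is a made-up stand-in for the model's per-token predictions, and real coders use fixed-width integer ranges with renormalization rather than Fractions:

```python
from fractions import Fraction

# Each token narrows the current interval by its probability slice, so
# high-probability tokens cost few bits. Any number inside the final
# interval identifies the whole message.
DIST = {"def": Fraction(1, 2), "return": Fraction(1, 4), "name": Fraction(1, 4)}

def interval(tokens):
    lo, span = Fraction(0), Fraction(1)
    for t in tokens:
        cum = Fraction(0)
        for sym, p in DIST.items():
            if sym == t:
                lo, span = lo + span * cum, span * p
                break
            cum += p
    return lo, lo + span

def decode(x, n):
    out = []
    for _ in range(n):
        cum = Fraction(0)
        for sym, p in DIST.items():
            if cum <= x < cum + p:
                out.append(sym)
                x = (x - cum) / p  # rescale and recurse on the sub-interval
                break
            cum += p
    return out
```

The encoded message needs about −log2(hi − lo) bits, which is exactly the model's log-loss on the sequence, so better predictions shrink the output directly.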

Architecture-wise, the model and tokenizer are implemented in Python, while the arithmetic coder is written in Zig and called via ctypes. The coder consumes a probability vector (f32[]) per token and maintains the range state / bitstream.
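The post doesn't give the Zig library's exported API, so the following is only a sketch of the ctypes call pattern, with hypothetical symbol and file names:

```python
import ctypes

def as_f32_vector(probs: list[float]) -> ctypes.Array:
    """Marshal a Python list of floats into the f32[] the coder consumes."""
    return (ctypes.c_float * len(probs))(*probs)

# With the compiled Zig library, the binding would look roughly like:
#   lib = ctypes.CDLL("./libarith.so")            # assumed filename
#   lib.ac_encode_token.argtypes = [              # assumed symbol name
#       ctypes.POINTER(ctypes.c_float),           # probability vector
#       ctypes.c_size_t,                          # vocabulary size
#       ctypes.c_uint32,                          # observed token index
#   ]
#   lib.ac_encode_token(as_f32_vector(p), len(p), token_id)
```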

The main limitation is speed: encoding is currently ~1600× slower than zlib (~75s vs ~0.05s). The bottleneck is per-token probability computation in Python with no caching or batching.
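One cheap mitigation (my suggestion, not something the post implements) is to memoize each context's normalized distribution, since source code repeats contexts constantly:

```python
from functools import lru_cache

# Toy counts standing in for the trained model's context table.
COUNTS = {("return",): {"x": 3, "None": 1}}

@lru_cache(maxsize=None)
def distribution(ctx: tuple) -> dict:
    """Compute a context's full distribution once; repeat lookups are O(1)."""
    c = COUNTS[ctx]
    total = sum(c.values())
    return {tok: n / total for tok, n in c.items()}
```

Batching is the other lever: a neural model could score many contexts per call, amortizing the Python-side overhead.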

This is obviously a very simple model (n-grams with small context), so it’s essentially a lower bound on what a stronger model could achieve. It would be interesting to see how far a small LSTM or transformer could push compression ratio vs. the added compute cost, especially if predictions are batched.

Curious if others have explored similar “LM + entropy coding” setups specifically for source code, and whether there are practical approaches to making them competitive on speed as well as ratio.

submitted by /u/Equivalent-Gas2856
