I’ve been experimenting with language-model-based compression for source code, using a simple n-gram model combined with arithmetic coding. The setup is straightforward: tokenize Python source, estimate P(x_t | x_{t-n+1:t-1}) with an order-4 n-gram model, and feed those probabilities into an arithmetic coder. The coder converts the predicted distribution into a bitstream, so compression performance is directly tied to how well the model estimates token probabilities. I tested this on the Flask codebase (~575 KB of Python source), with zlib as the baseline.
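Roughly, the modeling side looks something like this (a simplified sketch, not my actual code; the add-one smoothing and the single-pass vocab estimate are just for illustration). It reports the ideal cost, sum of −log2 P(token | context), which is what an arithmetic coder driven by these probabilities would approach:

```python
import io, math, tokenize
from collections import Counter, defaultdict

def ngram_bits(source: str, order: int = 4) -> float:
    """Ideal compressed size (in bits) under an adaptive order-`order` n-gram model."""
    toks = [t.string for t in tokenize.generate_tokens(io.StringIO(source).readline)]
    vocab_size = len(set(toks)) + 1          # crude; a real coder has to share this with the decoder
    counts = defaultdict(Counter)            # context (order-1 previous tokens) -> next-token counts
    context = ("<s>",) * (order - 1)
    bits = 0.0
    for tok in toks:
        c = counts[context]
        p = (c[tok] + 1) / (sum(c.values()) + vocab_size)   # add-one smoothing
        bits += -math.log2(p)                # cost an arithmetic coder would approach
        c[tok] += 1                          # adaptive update (the decoder replays the same updates)
        context = context[1:] + (tok,)
    return bits

if __name__ == "__main__":
    src = open(__file__).read()
    print(f"model-ideal: {ngram_bits(src) / 8:.0f} bytes vs raw: {len(src)} bytes")
```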
The result was roughly a 33% better compression ratio than zlib. This isn’t too surprising in hindsight: general-purpose compressors operate at the byte level, while even a small language model captures strong structure at the token level (e.g., recurring syntactic patterns).

Architecture-wise, the model and tokenizer are implemented in Python, while the arithmetic coder is written in Zig and called via ctypes (a rough sketch of that boundary is at the end of the post). The coder consumes a probability vector (f32[]) per token and maintains the range state and bitstream.

The main limitation is speed: encoding is currently ~1600× slower than zlib (~75s vs ~0.05s). The bottleneck is per-token probability computation in Python, with no caching or batching. This is obviously a very simple model (n-grams with small context), so it’s essentially a lower bound on what a stronger model could achieve. It would be interesting to see how far a small LSTM or transformer could push the compression ratio versus the added compute cost, especially if predictions are batched.

Curious whether others have explored similar “LM + entropy coding” setups specifically for source code, and whether there are practical approaches to making them competitive on speed as well as ratio.
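For reference, the Python/Zig boundary mentioned above looks roughly like this. The exported names and exact signatures below are placeholders for illustration, not the real interface; the point is that Python supplies one f32 probability vector per token and the Zig side owns the range state and bitstream:

```python
import ctypes
import numpy as np

# Placeholder binding: coder_init / coder_encode / coder_finish are illustrative names.
lib = ctypes.CDLL("./libac.so")               # e.g. built with `zig build-lib -dynamic`
lib.coder_init.restype = ctypes.c_void_p
lib.coder_encode.argtypes = [
    ctypes.c_void_p,                          # opaque coder state
    ctypes.POINTER(ctypes.c_float),           # f32[] probability vector for this step
    ctypes.c_size_t,                          # vocabulary size
    ctypes.c_uint32,                          # index of the token actually observed
]
lib.coder_finish.argtypes = [ctypes.c_void_p]

def encode(token_ids, prob_fn):
    """One probability vector per token; the model (prob_fn) stays in Python."""
    state = lib.coder_init()
    for i, tok in enumerate(token_ids):
        probs = prob_fn(token_ids[:i]).astype(np.float32)    # P(. | context) from the n-gram model
        buf = probs.ctypes.data_as(ctypes.POINTER(ctypes.c_float))
        lib.coder_encode(state, buf, probs.size, tok)
    lib.coder_finish(state)                   # flush remaining range bits (bitstream retrieval omitted here)
```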