I’ve been exploring KV cache optimization beyond Top-K pruning.
Observation: Top-K pruning fails *selectively* - a handful of tokens cause large error spikes while the rest prune cheaply.
So I tried:
- entropy (selection)
- OLS (reconstruction)
- SVD (compression)
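
To make the pipeline concrete, here's a minimal NumPy sketch of how I imagine the three steps compose (this is my illustrative reconstruction, not the blog's actual code - the entropy scoring, OLS setup, and ranks are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy_select(attn, k):
    """Selection: keep the k tokens whose attention columns have the
    lowest entropy (assumed heuristic: 'spiky' columns are the ones
    that cause error spikes when dropped)."""
    p = attn / attn.sum(axis=0, keepdims=True)   # normalize each column
    ent = -(p * np.log(p + 1e-12)).sum(axis=0)   # per-token entropy
    return np.sort(np.argsort(ent)[:k])          # lowest-entropy tokens

def ols_reconstruct(V, keep):
    """Reconstruction: approximate the dropped rows as a least-squares
    (OLS) combination of the kept rows, instead of zeroing them out."""
    dropped = np.setdiff1d(np.arange(V.shape[0]), keep)
    # solve V[keep].T @ W ~= V[dropped].T for the mixing weights W
    W, *_ = np.linalg.lstsq(V[keep].T, V[dropped].T, rcond=None)
    V_hat = V.copy()
    V_hat[dropped] = (V[keep].T @ W).T
    return V_hat

def svd_compress(V, rank):
    """Compression: store the cache as two low-rank SVD factors."""
    U, s, Vt = np.linalg.svd(V, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank]

# toy cache: 64 tokens, head dim 32, 16 queries
V = rng.standard_normal((64, 32))
attn = rng.random((16, 64))
attn /= attn.sum(axis=1, keepdims=True)

keep = entropy_select(attn, k=48)
V_hat = ols_reconstruct(V, keep)
A, B = svd_compress(V_hat, rank=16)
err = np.linalg.norm(V - A @ B) / np.linalg.norm(V)
```

The memory win in this sketch comes from the factors `A` (64x16) and `B` (16x32) replacing the full 64x32 cache; the OLS step is what keeps the dropped tokens from becoming pure error.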
Early results:
- ~3× lower error at low memory
- avoids error spikes
- sometimes even lower memory use at equal error
Blog: https://jchandra.com/posts/hae-ols/
Still a prototype - would love feedback, especially where this might break.