LocalLLaMA

Monarch v3: 78% Faster LLM Inference with NES-Inspired KV Paging

TL;DR: We implemented NES-inspired memory paging for transformers. On a 1.1B parameter model, inference is now 78% faster (17.01 → 30.42 tok/sec) with nearly zero VRAM overhead. The algor…
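The snippet cuts off before the algorithm is described, but the general idea of paging a KV cache can be sketched: instead of one contiguous buffer per sequence, tokens live in fixed-size pages drawn from a shared pool, with a per-sequence block table mapping logical positions to physical pages. This is a toy illustration of that pattern under my own assumptions (`PagedKVCache`, `BLOCK_SIZE`, etc. are hypothetical names), not Monarch's actual implementation:

```python
BLOCK_SIZE = 16  # tokens per page; hypothetical value, not from the post

class PagedKVCache:
    """Toy paged KV cache: tokens are stored in fixed-size pages
    allocated from a shared pool, so a sequence only consumes the
    pages it actually fills. Illustrative only, not Monarch's code."""

    def __init__(self, num_blocks: int, block_size: int = BLOCK_SIZE):
        self.block_size = block_size
        # Free list of physical page ids in the shared pool.
        self.free_blocks = list(range(num_blocks))
        # Per-sequence block table: logical page index -> physical page id.
        self.block_tables: dict[int, list[int]] = {}
        self.lengths: dict[int, int] = {}

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one new token; returns (page_id, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:  # current page full -> grab a new one
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def free(self, seq_id: int) -> None:
        """Return all pages of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

The "nearly zero VRAM overhead" claim is consistent with this kind of design: the only extra state is the block tables and free list, which are tiny next to the KV tensors themselves.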