RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference
arXiv:2505.02922v3 Announce Type: replace
Abstract: Recent large language models (LLMs) are rapidly extending their context windows, yet inference throughput lags due to increasing GPU memory and bandwidth demands. This is because the key-value (KV) cache …
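The visible fragment attributes the bottleneck to KV-cache growth. A minimal back-of-the-envelope sketch (not from the paper; the Llama-2-7B-like shape and the kv_cache_bytes helper are illustrative assumptions) shows how quickly the cache outgrows a single GPU's memory as the context window lengthens:

    # Hypothetical dense-attention KV-cache sizing; the model shape below is a
    # Llama-2-7B-like configuration chosen only for illustration.

    def kv_cache_bytes(
        num_layers: int = 32,      # assumed: decoder layers
        num_kv_heads: int = 32,    # assumed: KV heads (no grouped-query attention)
        head_dim: int = 128,       # assumed: per-head dimension
        seq_len: int = 128_000,    # context length in tokens
        bytes_per_elem: int = 2,   # fp16/bf16
        batch_size: int = 1,
    ) -> int:
        """Size of the KV cache: 2 tensors (K and V) per layer, each of
        shape [batch, seq_len, num_kv_heads, head_dim]."""
        return (2 * num_layers * batch_size * seq_len
                * num_kv_heads * head_dim * bytes_per_elem)

    if __name__ == "__main__":
        for s in (4_096, 32_768, 128_000, 1_000_000):
            gib = kv_cache_bytes(seq_len=s) / 2**30
            print(f"seq_len={s:>9,d}  KV cache ≈ {gib:8.1f} GiB")

At 128K tokens this hypothetical 7B-class model already needs roughly 62 GiB for the KV cache alone, beyond a single commodity GPU's memory, which is the scaling pressure the abstract points to.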