An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference
arXiv:2605.07719v1 Announce Type: cross
Abstract: Long-context inference increasingly operates over CPU-resident KV caches, either because decoding-time KV states exceed GPU memory capacity or because disaggregated prefill-decode systems place KV data…