cs.DC, cs.LG

Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling

arXiv:2604.20819v1 Announce Type: new
Abstract: The scalability of long-context large language models is fundamentally limited by the quadratic memory cost of exact self-attention, which often leads to out-of-memory (OOM) failures on modern hardware. …
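The quadratic memory cost mentioned in the abstract refers to the n×n score matrix that exact self-attention materializes per head. A minimal back-of-the-envelope sketch (the function name and defaults are illustrative, not from the paper):

```python
def attn_scores_bytes(seq_len: int, num_heads: int = 1, dtype_bytes: int = 2) -> int:
    """Bytes needed for one layer's exact attention score matrix.

    Exact attention materializes a (seq_len x seq_len) matrix of scores
    per head, so memory grows quadratically in context length.
    """
    return num_heads * seq_len * seq_len * dtype_bytes

# Doubling the context length quadruples score-matrix memory,
# which is why long contexts hit OOM on fixed-size accelerators.
assert attn_scores_bytes(2 * 4096) == 4 * attn_scores_bytes(4096)
```

For instance, a single fp16 head at a 128k-token context already needs `attn_scores_bytes(131072)` ≈ 32 GiB for the score matrix alone, before activations or KV cache, which motivates scheduling or tiling schemes like the one the title describes.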