Training-Inference Consistent Segmented Execution for Long-Context LLMs
arXiv:2605.11744v1
Abstract: Transformer-based large language models face severe scalability challenges in long-context generation due to the computational and memory costs of full-context attention. Under practical computation and …