cs.DC, cs.LG

Foundry: Template-Based CUDA Graph Context Materialization for Fast LLM Serving Cold Start

arXiv:2604.06664v1 Announce Type: cross
Abstract: Modern LLM service providers increasingly rely on autoscaling and parallelism reconfiguration to respond to rapidly changing workloads, but cold-start latency remains a major bottleneck. While recent s…