Hybrid JIT-CUDA Graph Optimization for Low-Latency Large Language Model Inference
arXiv:2604.23467v1 Announce Type: cross
Abstract: Large Language Models (LLMs) have achieved strong performance across natural language and multimodal tasks, yet their practical deployment remains constrained by inference latency and kernel launch overhead…