RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding
arXiv:2604.14885v1
Abstract: Autoregressive decoding in Large Language Models (LLMs) generates one token per step, causing high inference latency. Speculative decoding (SD) mitigates this through a guess-and-verify strategy, but exi…
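The guess-and-verify strategy the abstract describes can be illustrated with a minimal sketch. This is not RACER's method, just generic speculative decoding with toy stand-ins: `draft_model` and `target_model` are hypothetical deterministic next-token functions, so verification reduces to an exact-match check, and a mismatch is repaired with the target model's own token so every round still makes progress.

```python
# Toy sketch of speculative (guess-and-verify) decoding.
# Both "models" below are hypothetical stand-ins, not real LLMs.

def draft_model(tokens):
    # Cheap proposer: next token = (last token + 1) mod 10.
    return (tokens[-1] + 1) % 10

def target_model(tokens):
    # Expensive verifier: agrees with the draft except at every
    # 4th context length, where it diverges.
    nxt = (tokens[-1] + 1) % 10
    return nxt if len(tokens) % 4 else (nxt + 5) % 10

def speculative_decode(prompt, steps=8, k=4):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + steps:
        # Guess: draft k tokens autoregressively with the cheap model.
        draft, ctx = [], tokens[:]
        for _ in range(k):
            t = draft_model(ctx)
            draft.append(t)
            ctx.append(t)
        # Verify: accept the longest drafted prefix the target model
        # agrees with; at the first disagreement, emit the target's
        # token instead (guaranteeing at least one token per round).
        accepted, ctx = [], tokens[:]
        for t in draft:
            expected = target_model(ctx)
            if t == expected:
                accepted.append(t)
                ctx.append(t)
            else:
                accepted.append(expected)
                break
        tokens.extend(accepted)
    return tokens[:len(prompt) + steps]

print(speculative_decode([0], steps=8, k=4))
# → [0, 1, 2, 3, 9, 0, 1, 2, 8]
```

One draft pass plus one verification pass here yields up to `k` tokens, which is the source of SD's latency savings: the expensive model runs once per round rather than once per token.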