attention-mechanism, KV Cache, llm, llm-inference, Machine Learning

Why LLM Inference Slows Down with Longer Contexts

A systems-level view of how long contexts shift LLM inference from compute-bound to memory-bound

You send a prompt to an LLM, and at first everything feels fast. Short prompts return almost instantly, and even moderately long inputs do not seem to cause …