Why LLM Inference Slows Down with Longer Contexts
A systems-level view of how long contexts shift LLM inference from compute-bound to memory-bound

You send a prompt to an LLM, and at first everything feels fast. Short prompts return almost instantly, and even moderately long inputs do not seem to cause …