PagedAttention, Speculative Decoding, Multi-LoRA Serving, and the Systems That Turn Trained Weights into Production APIs

Six episodes built a model that scores well on every benchmark that matters. The job looks done, until someone actually tries to use it: the first request takes eight seconds, with ten concurrent users latency doubles, and with a hundred concurrent users the system crashes. The problem is infrastructure, not the model. This episode is about the system that sits between trained weights and a user waiting for a response, and why getting it wrong makes everything before it pointless.
Table of Contents
- Why Inference Is a Different Problem Than Training
- KV Cache Memory Management
- Batching, Scheduling, and Request Management
- Speculative Decoding
- Multi-LoRA Serving
- Prefill-Decode Disaggregation
- Quantization for Serving and Structured Output
- The Engine Landscape and Hardware Reality
Chapter 1: Why Inference Is a Different Problem Than Training
Training and inference look like the same computation from a distance. Both run forward passes through the same architecture. Both move tensors through the same attention layers and FFN blocks. But the operational profile is fundamentally different, and the engineering constraints that dominate each phase have almost nothing in common. Training is a throughput game; inference is a latency game. The techniques that maximize one actively hurt the other.
1.1 Prefill vs Decode: Two Phases, Two Bottlenecks
The prefill phase processes the entire input prompt in a single parallel operation. All tokens are available simultaneously, so the GPU can execute dense matrix-matrix multiplications across the full sequence. This phase populates the initial KV cache, the data structure that stores computed key and value vectors so they do not need to be recomputed on subsequent steps. Prefill is compute-bound. The limiting factor is the raw floating-point operations per second [FLOPS] the GPU can deliver. More FLOPS means faster prefill.
Once the first output token is generated, the model enters the decode phase. This phase is strictly autoregressive. Each new token depends on every token before it, so generation proceeds one token at a time. Every single token generation step requires reading the entire model weight matrix and the growing KV cache from High Bandwidth Memory [HBM] to the GPU’s compute units. The arithmetic per byte of data moved is tiny, a single matrix-vector multiply per layer rather than the matrix-matrix multiplies of prefill. The GPU’s compute cores sit partially idle, waiting for data to arrive from memory.
Decode is memory-bandwidth-bound. The speed at which tokens are generated is not determined by how fast the GPU can multiply, but by how fast it can read. This is the single most important fact about LLM inference, and almost every optimization in this article exists because of it.
1.2 The KV Cache as the Central Cost
The KV cache stores the key and value vectors for every token in the current context, across every layer of the model. Without it, each new token would require recomputing K and V for every preceding token from scratch, turning an already sequential process into a quadratically expensive one. The cache eliminates that redundancy.
The cost is memory. KV cache size scales with sequence length, batch size, number of KV heads, head dimension, and layer count. For a model like Llama 3 or Qwen3, a single long-context request can consume gigabytes of VRAM just for its cached key-value pairs. Scale that to dozens or hundreds of concurrent users generating long responses simultaneously, and the KV cache can easily exceed the memory footprint of the model weights themselves.
This creates an inversion that surprises most practitioners coming from training. During training, model weights dominate memory. During inference at scale, the intermediate state per request dominates. The model fits comfortably. The requests do not.
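The scaling is easy to make concrete. A minimal sketch, assuming a Llama-3-8B-style configuration (32 layers, 8 grouped-query KV heads, head dimension 128, FP16 cache); the helper name is ours:

```python
def kv_cache_bytes(seq_len, batch_size, n_layers, n_kv_heads, head_dim,
                   dtype_bytes=2):
    # Two tensors (K and V) per layer, each n_kv_heads * head_dim wide,
    # stored for every token of every request in the batch.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len * batch_size

# One 128k-token request against a Llama-3-8B-style config:
gib = kv_cache_bytes(128_000, 1, n_layers=32, n_kv_heads=8, head_dim=128) / 2**30
print(f"{gib:.1f} GiB per request")  # 15.6 GiB per request
```

One long-context request costs more cache than many models' entire weight delta between precisions, which is exactly the inversion described above.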
1.3 The Memory Bandwidth Wall
The decode phase performs very little arithmetic relative to the amount of data it moves. Each token generation reads the full weight matrix and the accumulated KV cache, performs a relatively small matrix-vector multiplication per layer, and writes the result. The ratio of compute to memory access, called arithmetic intensity, is extremely low.
Modern GPUs are designed for high arithmetic intensity workloads. When arithmetic intensity is low, the compute units finish their work and stall, waiting for the next chunk of data to arrive from memory. The GPU’s theoretical compute capability becomes irrelevant. The binding constraint is how many bytes per second can move from HBM to the streaming multiprocessors.
This is why inference hardware discussions focus on memory bandwidth and HBM capacity rather than peak FLOPS. And it is why every technique in the chapters that follow (PagedAttention, speculative decoding, prefill-decode disaggregation) is ultimately an answer to the same question: how do we either reduce the amount of data that needs to move, or move it faster?
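The bandwidth bound can be turned into a back-of-envelope decode-speed ceiling. A sketch, assuming batch size 1, FP16 weights, and that every step streams all weights plus the cache from HBM; the numbers are illustrative, not a benchmark:

```python
def decode_tokens_per_sec(n_params, kv_bytes, hbm_gb_per_s, dtype_bytes=2):
    # Bandwidth-bound ceiling: one token costs one full read of weights + KV cache.
    bytes_per_step = n_params * dtype_bytes + kv_bytes
    return hbm_gb_per_s * 1e9 / bytes_per_step

# 70B model in FP16 on a ~3350 GB/s HBM part, ignoring the KV cache:
print(f"{decode_tokens_per_sec(70e9, 0, 3350):.0f} tokens/s ceiling")  # 24 tokens/s ceiling
```

The formula also explains why quantization helps decode so directly: halving the bytes moved per step roughly doubles the ceiling.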
Chapter 2: KV Cache Memory Management
The KV cache is the dominant memory cost at inference time, and decode speed is bound by memory bandwidth. The natural next question is how inference systems actually manage that memory. The answer, for the first generation of serving engines, was “badly.”
2.1 The Fragmentation Problem
Early inference engines allocated KV cache memory the same way a naive operating system allocates process memory. Each incoming request received a contiguous block of VRAM sized for the maximum possible sequence length. A system configured for 4,096 tokens would reserve a full 4,096-token KV buffer for every request, regardless of whether the actual response ended up being 50 tokens or 3,000.
The waste was severe and came in two forms. Internal fragmentation occurred when the reserved block was larger than the actual sequence, leaving allocated but unused memory sitting idle for the lifetime of the request. External fragmentation occurred when completed requests freed their blocks, leaving the remaining VRAM scattered in chunks too small to fit new allocations. The combined effect meant that naive systems achieved roughly 20 to 40% memory utilization. On a GPU with 80GB of HBM, the majority of the memory dedicated to KV storage was simply wasted at any given moment.
This is the problem that made high-concurrency serving economically impractical before 2023. The model fit in memory. A handful of requests fit alongside it. But scaling to hundreds of concurrent users was impossible not because of compute limits, but because of how memory was managed.
2.2 PagedAttention: Virtual Memory for GPUs
PagedAttention partitions the KV cache into small, fixed-size, non-contiguous pages, each holding the KV vectors for a fixed number of tokens (16 by default in vLLM). Instead of reserving a monolithic block per request, the system allocates pages dynamically as generation progresses. When a request fills its current page, it receives another. When it finishes, those pages are immediately freed and available for the next request.
By eliminating the contiguity constraint, PagedAttention pushed memory utilization from the 20 to 40% range up to over 96% in optimized deployments. That single change translated into 2 to 4x higher serving throughput compared to standard HuggingFace Transformers implementations running the same model on the same hardware.
PagedAttention is the foundational mechanism behind vLLM and the primary reason it became the default serving engine for production LLM deployments.
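The mechanism can be sketched as a toy page allocator; the class and names here are ours, a simplification of what vLLM's block manager does:

```python
class PagedKVAllocator:
    """Toy PagedAttention-style allocator: KV memory is a pool of fixed-size
    pages, each request holds a page table, freed pages are instantly reusable."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free = list(range(num_pages))
        self.tables = {}  # request id -> (tokens stored, [physical page ids])

    def append_token(self, req_id):
        tokens, pages = self.tables.get(req_id, (0, []))
        if tokens == len(pages) * self.page_size:   # current page is full
            if not self.free:
                raise MemoryError("pool exhausted: queue or preempt the request")
            pages.append(self.free.pop())           # any free page will do
        self.tables[req_id] = (tokens + 1, pages)

    def finish(self, req_id):
        _, pages = self.tables.pop(req_id)
        self.free.extend(pages)                     # no external fragmentation

pool = PagedKVAllocator(num_pages=4, page_size=16)
for _ in range(20):                                  # 20 tokens -> 2 pages
    pool.append_token("req-A")
print(len(pool.tables["req-A"][1]), len(pool.free))  # 2 2
pool.finish("req-A")
print(len(pool.free))                                # 4
```

Because pages need not be contiguous, a request's cache can live anywhere in the pool, which is what removes both forms of fragmentation.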
2.3 RadixAttention: Prefix-Aware Caching
PagedAttention treats every request as independent. That is a correct default, but it misses an obvious redundancy in real production workloads. Multi-turn chatbots reuse the same system prompt for every message. Few-shot pipelines prepend identical examples to every request. Agentic workflows share long instruction prefixes across tool calls. In all of these cases, hundreds of requests are computing and storing identical KV pairs for the same prefix tokens, independently, wasting both compute and memory.
SGLang introduced RadixAttention to exploit this. Instead of managing pages in a flat list, SGLang organizes the entire KV cache in a radix tree. When multiple requests share a common token prefix, the system identifies the overlap and reuses the cached KV pairs directly from the tree. The shared prefix is computed once and stored once. Every subsequent request that starts with the same prefix skips the prefill computation for those tokens entirely.
The throughput gains on prefix-heavy workloads are substantial, up to 6.4x compared to systems that treat each request independently. For multi-turn chat specifically, RadixAttention maintains cache hit rates between 75% and 95%, meaning the vast majority of prefill computation is avoided entirely. When memory pressure increases, SGLang applies a Least Recently Used [LRU] eviction policy, pruning the least valuable branches of the tree first to stay within HBM limits.
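The prefix-matching idea can be sketched with a plain token trie. SGLang's actual structure is a path-compressed radix tree with LRU eviction; the classes and names below are ours:

```python
class Node:
    def __init__(self):
        self.children = {}    # token id -> Node
        self.kv = None        # handle to the KV block computed for this token

def insert(root, tokens):
    node = root
    for t in tokens:
        node = node.children.setdefault(t, Node())
        if node.kv is None:
            node.kv = ("kv", t)          # computed once, stored once

def match_prefix(root, tokens):
    """Length of the longest cached prefix; prefill skips all of it."""
    node, hits = root, 0
    for t in tokens:
        if t not in node.children:
            break
        node, hits = node.children[t], hits + 1
    return hits

root = Node()
system_prompt = [101, 102, 103, 104]
insert(root, system_prompt + [7, 8])                # first chat turn
hits = match_prefix(root, system_prompt + [9, 10])  # second turn, same prompt
print(hits)  # 4: the shared system prompt is served from cache
```

On real workloads the shared prefix is thousands of tokens of system prompt and few-shot examples, which is where the large prefill savings come from.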
2.4 The 80% Utilization Ceiling
Even with PagedAttention or RadixAttention managing memory efficiently, a surprising operational constraint limits how much VRAM can actually be used. Empirical testing across all major inference engines has shown that allocating more than 80 to 90% of a GPU’s VRAM for model weights and KV cache frequently leads to immediate system crashes during startup or under load.
The cause is not GPU memory exhaustion in the usual sense. It is host-side system RAM running out during CUDA Graph compilation. CUDA Graphs reduce CPU-to-GPU launch overhead by pre-recording sequences of GPU operations. But the graph capture process itself requires significant temporary host memory to manage internal dependencies and operation descriptors. When VRAM allocation is pushed too high, the remaining headroom is insufficient for graph compilation metadata, kernel workspace, and dynamic adapter loading.
The practical rule for stable production deployments is to budget 80% of VRAM as usable capacity and treat the remaining 20% as reserved headroom. This is not a theoretical guideline, but an operational constraint that applies across vLLM, SGLang, and LMDeploy alike.
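This budget rule maps directly onto engine configuration (vLLM exposes it as `gpu_memory_utilization`). A sketch of the arithmetic, with illustrative weight sizes:

```python
def kv_budget_gb(total_vram_gb, weights_gb, utilization=0.80):
    # Usable VRAM = weights + KV cache; the remaining ~20% stays free for
    # CUDA Graph capture, kernel workspace, and adapter loading.
    usable = total_vram_gb * utilization
    budget = usable - weights_gb
    if budget <= 0:
        raise ValueError("weights alone exceed the usable budget")
    return budget

# 80GB GPU: a 70B model fits in INT4 (~35GB of weights) but not in FP16 (~140GB).
print(kv_budget_gb(80, 35))  # 29.0 GB left for KV cache
```

The interesting consequence: on the same GPU, quantizing weights does not just make the model fit, it multiplies the KV budget and therefore the concurrency the node can sustain.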
Chapter 3: Batching, Scheduling, and Request Management
The memory management layer is in place. But efficient allocation alone does not guarantee efficient utilization. A GPU with perfectly managed memory can still sit idle if the scheduling logic feeding it requests is poorly designed.
3.1 Static vs Continuous Batching
The first generation of inference servers used static batching. A batch of requests was assembled, all requests in the batch were processed together, and no new request could enter until the entire batch completed. The problem is obvious in hindsight. Requests in a batch rarely finish at the same time. A request generating 20 tokens finishes long before a request generating 500 tokens, but under static batching it holds its slot in the batch until the longest request completes. The GPU continues processing the finished request’s padding tokens, and the queue of waiting requests grows longer for no reason.
The result is inflated tail latency. Short requests are held hostage by long ones, and throughput drops because the batch effectively runs at the speed of its slowest member.
Continuous batching eliminates this by operating at the iteration level rather than the batch level. At every single token generation step, the scheduler evaluates the batch. If a request has produced its end-of-sequence token or hit its length limit, it is immediately removed. If a request is waiting in the queue and there is capacity, it is inserted into the active batch on that same step. The GPU never processes a completed request for even one unnecessary iteration, and new requests begin generation as soon as a slot opens.
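Iteration-level scheduling can be simulated in a few lines. A toy sketch in which each request is just a count of tokens left to generate, and the scheduler admits and retires requests at every step:

```python
import collections

def continuous_batching(requests, max_batch=4):
    """requests: {id: tokens to generate}. Returns the step each one finishes.
    Per iteration: admit from the queue if there is capacity, decode one token
    for every active request, and retire finished requests immediately."""
    queue = collections.deque(requests)
    remaining = dict(requests)
    active, done, step = set(), {}, 0
    while queue or active:
        step += 1
        while queue and len(active) < max_batch:   # admission at iteration level
            active.add(queue.popleft())
        for rid in list(active):
            remaining[rid] -= 1
            if remaining[rid] == 0:                # slot frees on this same step
                active.remove(rid)
                done[rid] = step
    return done

print(continuous_batching({"short": 3, "long": 10, "queued": 2}, max_batch=2))
# {'short': 3, 'queued': 5, 'long': 10}
```

Under static batching the same workload would force "short" and "queued" to wait for "long" to finish; here "short" exits at step 3 and "queued" enters at step 4.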
3.2 Implementation Differences
The scheduling logic itself has a cost. Every iteration, the engine must evaluate which requests are complete, which are waiting, whether memory is available for new admissions, and how to repack the batch. In Python-based engines, this scheduling overhead is small relative to the GPU computation time per step for large models. But as models get faster, hardware gets more capable, and batch sizes grow, the scheduler’s per-iteration cost becomes a meaningful fraction of the total step time.
LMDeploy addresses this through its TurboMind engine, a pure C++ implementation that removes the Python interpreter from the scheduling hot path entirely. By handling batch management, memory allocation decisions, and request lifecycle tracking in compiled code, TurboMind achieves microsecond-level scheduling precision. Benchmarks show that SGLang and LMDeploy can deliver up to 29% higher raw throughput than vLLM on H100 GPUs, and a significant portion of that gap comes from reduced scheduling overhead rather than any difference in the underlying model computation.
This does not mean vLLM is the wrong choice. vLLM’s Python-based scheduler trades raw speed for flexibility, broader hardware support, and a larger ecosystem. The 29% throughput gap matters most in high-concurrency, latency-sensitive deployments on high-end hardware. For many production workloads, the maturity and compatibility advantages of vLLM outweigh the scheduling overhead. The point is that the scheduler is not free, and at scale, how a system manages the space between requests matters as much as how it manages memory.
Chapter 4: Speculative Decoding
Even with memory managed and requests scheduled, one fundamental constraint remains untouched. Autoregressive decoding generates one token per forward pass. Each token requires reading the full model weights and the accumulated KV cache from HBM. For a 70B parameter model, that is tens of gigabytes of data moved per token, regardless of how well memory is managed or how efficiently requests are batched.
4.1 The Core Idea
Most tokens in a generated sequence are predictable. The word after “the capital of France is” is almost certainly “Paris.” The closing bracket after a JSON key-value pair is guaranteed. A large, expensive model does not need to spend a full forward pass confirming what a much smaller, cheaper model could have predicted with high confidence.
Speculative decoding exploits this by splitting generation into two stages. A lightweight draft mechanism proposes multiple tokens ahead in a single burst. The full target model then verifies all proposed tokens in one parallel forward pass. Verification is cheap because the target model can check N draft tokens simultaneously, the same way prefill processes an entire prompt in parallel. If all draft tokens are accepted, the system has generated N tokens for the cost of one target forward pass plus one cheap draft pass. If some tokens are rejected, generation falls back to the target model’s prediction at the first point of disagreement.
The key mathematical property is that the verification step preserves the target model’s output distribution exactly. The accepted tokens are statistically identical to what the target model would have produced on its own. Speculative decoding is not an approximation. It is a scheduling trick that trades cheap compute for fewer expensive memory-bound steps.
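The verification step can be sketched for the greedy-decoding case, where "preserves the distribution" reduces to exact agreement with the target's argmax. (Sampling-based serving uses a rejection-sampling rule that preserves the target distribution exactly; the function name here is ours.)

```python
def verify_greedy(draft_tokens, target_argmax):
    """target_argmax[i] is the target model's top token at position i, given
    the prefix plus draft_tokens[:i], obtained in ONE parallel forward pass.
    Accept drafts until the first disagreement, then take the target's token."""
    accepted = []
    for d, t in zip(draft_tokens, target_argmax):
        if d != t:
            accepted.append(t)       # fall back to the target's own prediction
            return accepted, False
        accepted.append(d)
    return accepted, True            # every draft token accepted

# Draft proposes 4 tokens; the target agrees on the first two.
print(verify_greedy([5, 9, 2, 7], [5, 9, 4, 7]))  # ([5, 9, 4], False)
```

Note that even a full rejection still yields one valid token per cycle (the target's own prediction), so a verification cycle never produces fewer tokens than vanilla decoding.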
4.2 Traditional Draft-Target Architecture
The straightforward implementation uses a separate, smaller model as the draft model. A 1B parameter model drafts for a 70B parameter model, for example. The draft model runs its own autoregressive loop to propose a sequence of candidate tokens, then the target model verifies them all at once.
This works, but it has real limitations. The draft model is a completely independent model with its own weights, its own vocabulary, and its own learned distribution. It has no access to the target model's internal representations. It can only guess what the target model would produce based on its own, much weaker understanding of the sequence. Acceptance rates in practice tend to fall in the 40 to 60% range, meaning roughly half the proposed tokens get rejected and the system falls back to standard decoding at those positions. The resulting speedup typically lands between 2x and 3x over vanilla autoregressive generation.
There is also operational overhead. The draft model needs its own VRAM allocation, its own KV cache, and careful synchronization with the target model’s generation loop.
4.3 EAGLE-3: Integrated Prediction Heads
EAGLE-3 takes a fundamentally different approach. Instead of using a separate model to draft, it attaches a lightweight, autoregressive prediction head directly to the target model’s internal layers. This head is not an independent model. It is a small neural network that consumes the target model’s own hidden states as input.
The difference matters because of what those hidden states contain. A separate draft model sees only the token sequence and must reconstruct the target model’s understanding from scratch. The EAGLE head sees the target model’s actual internal representations, the rich, multi-layer semantic embeddings that encode everything the model has computed about the sequence so far. EAGLE-3 specifically uses multi-layer feature fusion, integrating low-level, mid-level, and high-level embeddings from the target model’s hidden layers.
The result is a substantial improvement in draft quality. The EAGLE-3 paper reports speedups of approximately 3.0x to 6.5x over vanilla autoregressive generation, with a 20 to 40% improvement over EAGLE-2. On a Vicuna 13B model, the measured speedup reached 5.6x over vanilla decoding and 1.8x over the original EAGLE-1. Actual acceptance rates and speedups are task-dependent, with code generation and templated outputs seeing higher gains and mathematical reasoning seeing lower ones. This task sensitivity is inherent to all speculative decoding methods, but EAGLE-3’s access to the target model’s internal features keeps its acceptance rates consistently higher than any independent draft model can achieve.
EAGLE-3 also replaces single-sequence verification with a dynamic candidate tree. Rather than proposing one linear sequence of draft tokens, the head generates a tree of possible continuations and the target model verifies the entire tree in a single pass. This increases the probability that at least one path through the tree matches the target distribution.
The gap between traditional speculative decoding and EAGLE-3 illustrates a broader principle. The most effective inference optimizations are not the ones that work around the model. They are the ones that work with the model’s own internal structure.
Chapter 5: Multi-LoRA Serving
Everything covered so far applies to serving a single model. But practitioners who have followed this series through six episodes of fine-tuning are likely not serving one model. They are serving dozens or hundreds of LoRA variants.
A company with a customer support model, a code review model, a summarization model, and a translation model has four distinct fine-tuned variants. Keeping a full copy of a 7B or 70B model in VRAM for each variant is economically impossible. Multi-LoRA serving exists to solve this.
5.1 The Architecture
Multi-LoRA serving keeps the base model weights frozen in VRAM as a shared resource. When a request arrives tagged for a specific adapter, the engine loads that adapter’s A and B matrices and applies them to the relevant projection layers during the forward pass. The adapter weights are tiny relative to the base model, typically a few hundred megabytes for a 7B model, so swapping between them is fast and memory-efficient.
This architecture allows a single GPU to serve hundreds of distinct fine-tuned variants from a single base model footprint. The memory cost per additional variant is just the adapter weights, not a full model copy.
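One projection layer under this scheme can be sketched directly from the LoRA equation y = Wx + scale * B(Ax); the function name and toy matrices are ours:

```python
import numpy as np

def lora_linear(x, W, adapters, adapter_id=None):
    """Multi-LoRA projection: the base weight W is shared by every request;
    the rank-r update is selected per request by its adapter tag.
    adapters: {id: (A, B, scale)} with A: (r, d_in), B: (d_out, r)."""
    y = W @ x
    if adapter_id is not None:
        A, B, scale = adapters[adapter_id]
        y = y + scale * (B @ (A @ x))   # a few MB of extra weights, not a model copy
    return y

d = 4
W = np.eye(d)                            # stand-in base weight
adapters = {"support": (np.ones((1, d)), np.ones((d, 1)), 0.1)}
x = np.arange(4.0)                       # [0, 1, 2, 3]
print(lora_linear(x, W, adapters))                  # [0. 1. 2. 3.]
print(lora_linear(x, W, adapters, "support"))       # [0.6 1.6 2.6 3.6]
```

In production engines the per-adapter matmuls for a mixed batch are fused into batched GPU kernels, but the memory accounting is exactly this: one W, many small (A, B) pairs.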
5.2 The Inter-LoRA Interference Problem
The architecture is clean in theory. In practice, it creates a dependency problem that current serving engines handle poorly.
The KV cache and the LoRA adapter are coupled. A KV cache entry is only valid for the specific adapter that produced it, because the adapter modifies the Q, K, and V projection matrices. A cached key-value pair computed with Adapter A applied to the attention layers is not a valid cache entry for Adapter B. The two adapters produce different projections, different attention patterns, and different value content.
When the system is under memory pressure and needs to evict an adapter to make room for another, it swaps the adapter out of VRAM. But the KV cache entries that adapter produced often remain resident. When a new request for the evicted adapter arrives, the engine finds those cache entries but cannot use them. The adapter is gone. The cache is “invalid,” occupying memory but serving no purpose.
Experimental data shows that vLLM can reach an invalid KV cache rate of up to 46.5% in high-churn multi-LoRA workloads. Nearly half of the KV cache memory is occupied by entries that cannot be used. This directly reduces the number of concurrent requests the system can serve and inflates Time to First Token (TTFT) for requests whose adapters need to be reloaded.
5.3 Dependency-Aware Caching (FastLibra)
FastLibra, also referred to as ELORA in the research literature, was designed specifically to solve this coupling problem. The core idea is to stop treating adapters and KV caches as separate resources and instead manage them as a single, dependency-linked structure.
FastLibra implements a unified memory pool where both LoRA adapter weights and KV cache blocks exist in a shared tree-based data structure. Each KV cache block is logically linked to the adapter node that produced it. When the system needs to evict memory, it evaluates adapter-cache pairs together rather than independently. If an adapter is evicted, its dependent KV blocks are evicted with it. No orphaned cache entries. No invalid memory.
The eviction decision itself is driven by a unified cost model that estimates the impact of each eviction on the TTFT of future queries. An adapter that is likely to be requested again soon has a higher retention value. An adapter that has not been requested recently and whose KV cache is consuming significant memory gets evicted as a unit.
FastLibra reduces TTFT by an average of 63.4% compared to state-of-the-art baselines and achieves an average peak throughput of 1.7x over vLLM across tested scenarios.
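The evict-as-a-unit policy can be sketched with LRU standing in for FastLibra's TTFT-based cost model; the structure and names below are ours:

```python
def evict_units(pool, bytes_needed):
    """pool: {adapter_id: {'last_used': step, 'adapter_b': n, 'kv_b': n}}.
    Free adapters together with their dependent KV blocks, coldest first,
    until enough memory is reclaimed. No orphaned cache entries remain."""
    freed, victims = 0, []
    for aid in sorted(pool, key=lambda a: pool[a]["last_used"]):
        if freed >= bytes_needed:
            break
        freed += pool[aid]["adapter_b"] + pool[aid]["kv_b"]   # evict as one unit
        victims.append(aid)
    for aid in victims:
        del pool[aid]
    return freed, victims

pool = {"summarize":  {"last_used": 1, "adapter_b": 100, "kv_b": 400},
        "support":    {"last_used": 9, "adapter_b": 100, "kv_b": 300},
        "codereview": {"last_used": 4, "adapter_b": 100, "kv_b": 200}}
print(evict_units(pool, 600))  # (800, ['summarize', 'codereview'])
```

The contrast with the vLLM failure mode described above is the coupling: the adapter and its KV blocks leave memory together, so no entry can become "invalid" while still occupying space.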
Every optimization discussed so far operates within a single GPU. Memory management, scheduling, speculative decoding, and multi-LoRA serving all assume the prefill and decode phases share the same hardware. Chapter 6 examines what happens when that assumption is broken deliberately, separating the two phases onto different physical machines to unlock a new scaling dimension entirely.
Chapter 6: Prefill-Decode Disaggregation
Every system described so far runs both phases of inference on the same GPU. Chapter 1 established that prefill is compute-bound and decode is memory-bandwidth-bound. These two phases have opposite hardware preferences. Prefill wants maximum FLOPS. Decode wants maximum memory bandwidth and HBM capacity. When both run on the same GPU, neither gets exactly what it needs. At moderate concurrency, the compromise is acceptable. At high concurrency with long contexts, it breaks.
6.1 Why Collocated Serving Breaks at Scale
The failure mode is concrete. A user submits a request with a 50,000-token prompt. The prefill phase processes all 50,000 tokens in a single parallel pass, saturating the GPU’s compute units for several seconds. During those seconds, every other request currently in its decode phase stalls. No tokens are generated for any of them. The users behind those requests experience a sudden spike in Inter-Token Latency [ITL], the time between consecutive tokens in their response.
From a throughput perspective, the system might look healthy. Total tokens per second across all requests may still be high. But from a user experience perspective, the service is broken. A chatbot that pauses for three seconds mid-sentence because another user submitted a long document is unusable regardless of what the aggregate metrics say.
The root cause is resource contention. Prefill and decode compete for the same compute units, the same memory bandwidth, and the same scheduling slots. As context lengths grow toward hundreds of thousands of tokens, the duration and severity of these interference events grow with them.
6.2 Separating Phases onto Specialized Hardware
Prefill-decode disaggregation solves this by physically separating the two phases onto different GPU clusters, each optimized for its specific workload profile.
Prefill clusters are built around GPUs with maximum FLOPS, such as H100 or B200 accelerators. Their job is to process incoming prompts and generate the initial KV cache as fast as possible. These nodes use aggressive tensor parallelism, splitting the matrix multiplications of a single forward pass across multiple GPUs to minimize Time to First Token.
Decode clusters are built around GPUs with maximum HBM capacity and memory bandwidth. Their job is to store the KV caches of all active requests and generate tokens with minimal latency. These nodes can use pipeline parallelism or replica scaling to maximize total token throughput across concurrent requests.
A brief note on the parallelism distinction. Tensor parallelism splits a single operation (one matrix multiply) across multiple GPUs, reducing the latency of that operation at the cost of high inter-GPU communication. Pipeline parallelism splits different layers of the model across GPUs, allowing multiple requests to be processed simultaneously at different stages. Tensor parallelism optimizes latency per operation, while pipeline parallelism optimizes throughput across operations. Prefill benefits from the former because it needs one large computation done fast. Decode benefits from the latter because it needs many small computations done concurrently.
The separation allows each cluster to be scaled independently based on workload characteristics. An application with long input prompts but short outputs needs more prefill capacity. An application with short prompts but long generated responses needs more decode capacity. Disaggregation lets each scale to match the actual demand.
6.3 KV Cache Transfer
The engineering challenge in PD disaggregation is the handoff. Once the prefill cluster generates the KV cache for a request, that cache must be transferred to a decode node before token generation can begin. The KV cache for a long-context request can be gigabytes in size. Any latency in this transfer adds directly to TTFT.
High-performance communication protocols make this viable. The NVIDIA Inference Xfer Library enables fast tensor transfers across high-speed InfiniBand or RoCE networks. Remote Direct Memory Access allows the prefill GPU to write directly into the memory of the decode GPU, bypassing the host CPU and operating system network stack entirely. The transfer happens GPU-to-GPU with minimal software overhead.
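The handoff cost is easy to estimate from the cache-size formula. A sketch, assuming a Llama-3-70B-style cache layout (80 layers, 8 KV heads, head dimension 128, FP16) and a 50 GB/s effective link; all numbers are illustrative:

```python
def kv_transfer_ms(seq_len, n_layers, n_kv_heads, head_dim, link_gb_per_s,
                   dtype_bytes=2):
    # Bytes of K and V for one request, divided by link bandwidth -> added TTFT.
    cache_bytes = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len
    return cache_bytes / (link_gb_per_s * 1e9) * 1e3

# 50k-token prompt over a ~400 Gb/s (50 GB/s) RDMA link:
print(f"{kv_transfer_ms(50_000, 80, 8, 128, 50):.0f} ms added to TTFT")  # 328 ms
```

A few hundred milliseconds is tolerable against a multi-second prefill, which is why RDMA-class links make disaggregation viable and commodity Ethernet generally does not.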
For extremely long context tasks, where even the decode cluster’s HBM cannot hold all active KV caches simultaneously, engines like SGLang and vLLM are implementing tiered storage models. “Hot” KV blocks that are actively being decoded stay in HBM. “Warm” blocks from recently paused or lower-priority requests move to host RAM. “Cold” blocks from idle sessions page out to NVMe SSDs via GPUDirect Storage, which provides a direct data path between the GPU and the storage device without CPU involvement.
This tiered approach extends the effective KV cache capacity far beyond what HBM alone can support, at the cost of increased retrieval latency when a cold request resumes. The tradeoff is acceptable for most production workloads because the alternative, rejecting the request entirely due to memory limits, is worse.
Chapter 7: Quantization for Serving and Structured Output
The previous chapters treated the model as a fixed object. Weights are loaded, requests flow through them, and the infrastructure around the model determines performance. But the model is not a fixed object. Decisions made during training, specifically about numerical precision, directly determine what the serving infrastructure can and cannot do.
7.1 Precision Consistency: Training to Serving
Episode 5 of this series covered quantization during training. QLoRA loads base model weights in 4-bit NF4 format, trains LoRA adapter matrices in BF16, and keeps normalization layers in FP32. At the end of training, the adapter matrices are merged into the base weights to produce a single set of model parameters. The precision of those merged weights is the starting point for every serving decision that follows.
The merge step itself is where the first mismatch can occur. The base weights exist in NF4 during training, but the merge operation dequantizes them back to BF16 or FP32 before adding the adapter contribution. The merged output is typically saved in BF16. That BF16 checkpoint is the artifact that gets handed to the serving team. What happens next, whether those weights are served in BF16, re-quantized to FP8, or compressed further to INT4, determines both the accuracy and the throughput of the deployed model.
The critical point is that these decisions are not independent. A model trained with QLoRA at 4-bit precision, merged to BF16, and then re-quantized to INT4 for serving has been through two rounds of precision reduction. The accuracy impact of the second quantization compounds with any information loss from the first. Practitioners who treat training precision and serving precision as separate concerns consistently produce models that perform worse in production than their evaluation numbers predicted.
7.2 FP8 vs INT4 for Production
FP8 has emerged as the standard precision format for production LLM serving. It provides a 2x reduction in memory footprint compared to BF16 while maintaining near-perfect accuracy across most benchmarks.
INT4 quantization provides a 4x memory reduction, which is the difference between needing an 80GB A100 and fitting on consumer-grade hardware for a 70B model. The tradeoff is measurable accuracy loss. General benchmarks show a 1 to 3% drop, but the impact is uneven across tasks. Precise tasks like code generation can suffer up to an 8% accuracy reduction, because code has less redundancy than natural language and small numerical errors in weight values produce syntactically invalid output more frequently.
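The accuracy cost comes from rounding. A minimal sketch of symmetric per-tensor INT4; production schemes like GPTQ and AWQ use per-group scales and calibration data, which cut this error substantially:

```python
import numpy as np

def fake_quant_int4(w):
    # Symmetric INT4: 16 levels, scale chosen so max |w| maps to +/-7.
    scale = np.abs(w).max() / 7
    q = np.clip(np.round(w / scale), -8, 7)
    return q * scale                     # dequantized view of the weights

rng = np.random.default_rng(0)
w = rng.standard_normal(4096)
err = np.abs(w - fake_quant_int4(w)).mean()
print(f"mean |error| = {err:.3f} vs mean |w| = {np.abs(w).mean():.3f}")
```

Every weight carries a per-element perturbation on the order of half a quantization step, and tasks with little redundancy, like code generation, absorb that noise worst.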
7.3 Structured Output with Constrained Decoding
A different kind of serving challenge appears at the output end of the pipeline. As LLMs are integrated into programmatic workflows, the model’s output frequently needs to conform to a strict schema. A JSON object with specific fields. An XML response matching an API contract. A function call with typed arguments.
The standard approach is guided decoding. At each generation step, the engine examines the partially generated output, determines which tokens are valid continuations according to the target schema, and masks the model’s logits to zero out all invalid tokens before sampling.
The problem with naive guided decoding is cost. Determining the valid token set at each step requires running a grammar or schema validator against the current partial output, which is a CPU-bound operation that must complete before the GPU can proceed with the next token.
SGLang addresses this with a compressed finite state machine [FSM] mechanism. When a schema is defined, for example via a Pydantic model, SGLang compiles the structural constraints into an FSM at request initialization time, not at each generation step. The FSM pre-computes which token transitions are valid at each state, and when the grammar allows only a single valid continuation, such as a closing brace or a required field name, the engine skips the generation step entirely and emits the deterministic tokens in a single batch. This “jump-forward” encoding avoids redundant KV cache computations for tokens whose identity was never in question.
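The jump-forward idea can be sketched with a toy state machine. This is a hypothetical miniature, not SGLang’s actual FSM representation: states whose grammar allows exactly one continuation are emitted in a batch without ever invoking the model.

```python
# Toy FSM for the schema {"name": <string>}. Each state maps a legal
# token to its successor state; None marks a state where the model
# must generate freely (the string value).
FSM = {
    "start": {'{': "key"},          # forced: only one legal token
    "key":   {'"name"': "colon"},   # forced
    "colon": {':': "value"},        # forced
    "value": None,                  # free: model generates the string
}

def jump_forward(state: str) -> tuple[list[str], str]:
    """Emit every deterministically forced token from `state` in one
    batch, stopping at the first state with real choices."""
    emitted = []
    while FSM.get(state) is not None and len(FSM[state]) == 1:
        token, nxt = next(iter(FSM[state].items()))
        emitted.append(token)
        state = nxt
    return emitted, state

tokens, state = jump_forward("start")
print(tokens, state)  # ['{', '"name"', ':'] value
```

Three decode steps, each a full forward pass in naive guided decoding, collapse into zero model invocations here; the model is only consulted once the FSM reaches the free "value" state.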
The result is structured output generation that runs up to 3x faster than standard generation followed by parsing and validation.
Chapter 8: The Engine Landscape and Hardware Reality
Seven chapters built the individual components of the inference stack. This chapter answers the question that all of those components lead to. What do practitioners actually deploy, on what hardware, and how do the decisions made during training constrain what is possible at serving time?
8.1 Engine Selection Framework
The inference engine market has consolidated into two tiers, and the choice between them depends on the workload profile rather than any universal ranking.
Tier 1: Data-Center Engines
vLLM is the production standard. It has the broadest model architecture support, the largest ecosystem of integrations, and the most mature operational tooling. Its integration with Kubernetes and the Red Hat AI Inference Server has made it the default for enterprise deployments where stability, compatibility, and the ability to run on diverse hardware matter more than extracting the last percentage of throughput.
SGLang is the stronger choice for workloads that are agentic, multi-turn, prefix-heavy, or require structured output. Its native RadixAttention delivers 75 to 95% cache hit rates on multi-turn chat, maintaining stable throughput of 30 to 31 tokens per second under high concurrency where vLLM’s performance can degrade as cache hit rates fall. Its compressed FSM mechanism makes structured JSON generation up to 3x faster. For teams building complex LLM programs with shared prefixes and structured outputs, SGLang is worth the smaller ecosystem.
Tier 2: Specialized Engines
LMDeploy occupies a narrow but important niche. Its TurboMind C++ engine delivers up to 29% higher raw throughput than vLLM on H100 GPUs. For deployments where maximum tokens per second on high-end NVIDIA hardware is the primary objective, LMDeploy is the fastest option available.
llama.cpp serves a completely different need. It is the standard for edge inference and CPU-bound deployment, running quantized models on consumer hardware without GPU requirements.
8.2 Hardware: Where Bandwidth Meets Economics
The H100 SXM provides 80GB of HBM3 memory with 3.35 TB/s of bandwidth and NVLink 4.0 delivering 900 GB/s of GPU-to-GPU interconnect. The B200 increases HBM capacity to 192GB of HBM3e with 8 TB/s of bandwidth. That is a 2.4x increase in the resource that directly determines decode speed.
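Why bandwidth "directly determines decode speed" follows from a roofline bound: at batch size 1, every decode step must stream the full weight set from HBM once, so tokens per second cannot exceed bandwidth divided by model size. A rough sketch using the figures above (ignoring KV cache traffic and batching, which change the real numbers):

```python
def decode_tokens_per_sec(model_gb: float, bandwidth_tb_s: float) -> float:
    """Memory-bandwidth upper bound on single-stream decode speed:
    each step streams all weights from HBM once."""
    return bandwidth_tb_s * 1000 / model_gb

# 70B model quantized to FP8 (~70 GB of weights):
print(f"H100: {decode_tokens_per_sec(70, 3.35):.0f} tok/s")  # ~48
print(f"B200: {decode_tokens_per_sec(70, 8.0):.0f} tok/s")   # ~114
```

The 2.4x bandwidth jump translates almost linearly into the decode ceiling, which is the sense in which HBM bandwidth, not FLOPS, is the resource that matters in the decode phase.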
For models exceeding a single GPU’s memory, the interconnect speed becomes the binding constraint. Tensor parallelism splits matrix operations across GPUs, and the results must be synchronized after every layer. NVLink’s 900 GB/s bandwidth on H100 is what makes this synchronization fast enough to be practical. The Blackwell generation pushes interconnect capabilities further to support rack-scale expert parallelism for Mixture-of-Experts models like DeepSeek-R1, where different GPUs host different experts and exchange activations at terabit speeds.
The economic dimension compounds at scale. The difference between 12,500 tokens per second and 16,215 tokens per second on a single H100 is a 29% throughput advantage. At data-center scale, that translates into tens of thousands of dollars in monthly GPU cost savings. The engine choice is not just a technical decision. It is a cost structure decision.
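To make the cost claim concrete, here is the arithmetic under assumed inputs: the demand figure and the $2/hour H100 price are illustrative placeholders, not quoted rates.

```python
import math

demand_tok_s = 1_000_000      # assumed aggregate demand across the fleet
gpu_hour_usd = 2.0            # assumed H100 hourly price (illustrative)
hours_per_month = 730

# GPU counts needed at the two per-GPU throughputs from the text:
gpus_slow = math.ceil(demand_tok_s / 12_500)   # 80 GPUs
gpus_fast = math.ceil(demand_tok_s / 16_215)   # 62 GPUs

savings = (gpus_slow - gpus_fast) * gpu_hour_usd * hours_per_month
print(f"{gpus_slow} vs {gpus_fast} GPUs -> ${savings:,.0f}/month saved")
# 80 vs 62 GPUs -> $26,280/month saved
```

Even at these conservative assumptions, the engine choice is worth tens of thousands of dollars a month, which is the "cost structure decision" the text refers to.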
8.3 The Decision That Connects Training to Serving
The most expensive mistakes in inference are not made at deployment time. They are made during training, by teams that treat training and serving as independent problems.
max_seq_length set during fine-tuning determines the maximum context the model is trained to handle. That same length determines the KV cache memory requirement at inference time. A model fine-tuned at 8,192 tokens cannot be served efficiently in an environment sized for 2,048-token KV caches. Conversely, fine-tuning at 32,768 tokens when the production workload never exceeds 4,096 wastes VRAM on KV cache capacity that no request will ever use.
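The VRAM stakes of that sizing decision follow from the standard KV cache formula: two tensors (keys and values) per layer, each of shape kv_heads × head_dim per token. A sketch with a hypothetical 70B-class configuration using grouped-query attention (80 layers, 8 KV heads, head dimension 128, FP16 cache) to show the per-sequence cost:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_elem: int = 2) -> float:
    """Per-sequence KV cache size: keys + values (factor of 2) for
    every layer at every position, in GB."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

# Hypothetical 70B-class GQA config: 80 layers, 8 KV heads, head_dim 128.
print(f"8,192 tokens: {kv_cache_gb(80, 8, 128, 8192):.2f} GB")  # 2.68 GB
print(f"2,048 tokens: {kv_cache_gb(80, 8, 128, 2048):.2f} GB")  # 0.67 GB
```

A 4x difference in max_seq_length is a 4x difference in per-request cache memory, which is exactly the capacity-planning mismatch the paragraph describes.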
The quantization format chosen during training constrains serving precision options. A model trained with QLoRA in NF4, merged to BF16, and then re-quantized to INT4 for serving has been through two lossy compression steps. Teams that plan the training-to-serving precision pipeline as a single decision, choosing the final serving format before training begins, consistently produce models that perform closer to their evaluation numbers in production.
The engine choice itself should inform training decisions. If the target deployment uses SGLang with RadixAttention, the system prompt and few-shot prefix should be designed for maximum reuse across requests, because every shared prefix token is a prefill computation that gets cached and never repeated. If the deployment uses multi-LoRA serving, the adapter rank and target modules should be chosen with awareness of the VRAM budget for adapter loading and the KV cache validity implications covered earlier.
Seven episodes built a model from architecture through training through deployment. The transformer is no longer a black box, the training pipeline is no longer a set of copied hyperparameters, and the inference stack is no longer someone else’s problem. Every component in the system, from the embedding layer to the KV cache eviction policy, has a derivable reason for existing and a concrete consequence when configured incorrectly. That is the foundation everything else builds on.
Connect with me at https://www.linkedin.com/in/suchitra-idumina/
LLM Inference Infrastructure from Scratch: How to Fine-Tune Correctly, Part 7 was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.