Building an AI application is the easy part. Serving it at scale — fast, cheap, and reliably — is where the real engineering begins. A single GPT-4 call takes seconds and costs cents. Multiply that by millions of users and you have a latency problem, a cost problem, and an infrastructure problem all at once.
This guide covers the techniques that production teams use to optimize LLM inference: from fundamental concepts like KV Cache to advanced strategies like prefill-decode disaggregation.
The Anatomy of LLM Inference
LLM inference happens in two distinct phases, each with fundamentally different computational characteristics:
Prefill Phase (Compute-Bound)
All input tokens are processed in parallel through the transformer layers. This phase builds the initial KV Cache — the model's "memory" of the conversation so far. It's heavy on computation (matrix multiplications) and benefits from GPU parallelism.
Decode Phase (Memory-Bound)
Tokens are generated one at a time. Each new token requires reading the entire KV Cache from GPU memory. This phase is bottlenecked by memory bandwidth, not compute. The GPU spends most of its time waiting for data to arrive from HBM (High Bandwidth Memory).
Key insight (2025): The vast majority of production inference time is spent in the decode phase, waiting on memory. This is why most optimizations target memory management, not raw computation.
Core Optimization Techniques
| Technique | What It Does | When to Use |
|---|---|---|
| KV Cache | Stores computed attention key-value pairs; avoids recomputation | Always — it's the baseline |
| PagedAttention | Manages KV Cache in pages like OS virtual memory; eliminates fragmentation | Large batch sizes |
| Continuous Batching | Injects new requests into running inference; GPU never idles | High-concurrency APIs |
| Speculative Decoding | Small draft model proposes tokens, large model verifies in parallel | Chat / agentic apps |
| Quantization | Reduces precision (FP32 → INT8/INT4); smaller model, faster inference | Cost optimization / edge |
| FlashAttention | Fits attention computation in GPU SRAM; reduces memory I/O | Long-context models |
Let's examine each in detail.
KV Cache: The Foundation
Every transformer layer computes Key and Value tensors during attention. Without caching, generating token N requires recomputing K and V for all N−1 previous tokens, which compounds to O(N²) total work over a sequence.
KV Cache stores these tensors so each new token only needs to compute its own K/V pair and attend to the cached values, reducing total K/V computation from O(N²) to O(N).
The catch: KV Cache grows linearly with sequence length and batch size. For a 70B parameter model with 128K context, the KV Cache alone can consume 40+ GB of GPU memory. Managing this memory is what the next several optimizations address.
```
Without KV Cache:
Token 1 → compute K,V for [1]
Token 2 → compute K,V for [1,2]
Token 3 → compute K,V for [1,2,3]
...
Token N → compute K,V for [1..N]                 ← O(N²) total

With KV Cache:
Token 1 → compute K,V for [1], cache it
Token 2 → compute K,V for [2], attend to cache
Token 3 → compute K,V for [3], attend to cache
...
Token N → compute K,V for [N], attend to cache   ← O(N) total
```
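To make the decode step concrete, here is a minimal single-head sketch in PyTorch. This is an illustration, not production code: real kernels are multi-head, batched, and fused, and the projection matrices `W_q`, `W_k`, `W_v` are assumed to be the layer's weights.

```python
import torch

def decode_step(x_new, W_q, W_k, W_v, cache_k, cache_v):
    """One decode step for a single new token. x_new: (1, d_model)."""
    q = x_new @ W_q                      # Q for the new token only
    k = x_new @ W_k                      # K and V for the new token only...
    v = x_new @ W_v
    cache_k = torch.cat([cache_k, k])    # ...appended to the cache
    cache_v = torch.cat([cache_v, v])
    scores = (q @ cache_k.T) / cache_k.shape[-1] ** 0.5  # attend over all cached keys
    out = torch.softmax(scores, dim=-1) @ cache_v
    return out, cache_k, cache_v         # O(1) new K/V computation per token
```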
PagedAttention
Traditional KV Cache implementations allocate contiguous memory blocks per request. If a request reserves 4K tokens of cache but only uses 500, the rest is wasted. With many concurrent requests, this fragmentation can waste 60–80% of the memory reserved for KV Cache.
PagedAttention (introduced by vLLM) borrows from operating system virtual memory management. It splits KV Cache into fixed-size pages (blocks) that can be allocated non-contiguously:
- No pre-allocation waste — pages are allocated on demand
- No fragmentation — pages can be placed anywhere in memory
- Memory sharing — common prefixes (like system prompts) share pages across requests
This single optimization increased throughput by 2-4x compared to naive implementations in the original vLLM paper.
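The paging idea itself is simple enough to sketch with a toy allocator. The class below is illustrative only; vLLM's real block manager additionally handles GPU memory, reference counting, and copy-on-write sharing.

```python
PAGE_SIZE = 16  # tokens per KV page

class PagedKVAllocator:
    """Toy block-table allocator: pages are fixed-size, allocated on demand,
    and need not be contiguous in memory."""

    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))
        self.tables = {}  # request id -> (page indices, token count)

    def append_token(self, req: str):
        pages, n = self.tables.get(req, ([], 0))
        if n % PAGE_SIZE == 0:                   # last page is full:
            pages.append(self.free_pages.pop())  # grab any free page, on demand
        self.tables[req] = (pages, n + 1)

    def release(self, req: str):
        pages, _ = self.tables.pop(req)
        self.free_pages.extend(pages)            # pages return to the shared pool
```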
Batching Strategies
Batching is the highest-impact first step for improving GPU utilization. Without it, a single request uses a fraction of the GPU's parallel compute capacity.
Static Batching
Collect N requests, process them together, return all results at once. Simple but wasteful — all requests must wait for the longest one to finish.
Best for: Offline workloads, embedding generation, batch ETL.
Dynamic Batching
Accumulate requests within a time window (e.g., 50ms), then process the batch. Better utilization than static, but still has padding waste.
Best for: Near-real-time APIs with moderate traffic.
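The accumulate-then-flush pattern looks roughly like this (a hypothetical helper, not any particular server's API):

```python
import time
from queue import Queue, Empty

def dynamic_batches(requests: Queue, max_batch: int = 16, window_s: float = 0.05):
    """Yield batches: wait up to `window_s` (50ms) or until `max_batch` requests arrive."""
    while True:
        batch = [requests.get()]           # block until at least one request arrives
        deadline = time.monotonic() + window_s
        while len(batch) < max_batch:
            timeout = deadline - time.monotonic()
            if timeout <= 0:
                break
            try:
                batch.append(requests.get(timeout=timeout))
            except Empty:
                break
        yield batch                        # process the whole batch in one forward pass
```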
Continuous Batching
The state of the art. New requests are injected into an in-flight batch as existing requests complete:
1. Requests A, B, C start together
2. Request A finishes (short response) → slot freed
3. Request D immediately fills A's slot while B and C continue
4. GPU never idles between requests
Best for: High-concurrency chat APIs, multi-tenant systems.
```python
# vLLM handles continuous batching automatically
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    max_num_batched_tokens=8192,
    max_num_seqs=64,  # max concurrent sequences
)

params = SamplingParams(temperature=0.7, max_tokens=256)

# These are batched automatically with continuous batching
outputs = llm.generate([
    "Explain KV Cache in one paragraph.",
    "What is PagedAttention?",
    "Compare static vs dynamic batching.",
], params)
```
Speculative Decoding
The decode phase is slow because tokens are generated sequentially. Speculative decoding exploits a clever insight: verification is faster than generation.
1. A small draft model (e.g., 1B params) quickly generates K candidate tokens
2. The large target model verifies all K tokens in a single forward pass (parallel)
3. Accepted tokens are kept; rejected tokens are re-generated by the target model
If the draft model achieves a high acceptance rate (70–90%), you get K tokens for roughly the cost of 1-2 target-model forward passes. Effective speedup: 2-3x latency reduction with zero quality loss.
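A greedy-variant sketch of the loop is below. `draft_model` and `target_model` are hypothetical callables that return next-token predictions; production implementations use rejection sampling over probabilities to preserve the target distribution exactly.

```python
def speculative_step(prompt_ids, draft_model, target_model, k=4):
    # 1. Draft model proposes k tokens sequentially (cheap).
    draft = list(prompt_ids)
    for _ in range(k):
        draft.append(draft_model(draft))
    proposed = draft[len(prompt_ids):]

    # 2. Target model scores all k positions in ONE forward pass,
    #    returning its own argmax token at each position.
    verified = target_model(prompt_ids, proposed)

    # 3. Keep proposals up to the first disagreement; the target's token
    #    replaces the first rejected one, so every step yields >= 1 token.
    accepted = []
    for p, t in zip(proposed, verified):
        if p == t:
            accepted.append(p)
        else:
            accepted.append(t)
            break
    return accepted
```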
Parallelism Strategies
When a model doesn't fit on a single GPU, or when you need more throughput, you distribute the workload:
Tensor Parallelism (TP)
Each transformer layer is split across GPUs. Every GPU holds a slice of every layer and they communicate via all-reduce after each operation.
- Latency: Low (all GPUs work in parallel)
- Communication: High (all-reduce per layer)
- Best for: Latency-sensitive, single-node multi-GPU
Pipeline Parallelism (PP)
Different layers go to different GPUs. GPU 1 processes layers 1-20, GPU 2 processes layers 21-40, etc. Micro-batching keeps all stages busy.
- Latency: Higher (sequential pipeline)
- Communication: Low (only between stages)
- Best for: Very large models across multiple nodes
Data Parallelism (DP)
The same model is replicated across GPUs. Each GPU processes different requests independently.
- Latency: Same as single GPU
- Communication: None during inference
- Best for: Throughput scaling with smaller models
Expert Parallelism (MoE)
For Mixture-of-Experts architectures: different expert subnetworks run on different GPUs. Each token is routed to only 2-4 experts out of potentially hundreds, keeping per-token compute constant while scaling total model capacity.
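In practice these strategies are configuration options, not code you write. For example, a tensor-parallel deployment in vLLM is a single argument (the model name here mirrors the earlier examples):

```python
from vllm import LLM

# Shard every layer across 4 GPUs on one node (tensor parallelism)
llm = LLM(
    model="meta-llama/Llama-3-70B-Instruct",
    tensor_parallel_size=4,
)
```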
Prefill-Decode Disaggregation
A key 2025 technique: run prefill and decode on separate GPU pools.
The insight is that prefill and decode have opposite resource profiles:
| Phase | Bottleneck | GPU Utilization | Ideal Hardware |
|---|---|---|---|
| Prefill | Compute (FLOPs) | High | High-compute GPUs |
| Decode | Memory bandwidth | Low | High-bandwidth, memory-dense GPUs |
When mixed on the same GPU, they interfere: prefill spikes steal compute from decode's latency-sensitive token generation. Disaggregation lets each phase run on hardware optimized for its specific bottleneck.
Results: Lower P99 latency, better cost efficiency, and easier independent scaling of each phase.
Cache Management in Real-Time Systems
Prefix Caching
If every request starts with the same system prompt (common in multi-tenant APIs), you're recomputing the same KV Cache prefix millions of times. Prefix caching computes it once and shares it across all requests.
Impact: 30-50% latency reduction for the prefill phase on repeated system prompts.
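In vLLM this is a single flag (defaults vary by version, so treat this as a sketch):

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    enable_prefix_caching=True,  # reuse KV Cache pages for shared prompt prefixes
)
```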
KV Cache Offloading
For long-context applications (100K+ tokens), the KV Cache may not fit in GPU memory. Offloading moves older cache segments to CPU memory or even SSD, bringing them back when needed.
Trade-off: Adds latency when accessing offloaded segments, but enables context lengths that would otherwise be impossible.
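vLLM exposes a related knob: `swap_space` reserves CPU RAM so preempted requests' KV Cache can be swapped out rather than recomputed. This is a narrower mechanism than full SSD-tier offloading, which requires external cache systems:

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    swap_space=8,  # GiB of CPU RAM available for KV Cache swapping
)
```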
Quantization: The Speed/Quality Trade-off
Quantization reduces the numerical precision of model weights, shrinking model size and accelerating inference.
INT8 Quantization
- Memory savings: ~50% (compared to FP16)
- Quality loss: Negligible for most models
- Speed improvement: 1.5-2x
- Best for: Production deployments where quality is non-negotiable
INT4 Quantization (GPTQ, AWQ)
- Memory savings: ~75%
- Quality loss: Small but measurable; varies by model
- Speed improvement: 2-3x
- Best for: Cost-constrained deployments, edge inference
```python
# AWQ quantization with vLLM
from vllm import LLM, SamplingParams

# Load a pre-quantized model
llm = LLM(
    model="TheBloke/Llama-3-8B-Instruct-AWQ",
    quantization="awq",
    dtype="half",
)

params = SamplingParams(temperature=0.7, max_tokens=256)
output = llm.generate("Explain quantization in ML.", params)
```
Critical: Always measure quantization impact on your specific use case with a representative test set before deploying. Some tasks (math, code generation) are more sensitive to precision loss than others.
Inference Servers
| Server | Key Strength | Best For |
|---|---|---|
| vLLM | PagedAttention + continuous batching; the open-source standard | General production serving |
| TGI (HuggingFace) | Easy deployment, Flash Attention | Quick prototyping to production |
| TensorRT-LLM | Maximum NVIDIA GPU optimization | When every millisecond counts |
| Ollama | Simple local setup, one-command install | Local development, edge |
Choosing Your Server
- Starting out? Use vLLM — it has the best balance of performance and ease of use.
- NVIDIA-only infrastructure? TensorRT-LLM squeezes every last drop of performance.
- Need to ship fast? TGI with HuggingFace models is the quickest path to production.
- Local development/testing? Ollama gets you running in seconds.
Production Metrics
You can't optimize what you don't measure. These are the four metrics every LLM serving system should track:
| Metric | What It Measures | Alert Threshold |
|---|---|---|
| TTFT (Time to First Token) | Latency perceived by user | > 500ms for chat |
| TPS (Tokens per Second) | System throughput | Dropping under load |
| Batch Utilization | GPU active time percentage | < 60% sustained |
| Memory Pressure | KV Cache fullness ratio | > 90% |
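TTFT and TPS can be measured client-side against any streaming endpoint. A minimal sketch, where `stream` is any iterable that yields tokens as they arrive:

```python
import time

def measure(stream):
    """Return (TTFT seconds, tokens/sec) for a streaming token iterable."""
    start = time.monotonic()
    ttft = None
    n_tokens = 0
    for _token in stream:
        if ttft is None:
            ttft = time.monotonic() - start   # Time to First Token
        n_tokens += 1
    total = time.monotonic() - start
    tps = n_tokens / total if total > 0 else 0.0  # Tokens per Second
    return ttft, tps
```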
Debugging Performance
When TTFT rises or TPS drops under steady load, check in this order:
1. Memory pressure — Is KV Cache full? Are requests queuing?
2. Batch configuration — Is the batch size too small (GPU underutilized) or too large (memory contention)?
3. Cache misses — Is prefix caching working? Are system prompts being recomputed?
4. Network — For distributed setups, is inter-GPU communication the bottleneck?
Putting It All Together
A production-optimized LLM serving stack in 2025 typically looks like this:
```
Request → Load Balancer
        → Prefix Cache Check
        → Prefill Pool (high-compute GPUs)
        → KV Cache Transfer
        → Decode Pool (high-bandwidth GPUs)
        → Continuous Batching
        → Streaming Response
```
Key decisions in order of impact:
1. Enable KV Cache (always)
2. Use continuous batching (vLLM or TGI)
3. Quantize to INT8 (safe, big speedup)
4. Add prefix caching (if using system prompts)
5. Consider speculative decoding (for latency-critical chat)
6. Disaggregate prefill/decode (for large-scale deployments)
Conclusion
LLM inference optimization is not about a single silver bullet — it's about layering complementary techniques. Start with the fundamentals (KV Cache, batching, quantization), measure everything, and add complexity only when metrics justify it. The goal isn't the fastest possible inference; it's the right balance of latency, throughput, cost, and quality for your specific use case.
