Reading time: 12-15 min · Audio narration available · Prerequisites: 12 · Source: Track A · Gemini
§ 1
What this lesson covers.
This module is one of 42 in the curriculum. Below is the canonical interactive lesson — tabs, cards, and diagrams from the source repo, rendered inside the course shell. An audio narration runs alongside it - the sticky player at the top of the page plays the full Module 08 clip.
If you prefer to read first and play with the demos after, the interactive lesson sits below this section. If you'd rather hear it narrated while you scroll, hit play on the sticky audio bar at the top — or just let it autoplay.
§ 2
The lesson itself.
Interactive lesson · ported from Gemini track
Training happens once — inference runs billions of times
Everything covered so far — pretraining, SFT, RLHF, GRPO — describes how a model is built. But over 90% of the total cost of running an LLM in production is inference: the forward pass that converts your prompt into a response. Every API call, every chatbot reply, every code completion is an inference request. Understanding inference is the difference between a working demo and a scalable product.
The two phases of every LLM response
Phase 1 — Prefill
All prompt tokens are processed in parallel in a single forward pass. The GPU is compute-bound — doing a lot of matrix multiplications simultaneously. Builds the initial KV cache. Fast for short prompts, slow for long ones (10K+ tokens). This is why there is a slight pause before the first word appears.
Phase 2 — Decode
Output tokens are generated one at a time, each requiring a full forward pass. The GPU is memory-bandwidth-bound — loading model weights from memory for each step. Output tokens cost 3–10× more than input tokens. This is why streaming feels different from the initial wait.
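A rough way to see why decode is memory-bandwidth-bound is to divide the size of the weights by the memory bandwidth: every decode step has to read all of the weights at least once. The sketch below is a back-of-the-envelope lower bound, not a benchmark. The function name and the example figures (an 8-billion-parameter model in BF16, 2,000 GB/s of bandwidth) are assumptions chosen for illustration.

```python
def decode_step_lower_bound_ms(n_params: float, bytes_per_param: float,
                               mem_bandwidth_gb_s: float) -> float:
    """Lower bound on one decode step: all weights are read from memory once."""
    weight_bytes = n_params * bytes_per_param
    seconds = weight_bytes / (mem_bandwidth_gb_s * 1e9)
    return seconds * 1e3

# Assumed figures: 8e9 parameters, 2 bytes each (BF16), 2000 GB/s bandwidth.
step_ms = decode_step_lower_bound_ms(8e9, 2, 2000)
tokens_per_second = 1000 / step_ms
print(f"{step_ms:.1f} ms per step, at most {tokens_per_second:.0f} tokens/s per sequence")
```

Under these assumptions the bound is 8 ms per step, or 125 tokens per second for a single sequence, however much arithmetic throughput the GPU has. Batching helps because several sequences can share one read of the weights.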
>90% of LLM cost is inference, not training
3–10× more cost per output token than per input token
1 token generated per decode step, always
The autoregressive bottleneck
Language models generate one token at a time — each new token depends on every previous token. This sequential dependency cannot be parallelised across the output sequence. A 500-token response requires 500 separate decode steps regardless of how many GPUs you have. Inference optimisation is fundamentally about making each decode step as fast and cheap as possible.
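The sequential dependency can be sketched as a plain loop. This is a toy illustration, not a real model: `toy_next_token` is a hypothetical stand-in for a forward pass, and it deliberately depends on every previous token so the steps cannot be reordered or run in parallel.

```python
from typing import Callable, List, Tuple

def generate(prompt: List[int],
             next_token: Callable[[List[int]], int],
             max_new_tokens: int) -> Tuple[List[int], int]:
    """Autoregressive loop: each step needs the full sequence produced so far."""
    tokens = list(prompt)
    decode_steps = 0
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))  # step t cannot start before step t-1 ends
        decode_steps += 1
    return tokens, decode_steps

def toy_next_token(tokens: List[int]) -> int:
    # Hypothetical "model": the next token depends on all previous tokens.
    return (sum(tokens) * 31 + len(tokens)) % 1000

output, steps = generate([5, 17, 42], toy_next_token, max_new_tokens=500)
print(steps)  # one decode step per generated token
```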
The five key techniques
KV cache: don't recompute keys and values for past tokens. Quantization: shrink model weights to 4-bit or 8-bit. Speculative decoding: draft several tokens with a cheap model, then verify them in one parallel pass. Continuous batching: keep the GPU busy across users. FlashAttention: fuse attention computation on-chip. Together these achieve 10–100× better throughput than naive inference.
Efficient Memory Management for Large Language Model Serving with PagedAttention — Kwon et al., UC Berkeley 2023 (SOSP 2023). Introduced vLLM and PagedAttention. Memory waste in KV cache was as high as 80% before this paper. Foundation of modern LLM serving.
Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation — Xia et al. 2023. Demonstrated that a small draft model can generate tokens a large model then verifies in one parallel step — achieving 3–4× speedups.
The single most important inference optimisation
KV cache — never recompute attention for tokens you've already seen
During the decode phase, each new token must attend to every previous token. Without a cache, the model recomputes the Key and Value vectors for all past tokens at every single step, so generating N tokens costs O(N²) K/V projections in total. The KV cache stores these K and V vectors as they are computed, so each decode step only needs to compute new K and V for the one new token. This turns O(N²) projection work into O(N), a fundamental speedup. The attention step itself still reads every cached K and V, which is why cache size and memory bandwidth matter.
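A minimal single-head sketch of the idea, in plain Python so it runs anywhere. All names here (`matvec`, `attend`, the two `decode_step_*` functions) are illustrative, the weights are random, and a real model has many layers and heads. The point is only that both paths give the same attention output while doing very different amounts of K/V work.

```python
import math
import random

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Softmax attention of one query over all keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def decode_step_no_cache(xs, Wq, Wk, Wv, stats):
    # Recompute K and V for every token seen so far.
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    stats["kv_projections"] += len(xs)
    return attend(matvec(Wq, xs[-1]), keys, values)

def decode_step_with_cache(x_new, Wq, Wk, Wv, cache, stats):
    # Compute K and V only for the new token; reuse the rest from the cache.
    cache["k"].append(matvec(Wk, x_new))
    cache["v"].append(matvec(Wv, x_new))
    stats["kv_projections"] += 1
    return attend(matvec(Wq, x_new), cache["k"], cache["v"])

def random_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d = 4
Wq, Wk, Wv = random_matrix(d, d), random_matrix(d, d), random_matrix(d, d)
embeddings = random_matrix(20, d)  # 20 toy token embeddings

no_cache_stats = {"kv_projections": 0}
cache_stats = {"kv_projections": 0}
cache = {"k": [], "v": []}
max_diff = 0.0
for t in range(1, len(embeddings) + 1):
    a = decode_step_no_cache(embeddings[:t], Wq, Wk, Wv, no_cache_stats)
    b = decode_step_with_cache(embeddings[t - 1], Wq, Wk, Wv, cache, cache_stats)
    max_diff = max(max_diff, max(abs(x - y) for x, y in zip(a, b)))

print(no_cache_stats["kv_projections"], cache_stats["kv_projections"], max_diff)
```

Over 20 steps the uncached path performs 1 + 2 + ... + 20 = 210 K/V projections, and the cached path performs 20.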
Interactive animation — watch the KV cache grow token by token
KV cache — each block stores K+V vectors for one token at every layer
Cache memory used: 0 MB (Llama 3 8B: ~0.13 MB per token across all 32 layers)
Without cache vs with cache — attention computation per decode step
No cache (step 100)
Recompute K+V for all 100 past tokens + 1 new = 101 operations
With KV cache (step 100)
Just 1 new token
At step 500: no cache needs 501 operations, KV cache always needs just 1. Linear vs constant per step.
Memory cost of the KV cache
The KV cache is not free. Every token in the context window needs to store K and V vectors for every layer. Llama 3 8B: 32 layers × 2 (K+V) × 8 KV heads × 128 dims × 2 bytes (BF16) = 131,072 bytes ≈ 0.13 MB (128 KiB) per token. A 128K context window therefore needs about 16 GiB of KV cache for a single sequence, roughly as much as the model's BF16 weights, and that cost is paid again for every concurrent request. This is why long-context inference is so expensive.
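Evaluating the per-token formula with the factors listed above is a useful sanity check. The function below is a small sketch; its name and parameter names are made up for this example, and it counts only K and V storage, not any allocator overhead.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int,
                             head_dim: int, bytes_per_value: int) -> int:
    """K and V vectors for one token across all layers."""
    return n_layers * 2 * n_kv_heads * head_dim * bytes_per_value

# Factors from the text: 32 layers, 8 KV heads, 128 dims per head, BF16 (2 bytes).
per_token = kv_cache_bytes_per_token(32, 8, 128, 2)
context_tokens = 128 * 1024
total_bytes = per_token * context_tokens

print(per_token / 1024, "KiB per token")
print(total_bytes / 1024**3, "GiB for one 128K-token sequence")
```

This gives 128 KiB per token and 16 GiB for one full 128K-token sequence.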
PagedAttention & vLLM
Before vLLM (2023), KV cache memory was pre-allocated contiguously — up to 80% was wasted due to fragmentation. PagedAttention divides the cache into fixed-size "pages" (like OS virtual memory), allocating only as needed. This achieved near-zero waste and enabled much larger batch sizes, dramatically cutting cost per token in production.
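The paging idea can be sketched as a tiny block allocator. This is a toy model of the bookkeeping only, not vLLM's implementation: the class name, method names, and the block size of 16 tokens are assumptions for illustration. Each sequence has a block table mapping its logical blocks to physical blocks, and a new block is taken from the free list only when the current one is full.

```python
class PagedKVAllocator:
    """Toy block-table allocator: KV memory is handed out one fixed-size block at a time."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id: str) -> None:
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:  # no block yet, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id: str) -> None:
        # Finished sequences return their blocks to the pool immediately.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_slots(self) -> int:
        # Only the unfilled tail of each sequence's last block is wasted.
        return sum(len(table) * self.block_size - self.seq_lens[seq_id]
                   for seq_id, table in self.block_tables.items())

alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    alloc.append_token("a")
for _ in range(5):
    alloc.append_token("b")

blocks_a = len(alloc.block_tables["a"])
blocks_b = len(alloc.block_tables["b"])
wasted = alloc.wasted_slots()
free_before = len(alloc.free_blocks)
alloc.free("a")
free_after = len(alloc.free_blocks)
print(blocks_a, blocks_b, wasted, free_before, free_after)
```

Waste is bounded by less than one block per sequence, whereas a contiguous allocation sized for the maximum possible length reserves the whole range up front.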
Efficient Memory Management for Large Language Model Serving with PagedAttention — Kwon et al., UC Berkeley / SOSP 2023. Introduced PagedAttention and vLLM. The paper reports 2–4× higher throughput than FasterTransformer and Orca at similar latency; the vLLM project's launch benchmarks reported up to 24× over HuggingFace Transformers.
Quantization — shrinking the model without breaking it
Models trained in 16-bit can be served in 4-bit — with minimal quality loss
A model's weights are floating-point numbers. During training they use BF16 or FP16 (16 bits each). Quantization converts these to lower-precision formats: INT8 (8 bits) or INT4 (4 bits). The result: half or quarter the memory, and faster matrix operations on hardware that supports lower-precision arithmetic. The decode phase is memory-bandwidth-bound — reading weights is the bottleneck — so 4-bit weights can be read up to 4× faster than 16-bit.
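The simplest form of this conversion is symmetric "absmax" quantization: scale the weights so the largest magnitude maps to the largest integer, round, and multiply back by the scale when the weight is used. The sketch below applies one scale to a whole list of weights. That is a simplifying assumption, since production methods use per-channel or per-group scales, and the function names are illustrative.

```python
def quantize_symmetric(weights, bits):
    """Map floats to signed integers in [-qmax, qmax] with one shared scale.

    Assumes at least one weight is non-zero.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

def max_abs_error(weights, bits):
    q, scale = quantize_symmetric(weights, bits)
    restored = dequantize(q, scale)
    return max(abs(a - b) for a, b in zip(weights, restored)), scale

weights = [0.12, -0.5, 0.33, 0.9, -0.07]
err8, scale8 = max_abs_error(weights, 8)
err4, scale4 = max_abs_error(weights, 4)
print(f"INT8 max error {err8:.4f}, INT4 max error {err4:.4f}")
```

The rounding error is at most half a quantization step, so it grows as the number of bits shrinks. Methods such as GPTQ and AWQ exist to keep that error from damaging model output at 4 bits.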
Visual: how many bits store one number — click to compare
Memory savings across precision formats — Llama 3 70B
FP32 (full precision): 280 GB, needs 4+ A100 GPUs
BF16 (training default): 140 GB, 2 A100s
INT8: 70 GB, 1 A100
INT4 (4-bit): 35 GB, fits one 48 GB workstation GPU or two 24 GB consumer GPUs
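These figures follow directly from parameter count × bits per weight. A quick sketch, assuming exactly 70 billion parameters and ignoring the small overhead that real quantized formats add for scales and zero points:

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Weight storage in GB (10^9 bytes), ignoring quantization metadata."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 70e9  # assumed parameter count for a 70B model
footprints = {name: weight_memory_gb(N_PARAMS, bits)
              for name, bits in [("FP32", 32), ("BF16", 16), ("INT8", 8), ("INT4", 4)]}
for name, gb in footprints.items():
    print(f"{name}: {gb:.0f} GB")
```

The KV cache and activations come on top of these numbers, so the GPU counts above are lower bounds.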
GPTQ — post-training quantisation
GPTQ quantizes a model after training with no retraining required. It uses a second-order method to compensate for quantization error layer by layer, finding the INT4 weights that minimize the change in output. Achieves near-FP16 quality at INT4 precision. A common choice for running large models on consumer hardware, supported by AutoGPTQ, ExLlama, and vLLM. (llama.cpp's GGUF files use their own block-wise quantization formats rather than GPTQ.)
AWQ — activation-aware quantisation
AWQ (Lin et al. 2024) observes that not all weights are equally important: a small fraction of weights (those connected to large activation values) contribute disproportionately to the model's output. AWQ identifies these "salient" weights and protects them by rescaling their channels before quantization, rather than by keeping them in higher precision, while compressing the rest. Achieves better quality than GPTQ on most tasks at the same bit width. Supported out of the box in vLLM.
The outlier problem — why quantization is hard
Activation outliers are the main challenge
Weights are easy to quantize — they are fixed after training and can be analyzed carefully. Activations are harder — they change with every input and often contain a few very large "outlier" values. A single outlier at value 1,000 in an INT8 range forces the entire activation to be scaled to fit, wasting most of the precision on near-zero values. LLM.int8() (Dettmers 2022) solved this by separating outlier dimensions and keeping them in FP16 while quantizing the rest to INT8 — the first method to achieve good INT8 inference quality on large models.
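The effect of a single outlier, and the mixed-precision fix, can be sketched in a few lines. This is a simplified, per-value illustration under stated assumptions: LLM.int8() separates whole outlier feature dimensions of a matrix rather than individual values, and the threshold of 6.0 and all function names here are choices made for this example.

```python
def absmax_int8_roundtrip(xs):
    """Quantize to INT8 with one absmax scale, then dequantize."""
    scale = max(abs(x) for x in xs) / 127
    return [round(x / scale) * scale for x in xs]

def mixed_precision_roundtrip(xs, threshold=6.0):
    """Keep large-magnitude values in full precision, quantize the rest to INT8.

    Assumes at least one value is below the threshold.
    """
    regular = [x for x in xs if abs(x) < threshold]
    restored = iter(absmax_int8_roundtrip(regular))
    return [x if abs(x) >= threshold else next(restored) for x in xs]

activations = [0.5, -1.2, 0.8, 1000.0, 2.1, -0.3]  # one outlier at 1,000

naive = absmax_int8_roundtrip(activations)
mixed = mixed_precision_roundtrip(activations)

naive_error = max(abs(a - b) for a, b in zip(activations, naive))
mixed_error = max(abs(a - b) for a, b in zip(activations, mixed))
print("naive:", naive)
print("mixed:", mixed)
print(f"max error: naive {naive_error:.3f}, mixed {mixed_error:.4f}")
```

With the outlier in the range, one INT8 step is about 7.9, so every ordinary value rounds to zero. With the outlier held out, the step shrinks to about 0.017 and the ordinary values survive.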
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers — Frantar et al. 2022 (arXiv:2210.17323). 4-bit quantization with near-FP16 quality. Widely used for consumer LLM serving.
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration — Lin et al. 2024 (arXiv:2306.00978). Identifies salient weights via activation analysis. Better quality than GPTQ at same bit width.
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale — Dettmers et al. 2022 (arXiv:2208.07339). First practical INT8 inference for large models. Introduced mixed-precision decomposition for outliers.
Speculative decoding — generate multiple tokens for the cost of one
A small draft model guesses ahead — the large model verifies in parallel
The core problem with autoregressive decoding: the large model must run once per output token. Speculative decoding sidesteps this by using a small, fast draft model to generate k candidate tokens (typically 3–7). The large "target" model then verifies all k tokens in a single parallel forward pass, which takes about as long as generating one token normally because decode is limited by reading the weights, not by arithmetic. If every draft token is accepted, you get k tokens (plus one bonus token from the target) for roughly the cost of 1. If not, you keep the accepted prefix and take the target model's own token at the first mismatch.
Step-by-step: how one speculative decoding round works
Context (already generated)
Draft model generates k=4 candidate tokens, one cheap step at a time…
Large model verifies all 4 in one parallel pass…
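One round of draft-then-verify can be sketched with toy "models" that are plain functions from a token list to the next token. This is a greedy-decoding sketch under stated assumptions: `target_next` and `draft_next` are hypothetical stand-ins, the draft is rigged to be wrong at every fourth position, and the single parallel verification pass is simulated by a loop that is counted as one target pass. Real implementations verify sampled tokens with a rejection rule rather than an exact match.

```python
def target_next(seq):
    # Hypothetical large model (deterministic toy rule).
    return (seq[-1] * 7 + len(seq)) % 50

def draft_next(seq):
    # Hypothetical small model: agrees with the target except at every 4th position.
    t = target_next(seq)
    return t if len(seq) % 4 else (t + 1) % 50

def greedy_generate(prompt, max_new_tokens):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens

def speculative_generate(prompt, max_new_tokens, k=4):
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Draft k candidate tokens with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. Verify: one target pass scores all k+1 positions (simulated here).
        target_passes += 1
        accepted = []
        for i in range(k + 1):
            expected = target_next(tokens + draft[:i])
            accepted.append(expected)
            if i == k or draft[i] != expected:
                break  # first mismatch (correction) or bonus token after k matches
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new_tokens], target_passes

prompt = [1, 2, 3]
baseline = greedy_generate(prompt, 40)
speculative, target_passes = speculative_generate(prompt, 40, k=4)
print(speculative == baseline, target_passes)
```

Every token that is kept is one the target model itself would have produced, so the output matches plain greedy decoding while the target runs far fewer than 40 times.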
Why this works
Verifying k tokens in the large model takes roughly as long as generating 1 token: the pass processes k+1 positions instead of 1, but decode is memory-bandwidth-bound, so reading the weights once dominates the cost. On tasks where the draft model is accurate (code completion, factual text), acceptance rates of 70–90% are common, meaning you effectively get 3–5 tokens per large model step instead of 1. Speedups of 2–4× are typical for coding tasks.
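The link between acceptance rate and tokens per step can be made concrete with the expected-value formula from Leviathan et al.: if each draft token is accepted independently with probability α, a round with k draft tokens yields (1 − α^(k+1)) / (1 − α) tokens on average. The independence assumption is an idealisation, and the function name below is illustrative.

```python
def expected_tokens_per_target_pass(acceptance_rate: float, k: int) -> float:
    """Expected tokens produced per target forward pass with k draft tokens,
    assuming each draft token is accepted independently."""
    if acceptance_rate >= 1.0:
        return float(k + 1)  # all k drafts accepted, plus the bonus token
    return (1 - acceptance_rate ** (k + 1)) / (1 - acceptance_rate)

for alpha in (0.7, 0.8, 0.9):
    for k in (4, 7):
        tokens = expected_tokens_per_target_pass(alpha, k)
        print(f"alpha={alpha}, k={k}: {tokens:.2f} tokens per target pass")
```

At α = 0.8 and k = 4 this gives about 3.36 tokens per target pass. The wall-clock speedup is lower than that ratio because the draft model's own steps also take time.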
Draft model choices
Separate small model: A 1–7B model drafts for a 70B model. Works well for general text.
EAGLE / Medusa: A lightweight head on the target model itself drafts future tokens using the hidden states, so no separate model is needed.
n-gram lookup: Ultra-fast draft using repeated phrases from context. Great for code with repetitive patterns. Available as a built-in option in vLLM.
2–4× typical speedup for coding tasks
70–90% token acceptance rate on domain tasks
k=4 typical draft length (range 3–7 tokens)
Fast Inference from Transformers via Speculative Decoding — Leviathan et al., Google 2023 (arXiv:2211.17192). Proved that a draft+verify scheme produces exactly the same output distribution as the target model, not an approximation of it. Reports a 2–3× speedup on T5-XXL.
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty — Li et al. 2024 (arXiv:2401.15077). Auto-regressive draft head trained on the target model's hidden states. 3× speedup without a separate draft model.
Serving LLMs at scale
Continuous batching — keep the GPU busy across hundreds of users
A GPU is only efficient when it is processing many requests simultaneously. Early LLM servers used "static batching" — wait for a full batch of N requests, process them all together, then wait again. The problem: requests finish at different times, leaving the GPU idle while waiting for the slowest request. Continuous batching (also called iteration-level batching or in-flight batching) inserts new requests as soon as any sequence completes — keeping GPU utilisation near 100% continuously.
Static batching vs continuous batching — GPU utilisation over time
Static batching (GPU idles while waiting for slow requests to finish): Req A, B, C processing → GPU idle → Next batch → GPU idle
Continuous batching (new requests fill slots as soon as any request finishes): A, B, D (new), E, F
GPU stays ≥95% utilised. Continuous batching cuts per-token cost by ~85% at 32 concurrent requests.
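The difference between the two schedulers can be sketched with a small simulation that counts decode iterations. It is a toy model under stated assumptions: every request needs a fixed number of decode steps, each batch slot advances one token per iteration, and prefill cost is ignored. The request lengths and function names are made up for illustration.

```python
import heapq

def static_batching_steps(lengths, slots):
    """Each batch runs until its slowest request finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_batching_steps(lengths, slots):
    """A waiting request takes over a slot the moment that slot frees up."""
    free_at = [0] * slots  # iteration at which each slot becomes free
    heapq.heapify(free_at)
    finish = 0
    for n in lengths:
        start = heapq.heappop(free_at)
        end = start + n
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish

def utilisation(lengths, slots, steps):
    """Fraction of slot-iterations that produced a token."""
    return sum(lengths) / (slots * steps)

lengths = [10, 200, 30, 50, 20, 100, 40, 10]  # output tokens per request
slots = 4

static_steps = static_batching_steps(lengths, slots)
continuous_steps = continuous_batching_steps(lengths, slots)
print(static_steps, round(utilisation(lengths, slots, static_steps), 3))
print(continuous_steps, round(utilisation(lengths, slots, continuous_steps), 3))
```

For this workload static batching needs 300 iterations at about 38% slot utilisation, and continuous batching needs 200 at about 58%. The one 200-token request sets the floor here, and a steady stream of arrivals would push utilisation higher.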
The complete inference optimisation stack
1. KV cache + PagedAttention
Never recompute past attention. Allocate cache memory like OS virtual memory for near-zero waste. Prerequisite for everything else.
2. Quantization (INT4/INT8/FP8)
Shrink model from 140 GB to 35 GB. Read weights 4× faster in decode phase. Use AWQ or GPTQ for near-lossless quality.
3. Continuous batching
Serve 32+ users simultaneously. Keep GPU utilisation above 95%. Cuts per-token cost by ~85% vs single-request serving.
4. Speculative decoding
Draft 4 tokens with a small model, verify with the large model in one pass. 2–4× latency improvement for coding and structured outputs.
5. FlashAttention (kernel-level)
Fuse attention computation on-chip to avoid slow GPU memory reads. 2–4× speedup for attention layers. On by default in all major inference engines.
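The kernel fusion in item 5 cannot be shown in Python, but the numerical trick that makes it possible can: attention is computed tile by tile with a running maximum and running softmax denominator, so the full score vector never has to be written out. The sketch below is a single-query, pure-Python illustration of that online-softmax idea. The function names and tile size are assumptions, and it says nothing about the actual GPU kernel.

```python
import math
import random

def naive_attention(q, keys, values):
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def tiled_attention(q, keys, values, tile=4):
    """Same result as naive_attention, visiting keys and values one tile at a time."""
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale what has been accumulated so far to the new maximum
        # (exp(-inf) is 0.0, so the first tile starts from zero).
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            p = math.exp(s - new_max)
            denom += p
            acc = [a + p * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]

random.seed(1)
dim, n_tokens = 8, 37
q = [random.gauss(0, 1) for _ in range(dim)]
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]

reference = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, tile=4)
max_diff = max(abs(a - b) for a, b in zip(reference, tiled))
print(max_diff)
```

Because each tile only needs the running maximum, the denominator, and the accumulator, a GPU kernel can keep them in fast on-chip memory and stream K and V through once.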
Inference engines in production
vLLM — fastest path to production, PagedAttention, 60s cold start, broad model support. Open source, UC Berkeley.
SGLang — best for shared-prefix workloads (RAG, chatbots). RadixAttention achieves higher cache hit rates.
TensorRT-LLM — peak throughput after compilation (28-min cold start). NVIDIA. 13% faster than vLLM at high concurrency.
Prefix caching
If many requests share the same prefix — a long system prompt, the same RAG document, the same code file — you can compute the KV cache for that prefix once and reuse it for all users. Prefix caching eliminates up to 90% of prefill cost for workloads with shared context. Standard in SGLang (RadixAttention), vLLM (hash-based), and Anthropic's prompt caching API.
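A block-level prefix cache can be sketched as a lookup keyed by the tokens of each full prefix. This is a toy that tracks only which prefixes are cached and how many tokens still need prefill; it stores no actual K/V tensors. The class name, block size, and token values are assumptions for illustration, and real engines key on a hash of each block chained to its parent rather than on the whole prefix tuple.

```python
class PrefixCache:
    """Toy prefix cache: remembers which block-aligned prefixes already have KV."""

    def __init__(self, block_size: int = 16):
        self.block_size = block_size
        self.cached_prefixes = set()

    def prefill(self, tokens) -> int:
        """Return how many tokens need fresh prefill computation."""
        reused = 0
        still_matching = True
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            key = tuple(tokens[:end])
            if still_matching and key in self.cached_prefixes:
                reused = end  # KV for tokens[:end] can be reused as-is
            else:
                still_matching = False
                self.cached_prefixes.add(key)
        return len(tokens) - reused

system_prompt = list(range(100))  # 100 shared tokens, e.g. a long system prompt
cache = PrefixCache(block_size=16)

first = cache.prefill(system_prompt + [1000, 1001, 1002])
second = cache.prefill(system_prompt + [2000, 2001])
print(first, second)
```

The first request computes all 103 tokens. The second reuses six full blocks (96 tokens) and computes only the remaining 6, a reduction of about 94% in prefill work for that request.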
Efficient Memory Management for Large Language Model Serving with PagedAttention — Kwon et al., SOSP 2023. PagedAttention + continuous batching together. The vLLM project's launch benchmarks reported up to 24× higher throughput than HuggingFace Transformers in serving scenarios.
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning — Dao 2023 (arXiv:2307.08691). 2× speedup over FlashAttention-1. Widely used attention kernel in large-model training and inference.
§ DEMO
Try it: KV cache visualizer.
Step through autoregressive decoding. Watch K and V fill up token by token. Compute-saved counter shows why caching is worth GBs of VRAM.