Demo · Module 08 · Interactive
The cache that makes inference actually fast.
[Interactive demo: the K cache and V cache fill across 12 decoding steps; compute per step (FLOPs proxy) is compared with no cache and with cache.]
What the KV cache actually is. During autoregressive decoding, every new token attends to every previous token: the query (Q) for the current token takes a dot product with the key (K) of each earlier token, and the resulting weights aggregate their values (V). Without a cache, you'd recompute K and V for every previous token at every new step, so the work per step grows with the sequence and the total over a generation grows quadratically in sequence length. With the KV cache, you compute K and V for each token once, when it arrives, and keep those vectors in GPU memory; each later token computes only its own Q, K, and V, appends its K and V to the cache, and attends against the stored entries. On long outputs, this can be the difference between an LLM responding in 200 ms and in 20 s.

The cost: cache size grows linearly with context length. Llama 3 70B uses grouped-query attention, so its 64 query heads share 8 KV heads: 80 layers × 8 KV heads × 128 d_head × 2 (K+V) × 2 bytes (fp16) ≈ 0.33 MB per token, or ~330 GB at 1M tokens. If all 64 heads kept their own K and V, the same arithmetic would give ~2.6 TB. This is why long context is expensive: during decoding the bottleneck is usually the memory, not the compute. Modern serving systems like vLLM use PagedAttention to manage the cache like virtual memory, storing it in fixed-size blocks so that many users' caches can be packed into the same GPU.
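The caching mechanism described above can be sketched for a single attention head in plain Python. This is a toy illustration, not how a production model is written: the helper names (`matvec`, `attend`, `decode`), the random weights, the 4-dimensional vectors, and the fixed 12-token input are all assumptions made for the example. It counts how many tokens get their K and V projected at each step, as a rough stand-in for the demo's FLOPs proxy.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    d = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(d)]

def decode(tokens, Wq, Wk, Wv, use_cache):
    """Attend step by step; return outputs and the number of tokens
    whose K and V were projected at each step."""
    k_cache, v_cache = [], []
    outputs, kv_projections = [], []
    for t, x in enumerate(tokens):
        if use_cache:
            # Project only the newest token and append it to the cache.
            k_cache.append(matvec(Wk, x))
            v_cache.append(matvec(Wv, x))
            keys, values = k_cache, v_cache
            n_proj = 1
        else:
            # Recompute K and V for every token seen so far.
            keys = [matvec(Wk, tok) for tok in tokens[: t + 1]]
            values = [matvec(Wv, tok) for tok in tokens[: t + 1]]
            n_proj = t + 1
        q = matvec(Wq, x)
        outputs.append(attend(q, keys, values))
        kv_projections.append(n_proj)
    return outputs, kv_projections

random.seed(0)
d = 4  # toy vector size, chosen only for illustration

def rand_matrix():
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(12)]

out_nc, cost_nc = decode(tokens, Wq, Wk, Wv, use_cache=False)
out_c, cost_c = decode(tokens, Wq, Wk, Wv, use_cache=True)
print(sum(cost_nc), sum(cost_c))  # 78 12
```

Both runs produce the same attention outputs; only the amount of repeated projection work differs. Over 12 steps the uncached path projects 1 + 2 + ... + 12 = 78 tokens, while the cached path projects 12.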
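The cache-size arithmetic is easy to check by hand or in a few lines of Python. The sketch below assumes the cache stores one K and one V vector per layer per KV head for every token; the function name and parameters are illustrative. The 8-KV-head figure assumes Llama 3 70B's grouped-query attention configuration, and the 64-head line shows what the same shapes would cost if every attention head kept its own K and V.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, n_tokens, bytes_per_value=2):
    """Bytes needed to hold K and V for n_tokens of one sequence."""
    # The factor of 2 accounts for storing both K and V.
    per_token = n_layers * n_kv_heads * d_head * 2 * bytes_per_value
    return per_token * n_tokens

per_token = kv_cache_bytes(80, 8, 128, 1)        # fp16, one token
gqa_1m = kv_cache_bytes(80, 8, 128, 1_000_000)   # 8 KV heads (grouped-query)
mha_1m = kv_cache_bytes(80, 64, 128, 1_000_000)  # 64 KV heads (hypothetical)

print(per_token)      # 327680
print(gqa_1m / 1e9)   # 327.68
print(mha_1m / 1e12)  # 2.62144
```

These figures are per sequence. A server handling many concurrent users multiplies them by the number of active sequences, which is why cache management matters so much in practice.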