Demo · Module 08 · Interactive

The cache that makes inference
actually fast.

Step through autoregressive decoding
Watch K and V accumulate as tokens stream
Toggle cache off to see compute explode

[Interactive demo: K cache and V cache fill counters (0 / 12), compute per step (FLOPs proxy) with and without cache, step counter (0 / 12)]

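What the KV cache actually is. During autoregressive decoding, each new token attends to every previous token: the query (Q) for the current token is dot-producted against the keys (K) of all earlier tokens, and the resulting weights are used to aggregate their values (V). Without a cache, you would recompute K and V for the entire prefix at every step, so the work per step grows with the prefix length and the total over a generation grows at least quadratically with sequence length. With the KV cache, you compute K and V for each token once, when it arrives, and keep those vectors in GPU memory; each later token computes only its own Q, K, and V, appends its K and V to the cache, and attends over the stored entries. For long outputs this can be the difference between a response in roughly 200 ms and one in roughly 20 s.

The cost: cache size grows linearly with context length. Per token, the cache holds layers × KV heads × d_head × 2 (K+V) × bytes per value. Llama 3 70B has 80 layers and 64 query heads with d_head = 128, but it uses grouped-query attention with only 8 KV heads, so in fp16 that is 80 × 8 × 128 × 2 × 2 bytes ≈ 328 KB per token, or about 328 GB at a hypothetical 1M-token context. If all 64 heads were cached (no grouped-query attention), it would be about 2.6 MB per token, or roughly 2.6 TB at 1M tokens. This is why long context is expensive: it's not the compute, it's the memory. Modern serving systems like vLLM use PagedAttention to manage the cache like virtual memory, storing it in fixed-size blocks so that many users' caches can be packed into the same GPU.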
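To make the mechanism concrete, here is a minimal NumPy sketch, not how any production model implements it. It assumes a toy setup: one layer, one attention head, random weights, and 12 fixed input vectors to match the demo's 12 steps. The function names (`decode_no_cache`, `decode_with_cache`, `kv_cache_bytes`) are hypothetical. It shows that both decoding paths give the same outputs while the cached path does far fewer K/V projections, and it works out the cache size from the per-token formula.

```python
import numpy as np

def attend(q, K, V):
    """Single-head scaled dot-product attention for one query vector."""
    scores = K @ q / np.sqrt(q.shape[-1])      # one score per cached token
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    return weights @ V                         # weighted sum of values

def decode_no_cache(X, Wq, Wk, Wv):
    """Recompute K and V for the whole prefix at every step."""
    outputs, projections = [], 0
    for t in range(1, len(X) + 1):
        K = X[:t] @ Wk                         # t key projections, again
        V = X[:t] @ Wv                         # t value projections, again
        projections += 2 * t
        outputs.append(attend(X[t - 1] @ Wq, K, V))
    return np.stack(outputs), projections

def decode_with_cache(X, Wq, Wk, Wv):
    """Project each token's K and V once, then reuse them from the cache."""
    K_cache, V_cache = [], []
    outputs, projections = [], 0
    for x in X:
        K_cache.append(x @ Wk)                 # one new key
        V_cache.append(x @ Wv)                 # one new value
        projections += 2
        outputs.append(attend(x @ Wq, np.stack(K_cache), np.stack(V_cache)))
    return np.stack(outputs), projections

def kv_cache_bytes(n_layers, n_kv_heads, d_head, n_tokens, bytes_per_value=2):
    """Cache size in bytes; the leading 2 accounts for K and V."""
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_value * n_tokens

# Toy dimensions (assumptions for illustration only).
rng = np.random.default_rng(0)
d, steps = 16, 12
X = rng.standard_normal((steps, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

out_a, proj_a = decode_no_cache(X, Wq, Wk, Wv)
out_b, proj_b = decode_with_cache(X, Wq, Wk, Wv)

print("same outputs:", np.allclose(out_a, out_b))
print("K/V projections, no cache:", proj_a)     # 2 * (1 + 2 + ... + 12) = 156
print("K/V projections, with cache:", proj_b)   # 2 * 12 = 24

# Llama 3 70B shape with grouped-query attention (8 KV heads), fp16.
print("bytes per token:", kv_cache_bytes(80, 8, 128, 1))
print("GB at 1M tokens:", kv_cache_bytes(80, 8, 128, 1_000_000) / 1e9)
```

The projection count is only a rough proxy for compute: the attention step itself still scans every cached entry, so per-step cost with a cache grows linearly with context rather than staying flat.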
