Demo · Module 18 · Interactive

How big does ΔW need to be? Drag the rank.

Full FT: train all of W (d × d params)
LoRA: train A · B where A is d×r and B is r×d
Rank r is the main hyperparameter: it sets how many parameters LoRA trains
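The decomposition above can be sketched in a few lines of plain Python. This is a toy illustration, not a real LoRA implementation: the sizes d = 6 and r = 2, the integer-valued matrices, and the helper names `matmul`, `add`, and `rand_matrix` are all assumptions made for the example, and it omits details such as scaling and initialization. It checks that applying the adapter as a side path, (x · A) · B, gives the same output as multiplying by the merged matrix W + A · B.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two same-shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def rand_matrix(rows, cols):
    """Small random integer matrix (integers keep the comparison exact)."""
    return [[random.randint(-3, 3) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d, r = 6, 2              # toy sizes, chosen only for illustration

W = rand_matrix(d, d)    # frozen pretrained weight, d x d
A = rand_matrix(d, r)    # trainable factor, d x r
B = rand_matrix(r, d)    # trainable factor, r x d
x = rand_matrix(1, d)    # one input row vector, 1 x d

delta_W = matmul(A, B)   # the low-rank update, d x d with rank at most r

# Two equivalent ways to apply the update:
y_merged = matmul(x, add(W, delta_W))                    # x (W + A B)
y_adapter = add(matmul(x, W), matmul(matmul(x, A), B))   # x W + (x A) B

trainable = d * r + r * d   # parameters in A and B
print("full fine-tune params:", d * d)
print("LoRA trainable params:", trainable)
print("outputs match:", y_merged == y_adapter)
```

Because the side path never forms the d × d update explicitly, only A and B need gradients and optimizer state.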
LoRA Rank: r = 8 (balanced · sweet spot for most tasks)
Slider stops: 1 (minimal) · 4 · 8 (sweet spot) · 16 · 32 · 64 · 128 · 256 (full)
Matrix decomposition
ΔW (full fine-tune): d × d
A: d × r
B: r × d
ΔW ≈ A · B   ·   trainable params = 2 · d · r per layer × 96 attention layers
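The parameter count in the formula above can be tabulated across the slider's rank stops. The sketch below follows the demo's simplification of one d × d matrix per attention layer. The function names `full_ft_params` and `lora_params` are hypothetical, and d = 4096 with 32 layers is taken from the Llama 3 8B example in the demo text rather than from the 96-layer preset, whose d is not stated here.

```python
def full_ft_params(d, layers):
    """Trainable params for full fine-tuning: one d x d matrix per layer."""
    return d * d * layers

def lora_params(d, r, layers):
    """Trainable params for LoRA: A (d x r) plus B (r x d) per layer."""
    return 2 * d * r * layers

SLIDER_RANKS = [1, 4, 8, 16, 32, 64, 128, 256]  # rank stops shown on the slider

d, layers = 4096, 32  # Llama 3 8B values quoted in the demo text
full = full_ft_params(d, layers)

for r in SLIDER_RANKS:
    lora = lora_params(d, r, layers)
    print(f"r={r:>3}  LoRA params={lora:>12,}  share of full FT={lora / full:.4%}")
```

The share of full fine-tuning is 2r/d, so it grows linearly with the rank and does not depend on the number of layers.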
Comparison · full fine-tune vs LoRA
Full FT: 100% (d × d trainable params per layer)
LoRA r=8: 2 · d · r trainable params per layer, i.e. 2r/d of full FT
Model preset

Pick a real-world model to fine-tune. Changes d and the number of attention layers.

What this demo shows. When you fine-tune a pretrained model, full fine-tuning updates every parameter in a weight matrix W of shape d × d. LoRA instead assumes the change ΔW is low-rank, meaning it can be approximated as A · B, where A is d × r and B is r × d. You train only A and B; the original W stays frozen.

For Llama 3 8B with d = 4096 and 32 layers, counting one d × d matrix per attention layer as this demo does, full fine-tuning trains d² ≈ 16.8M params per attention layer. LoRA with r = 8 trains 2 · d · r = 65,536 params per attention layer, a 256× reduction.

Why does this work? Empirically, the weight change during fine-tuning tends to have low intrinsic rank: many fine-tuning tasks need only a small subspace of updates. QLoRA goes further: it quantizes the frozen base model to 4-bit and keeps only the LoRA adapters in higher precision, cutting the memory for the base weights by roughly 4× compared with 16-bit.
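The QLoRA memory claim can be checked with a rough back-of-the-envelope estimate. This is only a sketch under stated assumptions: the base model is taken to have about 8 billion parameters, the comparison baseline is 16-bit weights, the adapters are stored in 16-bit, and the adapter count reuses the demo's simplified figure of 2 · d · r per layer. It ignores quantization constants, activations, gradients, and optimizer state, so real VRAM usage will differ. The helper `weight_memory_gb` is a hypothetical name.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

BASE_PARAMS = 8_000_000_000         # assumption: roughly 8B parameters
ADAPTER_PARAMS = 2 * 4096 * 8 * 32  # demo's simplified LoRA count: 2*d*r per layer, 32 layers

base_16bit = weight_memory_gb(BASE_PARAMS, 16)         # frozen base in 16-bit
base_4bit = weight_memory_gb(BASE_PARAMS, 4)           # frozen base quantized to 4-bit
adapters_16bit = weight_memory_gb(ADAPTER_PARAMS, 16)  # assumption: adapters kept in 16-bit

print(f"base weights, 16-bit: {base_16bit:.1f} GB")
print(f"base weights, 4-bit:  {base_4bit:.1f} GB")
print(f"LoRA adapters:        {adapters_16bit:.4f} GB")
print(f"base weight reduction: {base_16bit / base_4bit:.0f}x")
```

Under these assumptions the adapters are negligible next to the base weights, which is why quantizing the frozen base accounts for nearly all of the saving.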