Demo · Module 18 · Interactive

How big does ΔW need to be? Drag the rank.

Full FT: train all of W (d × d params)
LoRA: train A · B, where A is d × r and B is r × d
Rank r is the hyperparameter this demo varies; it sets how many parameters LoRA trains

LoRA Rank: r = 8 (balanced · sweet spot for most tasks)

Slider stops: 1 (minimal), 4, 8 (sweet spot), 16, 32, 64, 128, 256 (full)

Matrix decomposition

ΔW (full fine-tune, d × d) ≈ A (d × r) · B (r × d)

ΔW ≈ A · B; trainable params = 2 · d · r per attention layer × 96 attention layers
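The parameter count above is simple enough to check by hand. The following is a minimal sketch, assuming the demo's simplification of one d × d matrix per attention layer; the function names are hypothetical, and the values d = 4096, 32 layers, and r = 8 are the Llama 3 8B figures quoted in the explanation at the end of this demo.

```python
def full_ft_params(d: int, layers: int) -> int:
    """Params trained by full fine-tuning, counting one d x d matrix per layer."""
    return d * d * layers


def lora_params(d: int, r: int, layers: int) -> int:
    """Params trained by LoRA: A (d x r) plus B (r x d) in each layer."""
    return 2 * d * r * layers


# Assumed example values: the Llama 3 8B figures used in this demo's text.
d, layers, r = 4096, 32, 8

full = full_ft_params(d, layers)
lora = lora_params(d, r, layers)

print("full FT per layer:", full // layers)  # 16777216
print("LoRA per layer:   ", lora // layers)  # 65536
print("reduction factor: ", full // lora)    # 256
```

The layer count cancels in the ratio, so the reduction factor is d / (2 · r) whatever preset is selected.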

Comparison · full fine-tune vs LoRA

Full FT: 100% of parameters trained

LoRA r=8: 2 · r / d of the full fine-tune count (depends on the preset's d)

Model preset

Pick a real-world model to fine-tune. Changes d and the number of attention layers.

What this demo shows. Full fine-tuning of a pretrained model updates every parameter in a weight matrix W of shape d × d. LoRA instead assumes the change ΔW is low-rank, meaning it can be approximated as A · B, where A is d × r and B is r × d, with r much smaller than d. You train only A and B; the original W stays frozen.

For Llama 3 8B with d = 4096 and 32 layers, and counting one d × d matrix per attention layer as this demo does, full fine-tuning trains d · d = 16,777,216 (~16.8M) params per attention layer. LoRA with r = 8 trains 2 · d · r = 65,536 params per attention layer, a 256× reduction.

Why does this work? Empirically, the change in weights during fine-tuning tends to have low intrinsic rank: most fine-tuning tasks only need updates within a small subspace. QLoRA goes further: it quantizes the frozen base model to 4-bit and keeps only the LoRA adapters in higher precision, cutting the memory for the base weights roughly 4× relative to 16-bit storage.
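The frozen-W, trainable-A-and-B setup described above can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions, not any library's implementation: the sizes d = 6 and r = 2 are arbitrary, the helper names are hypothetical, starting B at zero (so ΔW = 0 before training) is one common convention assumed here, and the "training step" is a stand-in that just overwrites B.

```python
import math
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two same-shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


def lora_forward(x, W, A, B):
    """Compute x W + (x A) B without building the d x d product A B."""
    return add(matmul(x, W), matmul(matmul(x, A), B))


random.seed(0)
d, r = 6, 2  # toy sizes, chosen only for illustration

W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]    # frozen, d x d
A = [[random.gauss(0, 0.1) for _ in range(r)] for _ in range(d)]  # trainable, d x r
B = [[0.0] * d for _ in range(r)]                                 # trainable, r x d, zero at start
x = [[random.gauss(0, 1) for _ in range(d)]]                      # one input row, 1 x d

# With B = 0 the adapter contributes nothing, so the output equals the base model's.
base_out = matmul(x, W)
same_at_init = lora_forward(x, W, A, B) == base_out

# Stand-in for a training step: only B changes; W is never modified.
B = [[0.5] * d for _ in range(r)]
adapted_out = lora_forward(x, W, A, B)

# Adding ΔW = A B to W explicitly gives the same output as the two-path form.
merged_out = matmul(x, add(W, matmul(A, B)))
merge_matches = all(
    math.isclose(p, q, rel_tol=1e-9, abs_tol=1e-9)
    for p, q in zip(adapted_out[0], merged_out[0])
)

trainable = d * r + r * d  # entries in A plus entries in B
print(same_at_init, merge_matches, trainable, d * d)
```

Even at this toy size the adapter holds 24 trainable values against 36 in W; the gap widens quickly as d grows while r stays small.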