Training Tricks
Module 15 · Training · 10 min

Mixed precision and gradient tricks.

The optimizations that turn theoretical training into practical training.

Reading time: 10 min · Audio: narration available · Prerequisites: None · Source: Track A · Gemini
§ 1

What this lesson covers.

This module is one of 42 in the curriculum. Below is the canonical interactive lesson (tabs, cards, and diagrams from the source repo), rendered inside the course shell. An audio narration runs alongside it: the sticky player at the top of the page plays the full Module 15 clip.

If you prefer to read first and play with the demos after, the interactive lesson sits below this section. If you'd rather hear it narrated while you scroll, hit play on the sticky audio bar at the top — or just let it autoplay.

§ 2

The lesson itself.

Interactive lesson · ported from Gemini track

Layer 9 — Training Optimizations

The Smart Mountain Climber

Course File: L9
The "Mountain Climber" Analogy
How does an AI actually "learn"?
Imagine you are blindfolded on a foggy mountain, and your goal is to find the lowest valley (which represents 0 mistakes).

You feel the slope with your feet and take a step downhill. But what if you take a step that is too big? You might jump over the valley and land higher up on the opposite cliff! What if your step is too small? It will take you a million years to get to the bottom. Sometimes there are small "fake" valleys (local minima) that trap you before you reach the true bottom.
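The step-size trade-off can be sketched in a few lines. This is a minimal illustration, not the lesson's demo code: the bowl-shaped loss `x**2`, the starting point, and the three step sizes are assumptions chosen to make each behavior easy to see.

```python
def gradient_descent(lr, steps=50, start=4.0):
    """Walk downhill on the toy loss f(x) = x**2 and return every position visited."""
    x = start
    path = [x]
    for _ in range(steps):
        grad = 2.0 * x       # slope of x**2 at the current position
        x = x - lr * grad    # step downhill, scaled by the learning rate
        path.append(x)
    return path


good = gradient_descent(lr=0.1)         # settles close to the valley floor at x = 0
too_big = gradient_descent(lr=1.1)      # every step lands higher on the opposite side
too_small = gradient_descent(lr=1e-4)   # barely moves in 50 steps

for name, path in [("good", good), ("too big", too_big), ("too small", too_small)]:
    print(f"{name:>9}: start={path[0]:.2f}  end={path[-1]:.4g}")
```

On this toy loss any step size above 1.0 makes the climber bounce farther from the valley each time; the exact threshold depends on how steep the real loss surface is.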
Interactive: Tune the AdamW Optimizer
Use the controls to launch a mathematical "ball" down the loss curve.
Learning Rate: How big of a step the ball takes. (Default ~3e-4)
Momentum (β₁): Allows the ball to roll up small bumps to escape fake valleys! (Default ~0.9)
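Second-moment decay (β₂, the RMSProp-style term) & Weight Decay (λ): (Hidden to keep the physics simple) Large-model training recipes commonly use β₂=0.95, λ=0.1; library defaults can differ (PyTorch's AdamW ships with β₂=0.999, λ=0.01).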
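To see how the four knobs fit together, here is a hedged sketch of one AdamW update for a single scalar weight, using decoupled weight decay. The function name `adamw_step`, the `eps` value, and the toy gradient are assumptions for illustration; the other defaults mirror the values quoted in the list above.

```python
import math


def adamw_step(theta, grad, m, v, t,
               lr=3e-4, beta1=0.9, beta2=0.95, eps=1e-8, weight_decay=0.1):
    """One AdamW update for a scalar weight. `t` is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # momentum: running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                # bias correction for the zero start
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * weight_decay * theta     # decoupled weight decay
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)  # scaled gradient step
    return theta, m, v


# Hypothetical first step: weight 1.0, gradient 2.0, empty optimizer state.
theta, m, v = adamw_step(theta=1.0, grad=2.0, m=0.0, v=0.0, t=1)
print(theta, m, v)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the weight moves by roughly one learning rate plus a small decay term, whatever the gradient's magnitude.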
Why Momentum matters
Escaping Local Minima
The demo's loss curve has a small dip on the left side before the massive dip on the right. Without momentum, a ball dropped on the left side would roll into the small dip and stop.

With Momentum (standard in the AdamW Optimizer), the ball builds up speed as it rolls, giving it enough juice to roll out of the small dip and find the true lowest point.
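The two outcomes can be reproduced on a hand-built curve. This sketch uses classical "heavy-ball" momentum rather than the full AdamW update, and the two-valley loss, the starting point, and the names `loss_and_grad` and `roll` are assumptions made for illustration.

```python
def loss_and_grad(x):
    """Toy two-valley loss: a shallow dip at x = -1 and a deeper valley at x = 2."""
    shallow = (x + 1.0) ** 2       # minimum value 0 at x = -1
    deep = (x - 2.0) ** 2 - 3.0    # minimum value -3 at x = 2
    if shallow <= deep:            # left of the ridge at x = 0
        return shallow, 2.0 * (x + 1.0)
    return deep, 2.0 * (x - 2.0)


def roll(momentum, lr=0.1, steps=300, start=-3.0):
    """Release a ball at `start` and return where it ends up."""
    x, velocity = start, 0.0
    for _ in range(steps):
        _, grad = loss_and_grad(x)
        velocity = momentum * velocity - lr * grad   # keep part of the old speed
        x = x + velocity
    return x


no_momentum = roll(momentum=0.0)     # plain gradient descent
with_momentum = roll(momentum=0.9)   # same learning rate, with momentum

print(f"no momentum:   x = {no_momentum:.3f}")
print(f"with momentum: x = {with_momentum:.3f}")
```

With these settings the plain ball stops in the shallow dip while the momentum ball carries over the ridge and settles in the deeper valley. That result depends on the starting height, learning rate, and momentum value; momentum makes escape more likely but does not guarantee it.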
Backpack Splitting
Mixed Precision & ZeRO
Even if you have the perfect optimizer, carrying all the math requires heavy "backpacks" (memory).

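Mixed Precision (BF16): Instead of doing all the math with about 7 significant decimal digits (FP32), the AI does most of it with roughly 2 to 3 (BF16), which halves the weight of each number. BF16 keeps FP32's exponent range, so very large and very small values still fit; a full-precision copy of the weights is typically kept for the optimizer update.
ZeRO (Zero Redundancy Optimizer): Instead of every graphics card carrying a full copy of the backpack, a cluster of GPUs (say, 8) splits the optimizer state, gradients, and weights into shards, and the cards pass pieces to each other only when they are needed, so nobody gets crushed.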
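Both ideas come down to arithmetic that can be sketched without a GPU. The first function below rounds a number to BF16 by keeping the top 16 bits of its FP32 pattern. The second estimates per-GPU memory under a commonly cited accounting for mixed-precision Adam: 2 bytes per parameter for weights, 2 for gradients, and 12 for optimizer state. That accounting, the stage numbering, and the names `to_bf16` and `bytes_per_gpu` are assumptions for illustration; activations and temporary buffers are ignored.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest BF16 value, returned as a Python float."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]   # FP32 bit pattern
    bits += 0x7FFF + ((bits >> 16) & 1)                   # round to nearest, ties to even
    bits &= 0xFFFF0000                                    # keep sign, 8 exponent bits, 7 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def bytes_per_gpu(num_params: int, num_gpus: int, stage: int) -> float:
    """Estimated model-state memory per GPU for a given ZeRO stage (0 = no sharding)."""
    weights, grads, optim = 2.0, 2.0, 12.0   # assumed bytes per parameter
    if stage >= 1:
        optim /= num_gpus      # stage 1: shard optimizer state
    if stage >= 2:
        grads /= num_gpus      # stage 2: also shard gradients
    if stage >= 3:
        weights /= num_gpus    # stage 3: also shard the weights
    return num_params * (weights + grads + optim)


print(to_bf16(3.14159265))   # only about 3 significant digits survive
print(to_bf16(1e30))         # huge values still fit, at reduced precision

for stage in range(4):
    gb = bytes_per_gpu(1_000_000_000, num_gpus=8, stage=stage) / 1e9
    print(f"ZeRO stage {stage}: {gb:.2f} GB per GPU")
```

Under these assumptions a 1-billion-parameter model needs 16 GB of model state per GPU without sharding and 2 GB per GPU when all three pieces are split across 8 cards.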
