§ 1
What this lesson covers.
This module is one of 42 in the curriculum. Below is the canonical interactive lesson — tabs, cards, and diagrams from the source repo, rendered inside the course shell. There is no audio narration for this module - it ships as text + interactive lesson only.
If you prefer to read first and play with the demos after, the interactive lesson sits below this section. If you'd rather hear it narrated while you scroll, hit play on the sticky audio bar at the top — or just let it autoplay.
Interactive lesson · ported from Gemini track
Click tabs to navigate · hover cards for details
WHISPEROpenAI's STT model, trained on 680K hours of weakly-supervised audio
Whisper (Radford et al. 2022) is an encoder-decoder transformer trained on 680,000 hours of multilingual audio paired with transcripts. The scale and diversity of training data is what gives it robustness — it handles accents, background noise, and 99 languages without retraining. Five sizes available (tiny 39M, base 74M, small 244M, medium 769M, large 1550M). The large-v3 model from 2024 is the standard for production.
INPUT REPRESENTATION80-channel mel spectrogram, 30-second windows
Whisper doesn't ingest raw audio. Audio is resampled to 16kHz, converted to an 80-channel log-mel spectrogram, and chunked into 30-second windows. Each window becomes a sequence of 1500 "tokens" (3000 ms-frames at 100Hz, downsampled). The encoder transforms these into hidden states; the decoder generates the transcript token-by-token, attending to the encoder via cross-attention.
TIMESTAMPSWord-level alignment via special token prediction
Whisper learns to predict timestamp tokens (every 0.02 seconds) interleaved with transcript tokens. After decoding, you can extract word-level start/end times by parsing where the timestamp tokens fall. Post-processing tools like WhisperX use forced alignment with a smaller acoustic model to refine timestamps to ~10ms precision — necessary for subtitle creation or lip-sync.
DEPLOYMENT REALITYfaster-whisper, whisper.cpp, distil-whisper
The reference PyTorch implementation is slow. faster-whisper (CTranslate2 backend) gives 4× speedup. whisper.cpp ports the model to CPU with no Python dependencies — runs in real time on a laptop. distil-whisper (HF 2023) distills the encoder for 6× faster inference at 99% of the accuracy. The studio TTS/STT Audio Sync pipeline uses faster-whisper for batch jobs, whisper.cpp for live transcription.
The canonical references for this module. External links open in a new tab.
§ NEXT
What to read next.
Use the pager below to move sequentially through the curriculum, or jump to any module from the course index. Each track has a "Prereq: ↑ foundation" callout so you can backfill anything that wasn't clear.