Module f-stt · Voice · 8 min

STT pipelines (Whisper).

Whisper architecture, alignment timestamps, language detection.

Reading time: 8 min · Audio: none · Prerequisites: 21 · Source: Track A · Gemini
§ 1

What this lesson covers.

This module is one of 42 in the curriculum. Below is the canonical interactive lesson (tabs, cards, and diagrams from the source repo), rendered inside the course shell. There is no audio narration for this module; it ships as text plus the interactive lesson only.

If you prefer to read first and play with the demos after, the interactive lesson sits below this section.

§ 2

The lesson itself.

Interactive lesson · ported from Gemini track
Voice · Recognition

STT Pipelines — Whisper Architecture

Robust speech recognition via large-scale weak supervision · alignment timestamps

WHISPER

OpenAI's STT model, trained on 680K hours of weakly-supervised audio

Whisper (Radford et al. 2022) is an encoder-decoder transformer trained on 680,000 hours of multilingual audio paired with transcripts. The scale and diversity of the training data are what give it robustness: it handles accents, background noise, and 99 languages without fine-tuning. Five sizes are available (tiny 39M, base 74M, small 244M, medium 769M, large 1550M parameters). The large-v3 checkpoint, released in November 2023, is a common default for production.
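To make the size trade-off concrete, here is a rough sketch that turns the parameter counts quoted above into an approximate weight footprint. It assumes 2 bytes per parameter (fp16) and counts weights only, so activations, decoding caches, and runtime overhead come on top. The helper names are illustrative and not part of any Whisper library.

```python
from typing import Optional

# Parameter counts in millions, as quoted in the lesson text.
WHISPER_PARAMS_M = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large": 1550,
}


def weight_memory_mb(model: str, bytes_per_param: int = 2) -> int:
    """Approximate weight storage in MB (10^6 bytes). Weights only."""
    # millions of parameters * bytes per parameter = megabytes
    return WHISPER_PARAMS_M[model] * bytes_per_param


def largest_model_that_fits(budget_mb: float, bytes_per_param: int = 2) -> Optional[str]:
    """Return the biggest size whose weights fit the budget, or None."""
    fitting = [
        name for name in WHISPER_PARAMS_M
        if weight_memory_mb(name, bytes_per_param) <= budget_mb
    ]
    if not fitting:
        return None
    return max(fitting, key=lambda name: WHISPER_PARAMS_M[name])


if __name__ == "__main__":
    for name in WHISPER_PARAMS_M:
        print(f"{name:>6}: ~{weight_memory_mb(name)} MB in fp16")
```

Treat the result as a lower bound when sizing hardware: an int8-quantized runtime would use `bytes_per_param=1`, while real memory use during decoding is higher than the weights alone.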
INPUT REPRESENTATION

80-channel mel spectrogram, 30-second windows

Whisper doesn't ingest raw audio. Audio is resampled to 16 kHz, converted to an 80-channel log-mel spectrogram (large-v3 uses 128 mel bins), and chunked into 30-second windows. Each window yields 3000 spectrogram frames (one every 10 ms, i.e. 100 Hz), which a strided convolution at the front of the encoder downsamples by 2× into a sequence of 1500 "tokens". The encoder transforms these into hidden states; the decoder generates the transcript token-by-token, attending to the encoder via cross-attention.
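The shape arithmetic for one window can be checked in a few lines. This is a sketch under two assumptions taken from the reference implementation rather than from the lesson text: a hop of 160 samples between spectrogram frames (10 ms at 16 kHz) and a stride-2 convolution in the encoder stem.

```python
SAMPLE_RATE = 16_000   # Hz, after resampling
CHUNK_SECONDS = 30     # length of one input window
HOP_LENGTH = 160       # assumption: samples between mel frames (10 ms)
CONV_STRIDE = 2        # assumption: time downsampling in the encoder stem


def window_shapes(
    sample_rate: int = SAMPLE_RATE,
    chunk_seconds: int = CHUNK_SECONDS,
    hop_length: int = HOP_LENGTH,
    conv_stride: int = CONV_STRIDE,
) -> dict:
    """Derive the sequence lengths for one fixed-size window."""
    samples = sample_rate * chunk_seconds             # raw samples per window
    mel_frames = samples // hop_length                # spectrogram frames
    encoder_positions = mel_frames // conv_stride     # encoder sequence length
    seconds_per_position = chunk_seconds / encoder_positions
    return {
        "samples": samples,
        "mel_frames": mel_frames,
        "encoder_positions": encoder_positions,
        "seconds_per_position": seconds_per_position,
    }


shapes = window_shapes()
print(shapes)
```

Each of the 1500 encoder positions therefore covers 20 ms of audio, which matches the 0.02-second step of the timestamp tokens.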
TIMESTAMPS

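Segment timestamps via special tokens, word-level alignment via post-processing

Whisper learns to predict timestamp tokens, quantized to 0.02-second steps, interleaved with transcript tokens. After decoding, you can recover start and end times for each segment (a phrase or sentence) by parsing where the timestamp tokens fall. These tokens mark segment boundaries rather than individual words, so word-level timing needs a further step. Post-processing tools like WhisperX use forced alignment with a smaller acoustic model to pin each word to its position in the audio, tightening timestamps to the order of tens of milliseconds, which is necessary for subtitle creation or lip-sync.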
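Parsing timestamp tokens out of a decoded sequence can be sketched as follows. The function is illustrative, not the library's own decoder logic: it assumes every token id at or above `timestamp_begin` is a timestamp token counting 0.02-second steps from the start of the window, and that timestamps arrive in opening/closing pairs. The real value of `timestamp_begin` depends on the tokenizer and checkpoint, so it is passed in as a parameter and the example uses a made-up value.

```python
from typing import List, Tuple

TIME_PRECISION = 0.02  # seconds per timestamp-token step

Segment = Tuple[float, float, List[int]]  # (start_s, end_s, text token ids)


def split_segments(tokens: List[int], timestamp_begin: int) -> List[Segment]:
    """Group text tokens into timed segments delimited by timestamp tokens."""
    segments: List[Segment] = []
    start = None          # start time of the segment currently open, if any
    text: List[int] = []  # text tokens collected for the open segment
    for tok in tokens:
        if tok >= timestamp_begin:
            t = round((tok - timestamp_begin) * TIME_PRECISION, 2)
            if start is None:
                start = t                           # opening timestamp
            else:
                segments.append((start, t, text))   # closing timestamp
                start, text = None, []
        elif start is not None:
            text.append(tok)                        # text inside a segment
        # tokens outside any segment (prompt or end markers) are skipped
    return segments


# Hypothetical ids: 1000 stands in for the first timestamp token.
demo = split_segments([1000, 5, 6, 1120, 1120, 7, 1250], timestamp_begin=1000)
print(demo)
```

The text token ids in each segment would then be decoded back to a string by the tokenizer. Note that these times are relative to the start of the 30-second window, so a long file needs the window offset added.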
DEPLOYMENT REALITY

faster-whisper, whisper.cpp, distil-whisper

The reference PyTorch implementation is slow. faster-whisper (CTranslate2 backend) gives up to a 4× speedup at the same accuracy. whisper.cpp is a plain C/C++ port with no Python dependencies that runs in real time on a laptop CPU. distil-whisper (Hugging Face, 2023) keeps the encoder and distills the decoder down to two layers, giving roughly 6× faster inference with word error rate within 1% of the original model on out-of-distribution audio. The studio TTS/STT Audio Sync pipeline uses faster-whisper for batch jobs and whisper.cpp for live transcription.
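As an illustration of a batch job of this kind, the sketch below transcribes a file with faster-whisper and writes the segments out as SRT subtitles. It is a minimal example, not the studio pipeline itself: the model size, `device="cpu"`, and `compute_type="int8"` are assumptions chosen for a laptop, and the helper names are made up. The import is deferred so the formatting helpers work without the package installed.

```python
from typing import Iterable, Tuple


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Render (start_s, end_s, text) tuples as numbered SRT cues."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, start=1):
        blocks.append(f"{i}\n{srt_time(start)} --> {srt_time(end)}\n{text.strip()}\n")
    return "\n".join(blocks)


def transcribe_to_srt(audio_path: str, model_size: str = "small") -> str:
    """Transcribe one file and return SRT text (requires faster-whisper)."""
    from faster_whisper import WhisperModel  # lazy import: optional dependency

    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    # transcribe() returns a lazy generator of segments plus run metadata,
    # including the detected language and its probability.
    segments, info = model.transcribe(audio_path)
    print(f"detected language: {info.language} (p={info.language_probability:.2f})")
    return to_srt((seg.start, seg.end, seg.text) for seg in segments)


if __name__ == "__main__":
    print(to_srt([(0.0, 2.4, " Hello"), (2.4, 5.0, " world")]))
```

Calling `transcribe_to_srt("talk.wav")` with a real audio path would download the chosen model on first use. Passing `word_timestamps=True` to `transcribe()` additionally attaches per-word timings to each segment, which suits subtitle styles that highlight words as they are spoken.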
