§ 1
What this lesson covers.
This module is one of 42 in the curriculum. Below is the canonical interactive lesson — tabs, cards, and diagrams from the source repo, rendered inside the course shell. There is no audio narration for this module - it ships as text + interactive lesson only.
If you prefer to read first and play with the demos after, the interactive lesson sits below this section.
Interactive lesson · ported from Gemini track
THE TIERS
Three quality bands, three cost/latency profiles
Free / fast: edge-tts (a Python library and CLI for Microsoft Edge's online text-to-speech service), pyttsx3, espeak. Sub-second per utterance; the offline engines (pyttsx3, espeak) sound robotic but intelligible. Good for batch narration where you control the script.
Mid: ElevenLabs, Play.ht, Coqui XTTS (run locally). Roughly 2-5 s per utterance, natural-sounding, supports voice cloning.
Frontier: ElevenLabs v2, OpenAI tts-1-hd. Sub-second time-to-first-byte with streaming; clips under 30 seconds are often hard to distinguish from a human speaker.
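To make the cost/latency trade-off concrete, here is a rough sketch that turns the per-utterance figures above into a wall-clock estimate for a batch narration job. The dictionary keys, the function name, and the choice to model "sub-second" as an upper bound of 1.0 s are assumptions for illustration, and the estimate assumes utterances are synthesized one at a time.

```python
from typing import Dict, Tuple

# Per-utterance latency ranges in seconds, taken from the tier descriptions.
# "Sub-second" is modeled as (0.0, 1.0); that upper bound is an assumption.
TIER_LATENCY_S: Dict[str, Tuple[float, float]] = {
    "free": (0.0, 1.0),
    "mid": (2.0, 5.0),
}


def batch_time_range(tier: str, n_utterances: int) -> Tuple[float, float]:
    """Best-case and worst-case seconds to synthesize a batch serially."""
    low, high = TIER_LATENCY_S[tier]
    return (low * n_utterances, high * n_utterances)


if __name__ == "__main__":
    for tier in TIER_LATENCY_S:
        low, high = batch_time_range(tier, 500)
        print(f"{tier}: {low:.0f} to {high:.0f} s for 500 utterances")
```

For a 500-utterance script, the mid tier lands somewhere between roughly 17 and 42 minutes under these assumptions, which is why the free tier stays attractive for batch work where robotic delivery is acceptable.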
ARCHITECTURE: NEURAL CODEC LM
VALL-E and the codec language model paradigm
Codec language models such as VALL-E frame synthesis as language modeling over audio codec tokens. (Related systems such as NaturalSpeech 2 and Voicebox condition on a short reference in a similar way, but generate with diffusion or flow matching rather than next-token prediction.) An audio codec (EnCodec, SoundStream) compresses speech into discrete tokens. A transformer learns to predict those tokens given text plus a short speaker reference. Generation proceeds token by token, and the tokens are decoded back to audio as they arrive. This is why voice cloning got cheap: VALL-E reports that about 3 seconds of reference audio is enough for the model to condition on.
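The generation loop described above can be sketched with toy stand-ins. Nothing here is a real model: `toy_next_token` is a hypothetical placeholder for the transformer, `toy_decode` is a hypothetical placeholder for the codec decoder, and the vocabulary and chunk sizes are arbitrary. Real systems also stop on an end-of-speech token and use several codebooks per frame, which this sketch leaves out.

```python
from typing import Iterator, List, Sequence

VOCAB_SIZE = 1024   # assumed codebook size, for illustration only
CHUNK_TOKENS = 4    # tokens buffered before a chunk is decoded (assumption)


def toy_next_token(context: Sequence[int]) -> int:
    """Hypothetical stand-in for the transformer: deterministic, not learned."""
    return (sum(context) * 31 + len(context)) % VOCAB_SIZE


def toy_decode(tokens: Sequence[int]) -> List[float]:
    """Hypothetical stand-in for the codec decoder: one sample per token."""
    return [t / VOCAB_SIZE for t in tokens]


def synthesize(
    text_tokens: Sequence[int],
    speaker_prompt: Sequence[int],
    n_tokens: int,
) -> Iterator[List[float]]:
    """Yield decoded audio chunks while tokens are still being generated."""
    # The model conditions on the text and on the speaker reference tokens.
    context = list(text_tokens) + list(speaker_prompt)
    pending: List[int] = []
    for _ in range(n_tokens):
        token = toy_next_token(context)
        context.append(token)   # autoregressive: each token feeds the next step
        pending.append(token)
        if len(pending) == CHUNK_TOKENS:
            yield toy_decode(pending)
            pending = []
    if pending:                 # flush the final partial chunk
        yield toy_decode(pending)


if __name__ == "__main__":
    for chunk in synthesize([1, 2, 3], [7, 8], 10):
        print(len(chunk), "samples")
```

Changing only `speaker_prompt` changes every generated token, which is the toy analogue of conditioning on a different voice.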
VOICE CLONING
Few-shot speaker conditioning
Few-shot systems clone a voice from a short reference clip, from a few seconds up to about 30 seconds depending on the system. The speaker embedding is computed once (via a speaker encoder such as x-vector or ECAPA-TDNN) and reused for every subsequent utterance. Ethical guardrails matter: consent verification, watermarking, and refusal to clone political figures. The studio approach: clone only consented internal voices, watermark all output, and treat unauthorized voice cloning as misuse.
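The "compute once, reuse for every utterance" pattern, together with a consent gate, might look like the sketch below. `toy_speaker_encoder` is a hypothetical stand-in for a real encoder such as x-vector or ECAPA-TDNN, and `VoiceRegistry` and its methods are illustrative names, not an existing API. Watermarking is out of scope here.

```python
import math
from typing import Dict, List, Sequence

EMBED_DIM = 8  # toy dimensionality; real speaker encoders use far more


def toy_speaker_encoder(reference_audio: Sequence[float]) -> List[float]:
    """Hypothetical encoder: unit-normalized means over equal slices."""
    size = max(1, len(reference_audio) // EMBED_DIM)
    means = [
        sum(reference_audio[i * size:(i + 1) * size]) / size
        for i in range(EMBED_DIM)
    ]
    norm = math.sqrt(sum(m * m for m in means)) or 1.0
    return [m / norm for m in means]


class VoiceRegistry:
    """Stores one embedding per enrolled speaker, computed at enrollment."""

    def __init__(self) -> None:
        self._embeddings: Dict[str, List[float]] = {}
        self.encoder_calls = 0  # counts how often the encoder actually ran

    def enroll(
        self,
        speaker_id: str,
        reference_audio: Sequence[float],
        consent_verified: bool,
    ) -> None:
        # Guardrail: refuse to clone a voice without verified consent.
        if not consent_verified:
            raise PermissionError(f"no verified consent for {speaker_id!r}")
        self._embeddings[speaker_id] = toy_speaker_encoder(reference_audio)
        self.encoder_calls += 1

    def embedding_for(self, speaker_id: str) -> List[float]:
        """Return the cached embedding; the encoder is not run again."""
        return self._embeddings[speaker_id]


if __name__ == "__main__":
    registry = VoiceRegistry()
    registry.enroll("narrator", [0.1 * i for i in range(64)], True)
    for _ in range(3):
        registry.embedding_for("narrator")
    print("encoder calls:", registry.encoder_calls)
```

Keeping the consent check inside `enroll` means no embedding can exist for a voice that was never cleared, so downstream synthesis code does not need its own check.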
STREAMING TTFS
Time to first sound is what makes voice assistants feel responsive
For an attendant, total latency matters less than time to first sound (TTFS). Modern systems target sub-300ms TTFS via streaming: the model emits audio chunks as soon as enough tokens are generated, without waiting for the full sentence. Combined with sub-200ms STT and sub-100ms LLM time to first token, you get a voice interface that feels conversational rather than turn-based.
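A back-of-the-envelope sketch of the latency budget follows. The three stage bounds come from the text; the assumption that the stages run strictly one after another, the function names, and the token counts and per-token cost in the streaming comparison are all illustrative rather than measured.

```python
# Upper bounds from the text, in milliseconds.
STT_MS = 200              # speech-to-text
LLM_FIRST_TOKEN_MS = 100  # LLM time to first token
TTS_FIRST_SOUND_MS = 300  # TTS time to first sound


def turn_latency_ms(
    stt_ms: int = STT_MS,
    llm_ms: int = LLM_FIRST_TOKEN_MS,
    tts_ms: int = TTS_FIRST_SOUND_MS,
) -> int:
    """Worst-case delay before the user hears anything, assuming serial stages."""
    return stt_ms + llm_ms + tts_ms


def tts_first_sound_ms(
    total_tokens: int,
    chunk_tokens: int,
    ms_per_token: float,
    streaming: bool,
) -> float:
    """Time until the first audio can play.

    Streaming waits for one chunk of tokens; non-streaming waits for all of them.
    """
    tokens_needed = min(chunk_tokens, total_tokens) if streaming else total_tokens
    return tokens_needed * ms_per_token


if __name__ == "__main__":
    print("turn budget (ms):", turn_latency_ms())
    # Hypothetical sentence: 300 audio tokens, 20-token chunks, 10 ms per token.
    print("streaming TTFS (ms):", tts_first_sound_ms(300, 20, 10.0, True))
    print("full-sentence TTFS (ms):", tts_first_sound_ms(300, 20, 10.0, False))
```

Under these made-up numbers, streaming cuts the TTS wait from 3 seconds to 200 ms, and the full turn stays within a 600 ms worst case.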