Module f-tts · Voice · 8 min

TTS pipelines,
tier by tier.

edge-tts, ElevenLabs, voice cloning - what each tier costs in latency and quality.

Reading time: 8 min · Audio: none · Prerequisites: 21 · Source: Track A · Gemini
§ 1

What this lesson covers.

This module is one of 42 in the curriculum. Below is the canonical interactive lesson — tabs, cards, and diagrams from the source repo, rendered inside the course shell. There is no audio narration for this module - it ships as text + interactive lesson only.

If you prefer to read first and play with the demos after, the interactive lesson sits below this section.

§ 2

The lesson itself.

Interactive lesson · ported from Gemini track
Voice · Synthesis

TTS Pipelines — The Latency-Quality Frontier

edge-tts · ElevenLabs · Coqui XTTS · what each tier costs

THE TIERS

Three quality bands, three cost/latency profiles

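Free / fast: edge-tts (a Python package and CLI that calls the online text-to-speech service behind Microsoft Edge's read-aloud feature, with no API key), pyttsx3, espeak. Typically around a second or less per short utterance. pyttsx3 and espeak run offline and sound robotic but intelligible; edge-tts needs a network connection and uses neural voices that sound more natural. Good for batch narration where you control the script.

Mid: ElevenLabs, Play.ht, or Coqui XTTS run locally. Roughly 2-5 s per utterance when you wait for the full clip, natural-sounding, with voice cloning support.

Frontier: streaming endpoints from ElevenLabs (v2 models) and OpenAI (tts-1-hd). Sub-second time to first byte when streamed, and on clips shorter than about 30 seconds the output is often hard to tell apart from a human speaker.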
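The trade-off between the three tiers can be sketched as a small lookup. This is only an illustration: the `Tier` class, the `pick_tier` helper, the cost ordering, and the rounded wait times are assumptions derived from the descriptions above, not measurements or any vendor's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    wait_s: float   # rough wait before audio starts, rounded from the tier descriptions
    natural: bool   # True if the tier is described as natural-sounding


# Ordered cheapest first (assumed ordering, not a price list).
TIERS = [
    Tier("free/fast", wait_s=1.0, natural=False),
    Tier("mid", wait_s=5.0, natural=True),
    Tier("frontier", wait_s=1.0, natural=True),
]


def pick_tier(budget_s: float, need_natural: bool) -> Optional[Tier]:
    """Return the cheapest tier that meets the latency budget and quality bar."""
    for tier in TIERS:
        if tier.wait_s <= budget_s and (tier.natural or not need_natural):
            return tier
    return None  # nothing in the table satisfies both constraints
```

For example, `pick_tier(10.0, need_natural=True)` lands on the mid tier, while tightening the budget to one second pushes the same request to the frontier tier.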
ARCHITECTURE: NEURAL CODEC LM

VALL-E and the codec language model paradigm

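Codec language models such as VALL-E frame synthesis as language modeling over audio codec tokens. An audio codec (EnCodec, SoundStream) compresses speech into discrete tokens. A transformer learns to predict those tokens given text plus a short speaker reference, and the codec decoder turns the predicted tokens back into audio. Because tokens are generated one after another, decoding can start before the full utterance is finished, which is what makes streaming possible. Related prompt-conditioned systems take a different generative route: NaturalSpeech 2 uses latent diffusion over codec latents and Voicebox uses flow matching. This is why voice cloning got cheap: VALL-E reports that about 3 seconds of reference audio is enough for the model to condition on.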
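The data flow described above (audio to tokens, autoregressive token prediction conditioned on text and a speaker prompt, tokens back to audio) can be traced with a toy. Everything here is a stand-in: the "codec" is a plain uniform quantizer rather than a learned neural codec, and `next_token` is a hypothetical placeholder that replays the prompt instead of running a transformer.

```python
import math
from typing import Iterator, List

LEVELS = 256  # toy codebook size (assumption; real codecs use learned codebooks)


def encode(samples: List[float]) -> List[int]:
    """Toy 'codec' encoder: map samples in [-1, 1] to tokens in 0..LEVELS-1."""
    return [min(LEVELS - 1, max(0, int((s + 1.0) / 2.0 * LEVELS))) for s in samples]


def decode(tokens: List[int]) -> List[float]:
    """Toy 'codec' decoder: map each token back to the centre of its bin."""
    return [(t + 0.5) / LEVELS * 2.0 - 1.0 for t in tokens]


def next_token(text: str, prompt: List[int], generated: List[int]) -> int:
    """Hypothetical stand-in for the transformer's next-token prediction.

    It only replays the speaker prompt; a real model would sample from a
    learned distribution conditioned on all three arguments.
    """
    return prompt[len(generated) % len(prompt)]


def generate(text: str, prompt: List[int], tokens_per_char: int = 3) -> Iterator[int]:
    """Autoregressive loop: each step sees the text, the prompt, and prior output."""
    generated: List[int] = []
    for _ in range(len(text) * tokens_per_char):
        tok = next_token(text, prompt, generated)
        generated.append(tok)
        yield tok


reference = [math.sin(2 * math.pi * i / 8) for i in range(8)]  # fake reference audio
prompt_tokens = encode(reference)
audio = decode(list(generate("hi", prompt_tokens)))
print(len(audio), "samples decoded")
```

The point of the sketch is the shape of the loop, not the audio quality: swapping `next_token` for a trained model and `encode`/`decode` for a neural codec gives the pipeline the paragraph describes.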
VOICE CLONING

Few-shot speaker conditioning

Modern TTS can clone a voice from roughly 30 seconds of reference audio, and research systems such as VALL-E report usable results from as little as 3 seconds. In encoder-based systems, the speaker embedding is computed once (via a speaker encoder such as x-vector or ECAPA-TDNN) and reused for every subsequent utterance. Ethical guardrails matter: consent verification, watermarking, and refusal to clone political figures. The studio approach: clone only consented internal voices, watermark all output, and treat unauthorized voice cloning as misuse.
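The "compute once, reuse, and only for consented voices" pattern can be sketched as a small cache with a consent gate. This is a minimal sketch under stated assumptions: `SpeakerStore`, `toy_encoder`, and the consent registry are hypothetical names, and the encoder is a trivial stand-in for a real speaker encoder such as x-vector or ECAPA-TDNN.

```python
from typing import Callable, Dict, List, Set, Tuple

Embedding = Tuple[float, float]


def toy_encoder(audio: List[float]) -> Embedding:
    """Stand-in for a speaker encoder: (mean, mean absolute value) of the samples."""
    n = len(audio)
    return (sum(audio) / n, sum(abs(a) for a in audio) / n)


class SpeakerStore:
    """Caches one embedding per speaker and refuses speakers without consent."""

    def __init__(self, encoder: Callable[[List[float]], Embedding], consented: Set[str]):
        self._encoder = encoder
        self._consented = consented
        self._cache: Dict[str, Embedding] = {}
        self.encoder_calls = 0  # exposed so reuse is observable

    def embedding(self, speaker_id: str, reference_audio: List[float]) -> Embedding:
        if speaker_id not in self._consented:
            raise PermissionError(f"no recorded consent for speaker {speaker_id!r}")
        if speaker_id not in self._cache:
            self.encoder_calls += 1  # the expensive step runs once per speaker
            self._cache[speaker_id] = self._encoder(reference_audio)
        return self._cache[speaker_id]


store = SpeakerStore(toy_encoder, consented={"alice-internal"})
first = store.embedding("alice-internal", [0.1, -0.2, 0.3])
second = store.embedding("alice-internal", [0.1, -0.2, 0.3])
```

After the two calls, `store.encoder_calls` is 1 and both calls return the same cached object. A request for any speaker outside the consent set raises `PermissionError` before the encoder runs. Watermarking would sit on the output side and is not shown here.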
STREAMING TTFS

Time to first sound is what makes voice assistants feel responsive

For a voice assistant, total latency matters less than time to first sound (TTFS). Modern systems target sub-300 ms TTFS via streaming: the model emits audio chunks as soon as enough tokens are generated, without waiting for the full sentence. Combined with sub-200 ms STT and sub-100 ms LLM time to first token, that puts the gap between the end of the user's speech and the first audible reply at under roughly 600 ms if the stages run back to back, so the interface feels conversational rather than turn-based.
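The arithmetic behind these targets is easy to check. The sketch below adds up the per-stage targets quoted above and contrasts time to first chunk with time to full utterance for a streaming synthesizer. The token count, generation rate, and chunk size in the example call are made-up illustrative numbers, and the function names are hypothetical.

```python
from typing import Dict, Tuple

# Per-stage targets quoted in the text, in milliseconds.
TARGETS_MS: Dict[str, float] = {
    "stt": 200.0,
    "llm_first_token": 100.0,
    "tts_first_sound": 300.0,
}


def time_to_first_reply_ms(stages: Dict[str, float]) -> float:
    """Worst-case gap before the user hears anything, if stages run back to back."""
    return sum(stages.values())


def tts_timing_ms(n_tokens: int, ms_per_token: float, chunk_tokens: int) -> Tuple[float, float]:
    """Return (time to first chunk, time to full utterance) for a streaming decoder.

    Assumes a constant generation rate and ignores network and decoder overhead.
    """
    first = min(n_tokens, chunk_tokens) * ms_per_token
    total = n_tokens * ms_per_token
    return first, total


print(time_to_first_reply_ms(TARGETS_MS))  # 600.0
print(tts_timing_ms(150, 10.0, 20))        # (200.0, 1500.0)
```

In the second call the listener hears audio after 200 ms even though the full utterance takes 1.5 s to generate, which is the gap streaming is meant to hide.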
