§ 1
What this lesson covers.
This module is one of 42 in the curriculum. Below is the canonical interactive lesson — tabs, cards, and diagrams from the source repo, rendered inside the course shell. There is no audio narration for this module - it ships as text + interactive lesson only.
If you prefer to read first and play with the demos after, the interactive lesson sits below this section.
Interactive lesson · ported from Gemini track
THE LATENCY BUDGET
Three sub-second pipelines composed end-to-end
A voice attendant needs to feel like a phone call, so each turn has a latency budget split across three stages: (1) Speech-to-text starts decoding before the user stops talking, so its hypothesis is ready 100-200 ms after silence onset. (2) The LLM begins generating after the first STT hypothesis, with 100-200 ms time-to-first-token. (3) TTS begins emitting audio after the LLM's first sentence, adding another 100-200 ms. The stages sum to 300-600 ms, and the target is under 600 ms of perceived turn latency. Anything longer and the assistant feels like a phone tree.
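The budget arithmetic above can be sketched in a few lines. This is a minimal illustration, assuming the three stages run back to back. The stage names, the `Stage` class, and the `slowest_stages_if_over_budget` helper are hypothetical. Only the 100-200 ms ranges and the 600 ms target come from the lesson.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str       # illustrative label, not a real API name
    best_ms: int    # lower end of the range quoted in the lesson
    worst_ms: int   # upper end of the range quoted in the lesson


STAGES = [
    Stage("stt_finalize", 100, 200),
    Stage("llm_first_token", 100, 200),
    Stage("tts_first_audio", 100, 200),
]
TURN_BUDGET_MS = 600  # target from the lesson: stay under this


def turn_latency_range(stages):
    """Return (best, worst) total latency when the stages run back to back."""
    return sum(s.best_ms for s in stages), sum(s.worst_ms for s in stages)


def slowest_stages_if_over_budget(measured_ms, budget_ms=TURN_BUDGET_MS):
    """Return stage names, slowest first, if the measured turn misses the budget.

    An empty list means the turn finished under budget.
    """
    if sum(measured_ms.values()) < budget_ms:
        return []
    return sorted(measured_ms, key=measured_ms.get, reverse=True)
```

If every stage hits the top of its range, the total is exactly 600 ms, which leaves no headroom against the "under 600 ms" target. In practice at least one stage has to run near the bottom of its range.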
INTERRUPTION HANDLING
The assistant that interrupts itself loses every conversation
Voice user interfaces have to handle barge-in: the user starts talking while the assistant is speaking. The assistant must (a) detect the user's voice via voice activity detection (VAD), (b) stop its own TTS output mid-syllable, and (c) start listening immediately. Without barge-in, the assistant talks over the user and the conversation feels robotic. With it, the exchange feels like talking to a person.
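A minimal sketch of the barge-in steps (a) to (c), assuming a VAD that emits one voice probability per audio frame. The `BargeInController` class, the `stop_playback` callback, and the threshold and debounce values are all hypothetical and not taken from any particular VAD or audio library.

```python
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()
    SPEAKING = auto()


class BargeInController:
    """Tracks whose turn it is and cuts playback when the user starts talking."""

    def __init__(self, stop_playback, vad_threshold=0.6, min_voiced_frames=3):
        self.stop_playback = stop_playback          # hypothetical hook into the audio player
        self.vad_threshold = vad_threshold          # assumed value, tune per deployment
        self.min_voiced_frames = min_voiced_frames  # debounce against clicks and echo
        self.state = State.LISTENING
        self._voiced = 0

    def start_speaking(self):
        """Call when TTS playback begins."""
        self.state = State.SPEAKING
        self._voiced = 0

    def on_vad_frame(self, voice_probability):
        """Feed one VAD score per frame; returns True on the frame where barge-in fires."""
        if self.state is not State.SPEAKING:
            return False
        # (a) detect the user's voice: count consecutive voiced frames
        if voice_probability >= self.vad_threshold:
            self._voiced += 1
        else:
            self._voiced = 0
        if self._voiced >= self.min_voiced_frames:
            self.stop_playback()          # (b) stop TTS output immediately
            self.state = State.LISTENING  # (c) hand the turn back to the user
            self._voiced = 0
            return True
        return False
```

The debounce is a trade-off: requiring several consecutive voiced frames delays the cutoff by a few frames but makes it less likely that a stray noise, or the assistant's own audio leaking into the microphone, triggers a false interruption.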
TURN ENDPOINT DETECTION
When does the user actually stop talking?
Simple silence detection ("user paused 500ms, must be done") is unreliable, because people hesitate mid-sentence. Modern attendants use semantic endpoint detection: the STT hypothesis is fed to a small classifier that decides whether the utterance is grammatically complete. Combined with prosodic cues (final intonation contour), this gets endpoint accuracy to ~95%. The studio TTS/STT Audio Sync pipeline uses Picovoice's Cobra VAD + a custom semantic head.
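One way to combine the three signals is sketched below. This is an illustration under stated assumptions, not the studio pipeline: the `EndpointSignals` fields, the thresholds, and the prosody bonus are all hypothetical, and the completeness probability is assumed to come from whatever semantic classifier is in use.

```python
from dataclasses import dataclass


@dataclass
class EndpointSignals:
    silence_ms: int      # time since the VAD last reported speech
    p_complete: float    # classifier's probability that the utterance is complete
    falling_pitch: bool  # prosodic cue: final intonation contour falls


def is_end_of_turn(sig, min_silence_ms=200, max_silence_ms=1500, threshold=0.7):
    """Decide whether the user has finished their turn (all thresholds are assumed)."""
    # Too little silence: the user may just be between words.
    if sig.silence_ms < min_silence_ms:
        return False
    # Safety net: after a long pause, end the turn regardless of the classifier.
    if sig.silence_ms >= max_silence_ms:
        return True
    # In between, let semantics decide, nudged by the prosodic cue.
    score = sig.p_complete + (0.15 if sig.falling_pitch else 0.0)
    return score >= threshold
```

With these values, a 500 ms pause after an incomplete phrase (say `p_complete=0.2`) does not end the turn, which is the case plain silence detection gets wrong.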
PERSONA AS A FIRST-CLASS ASSET
Voice consistency across calls, sessions, and years
Your attendant's voice is your brand audio. Treat it like a logo: choose once, document the persona (warmth, pace, formality), and never silently swap it. When the underlying TTS provider updates its model, validate the new voice against the persona spec before promoting it. The studio approach: clone an internal voice once, lock the voice ID in code, and version the voice asset like any other brand artifact.
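Locking the voice ID in code and versioning the persona could look like the sketch below. Everything here is hypothetical: the `VoicePersona` fields, the placeholder voice ID, the pace figure, and the 10% tolerance are illustrative values, and speaking rate stands in for whatever checks a real persona spec would include.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the persona cannot be mutated at runtime
class VoicePersona:
    voice_id: str        # provider-side voice identifier, pinned in code
    asset_version: str   # bumped only through a reviewed change
    warmth: str
    pace_wpm: int        # target speaking rate, words per minute
    formality: str


ATTENDANT_VOICE = VoicePersona(
    voice_id="voice_EXAMPLE_0001",  # placeholder, not a real provider ID
    asset_version="1.0.0",
    warmth="warm",
    pace_wpm=150,
    formality="casual-professional",
)


def pace_matches_persona(measured_pace_wpm, persona=ATTENDANT_VOICE, tolerance=0.10):
    """Return True if a candidate render's speaking rate is within tolerance of the spec."""
    return abs(measured_pace_wpm - persona.pace_wpm) <= tolerance * persona.pace_wpm
```

A check like `pace_matches_persona` would run against sample renders from an updated provider model before the new voice is promoted. Pace is only the easiest attribute to measure automatically, and warmth and formality would likely still need a human listening pass.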
§ DEMO
Try it: attendant timeline.
Sub-second latency budget visualized. Watch STT, LLM, TTS slices fit into a turn-taking timeline.