Module f-attendant · Voice · 10 min

Turn-taking and attendants.

Sub-300ms first-syllable latency, the patterns underneath the AI Attendant project.

Reading time: 10 min · Audio: none · Prerequisites: f-stt, f-tts · Source: Track A · Gemini
§ 1

What this lesson covers.

This module is one of 42 in the curriculum. Below is the canonical interactive lesson — tabs, cards, and diagrams from the source repo, rendered inside the course shell. There is no audio narration for this module - it ships as text + interactive lesson only.

If you prefer to read first and play with the demos after, the interactive lesson sits below this section.

§ 2

The lesson itself.

Interactive lesson · ported from Gemini track
Voice · Conversation

Turn-Taking & Voice Attendants


THE LATENCY BUDGET

Three sub-second pipelines composed end-to-end

A voice attendant needs to feel like a phone call. The latency budget breaks down into three stages:
(1) Speech-to-text starts decoding before the user stops talking, so it needs only 100-200ms after silence onset.
(2) The LLM begins generating from the first STT hypothesis, with 100-200ms time-to-first-token.
(3) TTS begins emitting audio once the LLM's first sentence is available, 100-200ms later.
Run back to back, the stages add up to 300-600ms, so the target is under 600ms of perceived turn latency. Anything longer and the assistant feels like a phone tree.
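
The arithmetic behind the budget can be sketched in a few lines of Python. This is a minimal illustration, assuming the three stages run strictly back to back. The stage names, the `headroom_ms` helper, and the measured values in the usage lines are hypothetical and are not taken from the AI Attendant project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    best_ms: int   # lower bound of the stage's range
    worst_ms: int  # upper bound of the stage's range


# Ranges come from the budget above; the stage names are illustrative.
STAGES = [
    Stage("stt_after_silence_onset", 100, 200),
    Stage("llm_time_to_first_token", 100, 200),
    Stage("tts_first_audio", 100, 200),
]
TARGET_MS = 600  # perceived turn latency target


def turn_latency_range(stages):
    """Return (best, worst) turn latency in ms, assuming sequential stages."""
    best = sum(s.best_ms for s in stages)
    worst = sum(s.worst_ms for s in stages)
    return best, worst


def headroom_ms(measured_ms, target_ms=TARGET_MS):
    """Budget left after one measured turn; a negative value means over budget."""
    return target_ms - sum(measured_ms.values())


best, worst = turn_latency_range(STAGES)
remaining = headroom_ms({"stt": 150, "llm": 180, "tts": 120})  # made-up measurements
```

With these ranges the worst case lands exactly on the 600ms target, which suggests that every stage has to stay inside its own range for the turn to feel conversational.
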
INTERRUPTION HANDLING

The assistant that interrupts itself loses every conversation

Voice user interfaces have to handle barge-in: the user starts talking while the assistant is speaking. The assistant must (a) detect the user's voice via VAD (voice activity detection), (b) stop its own TTS output mid-syllable, and (c) start listening immediately. Without barge-in, the conversation feels like talking to a robot. With it, it feels like talking to a human.
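
Steps (a) to (c) can be read as a two-state machine. The sketch below is one possible shape, not the project's implementation: `FakePlayer` stands in for a real TTS playback handle, `BargeInController` and the 0.5 threshold are hypothetical, and the VAD is assumed to emit one speech probability per audio frame.

```python
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()
    SPEAKING = auto()


class FakePlayer:
    """Stand-in for a TTS playback handle (hypothetical interface)."""

    def __init__(self):
        self.playing = False

    def start(self):
        self.playing = True

    def stop(self):
        self.playing = False


class BargeInController:
    def __init__(self, player, vad_threshold=0.5):
        self.player = player
        self.vad_threshold = vad_threshold  # assumed cut-off, tune per VAD
        self.state = State.LISTENING

    def start_speaking(self):
        self.player.start()
        self.state = State.SPEAKING

    def on_vad_frame(self, speech_prob):
        """Feed one VAD probability per frame; return True if barge-in fired."""
        # (a) user speech detected while the assistant is talking
        if self.state is State.SPEAKING and speech_prob >= self.vad_threshold:
            self.player.stop()            # (b) cut TTS output immediately
            self.state = State.LISTENING  # (c) hand the turn back to the user
            return True
        return False


controller = BargeInController(FakePlayer())
controller.start_speaking()
fired = controller.on_vad_frame(0.9)  # user starts talking over the assistant
```

In a real deployment the microphone also picks up the assistant's own audio, so a check like this usually sits behind echo cancellation. That part is outside the scope of the sketch.
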
TURN ENDPOINT DETECTION

When does the user actually stop talking?

Simple silence detection ("user paused 500ms, must be done") is unreliable, because people hesitate mid-sentence. Modern attendants use semantic endpoint detection: the STT hypothesis is fed to a small classifier that decides whether the utterance is grammatically complete. Combined with prosodic cues (the final intonation contour), this brings endpoint accuracy to ~95%. The studio TTS/STT Audio Sync pipeline uses Picovoice's Cobra VAD plus a custom semantic head.
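
To make the idea concrete, here is a toy sketch of how silence, a completeness score, and a prosodic cue might be combined. It does not reproduce the studio's semantic head or the Cobra API: `completeness_score` is a crude word-list heuristic standing in for a trained classifier, and every threshold is an assumption chosen for illustration.

```python
# Words that rarely end a finished utterance (illustrative, far from exhaustive).
INCOMPLETE_ENDINGS = {
    "and", "but", "or", "so", "because", "the", "a", "an",
    "to", "of", "for", "with", "in", "on", "at", "um", "uh",
}


def completeness_score(hypothesis):
    """Toy stand-in for the semantic classifier: 0.0 unfinished, 1.0 complete."""
    words = hypothesis.lower().strip().rstrip(".?!").split()
    if not words:
        return 0.0
    if words[-1] in INCOMPLETE_ENDINGS:
        return 0.1
    return 0.9


def is_end_of_turn(hypothesis, silence_ms, falling_pitch,
                   min_silence_ms=200, max_silence_ms=1500, threshold=0.7):
    """Decide whether the user has finished speaking."""
    if silence_ms < min_silence_ms:
        return False  # still talking, or barely paused
    if silence_ms >= max_silence_ms:
        return True   # safety timeout so the assistant never waits forever
    score = completeness_score(hypothesis)
    if falling_pitch:
        score = min(1.0, score + 0.1)  # prosodic cue nudges toward "done"
    return score >= threshold


hesitating = is_end_of_turn("I'd like to book a table for", 500, False)
finished = is_end_of_turn("I'd like to book a table for two", 500, True)
```

The same 500ms pause yields two different decisions here, which is the point of moving beyond a fixed silence timer.
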
PERSONA AS A FIRST-CLASS ASSET

Voice consistency across calls, sessions, and years

Your attendant's voice is your brand audio. Treat it like a logo: choose once, document the persona (warmth, pace, formality), and never silently swap it. When the underlying TTS provider updates its model, validate the new voice against the persona spec before promoting it. The studio approach: clone an internal voice once, lock the voice ID in code, and version the voice asset like any other brand artifact.
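
One way to lock the voice ID in code and gate promotion is a frozen, versioned spec plus a validation function. Everything named below is hypothetical: the `PersonaSpec` fields, the placeholder voice ID, the pace range, and the idea of measuring pace in words per minute are assumptions for illustration, not the studio's actual spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the persona cannot be mutated at runtime
class PersonaSpec:
    version: str          # bump whenever the voice asset changes
    voice_id: str         # locked identifier of the cloned voice
    warmth: str
    formality: str
    pace_wpm: tuple       # accepted (min, max) speaking rate


ATTENDANT_PERSONA = PersonaSpec(
    version="1.0.0",
    voice_id="voice-placeholder-id",  # placeholder, not a real provider ID
    warmth="warm",
    formality="semi-formal",
    pace_wpm=(140, 170),
)


def validate_candidate(spec, candidate_voice_id, measured_wpm):
    """Return a list of problems; an empty list means safe to promote."""
    problems = []
    if candidate_voice_id != spec.voice_id:
        problems.append(
            f"voice_id changed: {spec.voice_id} -> {candidate_voice_id}"
        )
    low, high = spec.pace_wpm
    if not low <= measured_wpm <= high:
        problems.append(
            f"pace {measured_wpm} wpm outside {low}-{high} wpm"
        )
    return problems


issues = validate_candidate(ATTENDANT_PERSONA, "voice-placeholder-id", 150)
```

Warmth and formality are harder to measure automatically, so a check like this would likely complement a human listening review rather than replace it.
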
§ DEMO

Try it: attendant timeline.

Sub-second latency budget visualized. Watch STT, LLM, TTS slices fit into a turn-taking timeline.

