Demo · Voice attendant · Interactive

Sub-second turn-taking,
visualized as a timeline.

User speaks · STT · LLM · TTS
Three pipelines overlap to hit <600ms TTFS
Try the "bad" vs "good" budgets

Turn being processed

USER: "What time does the studio open tomorrow?"

FOUR PARALLEL LANES (user plus three pipelines) · time flows left to right

USER (talking): speaks the question
STT (Whisper)
LLM (Claude Haiku)
TTS (ElevenLabs)

What this demo shows. A voice attendant is three pipelines composed end to end: STT (speech-to-text), LLM (large language model generation), and TTS (text-to-speech). To feel conversational, the perceived turn latency must stay under ~600 ms. The trick is that the stages do not run in sequence; they overlap. STT begins decoding while the user is still talking. The LLM starts generating as soon as STT has a first hypothesis, not when transcription is done. TTS begins emitting audio once the LLM has produced its first sentence, not the full response. The GOOD budget shows tight streaming pipelines reaching ~350 ms time-to-first-syllable (TTFS). The BAD budget shows non-streaming services that wait for each stage to complete before starting the next: the user perceives a 1200 ms pause and the experience falls apart.
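The difference between the two budgets can be sketched as simple arithmetic. The snippet below is a minimal illustration, not the demo's actual code: the `Stage` type, the function names, and the per-stage millisecond values are all assumptions, chosen only so the totals land on the ~350 ms and 1200 ms figures quoted above. Times are measured from the moment the user stops speaking, so for STT they represent only the work left over after end of speech.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    first_output_ms: int  # time from stage start until its first usable output
    total_ms: int         # time from stage start until the stage is fully done


# Hypothetical per-stage numbers (assumed for illustration only).
STAGES = [
    Stage("STT", first_output_ms=100, total_ms=400),
    Stage("LLM", first_output_ms=150, total_ms=500),
    Stage("TTS", first_output_ms=100, total_ms=300),
]


def ttfs_sequential(stages):
    """BAD budget: each stage waits for the previous one to finish completely."""
    return sum(s.total_ms for s in stages)


def ttfs_streaming(stages):
    """GOOD budget: each stage starts on the previous stage's first output."""
    return sum(s.first_output_ms for s in stages)


if __name__ == "__main__":
    print(f"sequential: {ttfs_sequential(STAGES)} ms")  # sequential: 1200 ms
    print(f"streaming: {ttfs_streaming(STAGES)} ms")    # streaming: 350 ms
```

Under this simplified model, the streaming budget depends only on how quickly each stage produces its first output, which is why long responses do not have to lengthen the perceived pause.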
