Where every millisecond goes: VAD, EoU detection, semantic turn-taking, interruption handling, buffering strategies, filler words, and the diagnostics that tell you what to fix.
The 800 ms figure comes from conversation science, not engineering convention. Stivers et al. (PNAS, 2009) analysed 10 languages and found the median inter-turn gap is 200 ms, with the distribution tailing off sharply above 600 ms. By 800 ms, listeners start generating “uh, are you there?” uncertainty signals. By 2000 ms they assume a breakdown.
How the tiers feel in practice:

- Matches the fastest human conversation. Achievable with the OpenAI Realtime API or a well-tuned Cartesia/Flash + Haiku pipeline. Users report the bot feels “quick-thinking”.
- The target zone for most production bots. A slight pause is noticeable but interpreted as natural processing. A Deepgram Nova-3 + gpt-4o-mini + Cartesia Sonic pipeline sits at ~550–650 ms.
- Past the budget, users interpret silence as failure, a network drop, or the bot not having heard them. Barge-in rates increase; satisfaction scores drop significantly. Whisper Large-v3 + GPT-4o + ElevenLabs v2 without streaming lands here.
The clock starts when the user finishes speaking (EoU event) and stops when the first audio byte exits the speaker. This includes: ASR final latency, any LLM prefill and TTFT, TTS first-chunk latency, and network round-trip. Every millisecond from every stage counts against the same budget.
A production voice pipeline at p50 latency breaks down roughly as shown below. Each row is a stage whose latency stacks toward the total, and the optimisation levers differ by stage.
| Stage | Slow baseline | Optimised | Fastest possible | Key lever |
|---|---|---|---|---|
| VAD + EoU silence buffer | 600 ms | 300 ms | 100 ms | Silence threshold: lower = faster but more false triggers |
| ASR final transcript | 500 ms (Whisper Large) | 180 ms (Deepgram Nova-3) | 0 ms (overlap with EoU) | Cloud streaming ASR with speech_final flag |
| LLM TTFT | 700 ms (GPT-4o full prompt) | 280 ms (Claude 3.5 Haiku) | 180 ms (Llama 3.1-8B local) | Short system prompt; prompt caching; smaller model |
| TTS first chunk | 400 ms (ElevenLabs v2) | 90 ms (Cartesia Sonic) | 75 ms (ElevenLabs Flash) | Latency-first TTS; stream on sentence boundary |
| Network (server → client) | 80 ms (cross-region) | 20 ms (same-region) | 5 ms (co-located) | Deploy agent close to TTS service; use CDN edge |
Three worked budgets:

- Slow baseline: EoU 400 + ASR 500 + LLM 700 + TTS 300 + net 50 ≈ 1950 ms. More than 2× over budget. Do not ship this without aggressive optimisation.
- Optimised pipeline: EoU 300 + ASR 180 + LLM 320 + TTS 90 + net 20 ≈ 910 ms. Near budget. Tune EoU silence to 200 ms to reach ~810 ms.
- Speech-to-speech model: EoU 300 + audio→audio TTFT 380 + net 20 ≈ 700 ms. Under budget at p50. At p95 (busy periods) it rises to ≈ 950 ms, still acceptable.
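To make this arithmetic repeatable, a small helper can sum stage p50s against the budget and rank the line items. A minimal sketch; the stage names and numbers are the optimised-pipeline estimates above, not measurements from a live system:

```python
# Sanity-check a proposed stack against the 800 ms budget before building it.
BUDGET_MS = 800

def check_budget(stages: dict[str, float]) -> None:
    total = sum(stages.values())
    verdict = "OK" if total <= BUDGET_MS else f"over by {total - BUDGET_MS:.0f} ms"
    print(f"{total:.0f} ms total ({verdict})")
    for name, ms in sorted(stages.items(), key=lambda kv: -kv[1]):
        print(f"  {name:<10} {ms:>4.0f} ms ({ms / total:.0%})")

check_budget({"eou": 300, "asr": 180, "llm_ttft": 320, "tts_first": 90, "net": 20})
# → 910 ms total (over by 110 ms); llm_ttft is the largest line item
```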
With streaming ASR (interim_results=true), the LLM can start receiving partial transcripts before speech_final is set. A common pattern is to begin LLM pre-fill with interim transcripts and only “commit” the generation after the final transcript arrives. This overlaps ≈ 200 ms of ASR wait with LLM prefill, effectively removing ASR from the serial budget.
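One way to sketch the pattern in asyncio: start a speculative LLM task on each new interim transcript, keep it if the final transcript matches the last interim, and restart otherwise. All names here (`llm_stream`, the handler methods) are illustrative, not any framework's API:

```python
import asyncio

class SpeculativePrefill:
    """Start the LLM on interim transcripts; commit only on speech_final."""

    def __init__(self, llm_stream):
        self.llm_stream = llm_stream          # async fn: transcript -> response
        self.task: asyncio.Task | None = None
        self.last_interim = ""

    def on_interim(self, transcript: str) -> None:
        if transcript == self.last_interim:
            return
        self.last_interim = transcript
        if self.task:
            self.task.cancel()                # transcript changed: restart speculation
        self.task = asyncio.create_task(self.llm_stream(transcript))

    async def on_final(self, transcript: str):
        if self.task and transcript == self.last_interim:
            return await self.task            # speculation was right: tokens already flowing
        if self.task:
            self.task.cancel()                # speculation was wrong: pay full TTFT
        return await self.llm_stream(transcript)
```

In practice you throttle the restarts (for example, only speculate on interims that have been stable for a few frames) so you are not paying for many cancelled generations.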
Voice Activity Detection (VAD) answers the binary question: is the user speaking right now? End-of-Utterance (EoU) detection answers: has the user finished speaking? They are related but distinct. A VAD that fires on every voiced frame will over-trigger; a silence-based EoU that waits 800 ms will feel sluggish.
Silero VAD (Silero AI, 2021; v5.1 as of 2025) is an 8-layer LSTM-based classifier trained on 6,000 h of noisy multilingual audio. Input: 512-sample (32 ms at 16 kHz) STFT-magnitude chunk. Output: speech probability 0.0–1.0. Inference on CPU: <1 ms / chunk; ONNX-exported for cross-platform use. Model size: 1.6 MB. Ships with Pipecat as SileroVADAnalyzer; available directly via silero_vad PyPI package.
Key parameters:

- `threshold` (default 0.5): speech/non-speech boundary. Lower = more sensitive, more false positives. 0.35 is common for noisy call-centre audio.
- `min_speech_duration_ms` (default 250): ignore bursts shorter than this. Filters coughs, clicks.
- `min_silence_duration_ms` (default 100): minimum silence before a speech segment is ended. Set higher to avoid splitting mid-sentence pauses.
- `speech_pad_ms` (default 30): padding added to both ends of a detected segment.

A minimal silence-based EoU loop on top of Silero VAD (`on_speech_started` and `on_end_of_utterance` are application-defined callbacks):

```python
import time

import numpy as np
import torch
from silero_vad import load_silero_vad

model = load_silero_vad()

EOU_SILENCE_MS = 300  # declare end-of-utterance after 300 ms of silence
last_speech: float = 0.0
in_utterance: bool = False

async def process_chunk(pcm_chunk: bytes) -> None:
    global last_speech, in_utterance
    # Model expects a float32 tensor of shape [512]; scale 16-bit PCM to [-1, 1].
    pcm_tensor = torch.from_numpy(
        np.frombuffer(pcm_chunk, dtype=np.int16).astype(np.float32) / 32768.0
    )
    prob = model(pcm_tensor, 16000).item()
    now = time.monotonic() * 1000
    if prob > 0.5:
        last_speech = now
        if not in_utterance:
            in_utterance = True
            await on_speech_started()
    elif in_utterance and (now - last_speech) > EOU_SILENCE_MS:
        in_utterance = False
        await on_end_of_utterance()  # → trigger ASR final + LLM
```
When the TTS is playing audio, that audio leaks back into the microphone. Without acoustic echo cancellation (AEC), the VAD will detect the bot’s own voice as user speech and enter an infinite loop. In WebRTC (LiveKit / Daily) AEC is handled by the browser’s WebRTC stack. In a telephony pipeline, you must gate the VAD with a TTSStartedFrame / TTSStoppedFrame signal and suppress VAD during TTS playout.
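For telephony, the gating can live in a small frame processor between the transport and the VAD. A sketch in the Pipecat style: `TTSStartedFrame` and `TTSStoppedFrame` are real Pipecat frames, but the `VADGate` processor is an illustration, and frame class names vary slightly across Pipecat versions:

```python
from pipecat.frames.frames import AudioRawFrame, Frame, TTSStartedFrame, TTSStoppedFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

class VADGate(FrameProcessor):
    """Drop inbound mic audio while the bot is speaking (no browser AEC)."""

    def __init__(self):
        super().__init__()
        self._tts_playing = False

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, TTSStartedFrame):
            self._tts_playing = True
        elif isinstance(frame, TTSStoppedFrame):
            self._tts_playing = False
        # Suppress mic audio during playout so the VAD never hears the bot's voice.
        if (self._tts_playing and isinstance(frame, AudioRawFrame)
                and direction == FrameDirection.DOWNSTREAM):
            return
        await self.push_frame(frame, direction)
```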
Silence-based EoU has a fundamental problem: speakers pause mid-sentence. A 300 ms silence after “I want to book a flight to —” should not trigger a bot response. Semantic turn-taking uses a small model to predict whether the user has finished a complete thought, not just gone quiet.
A small binary text classifier (e.g. a fine-tuned DistilBERT or a 3B-parameter SLM) receives the interim ASR transcript and outputs P(turn_complete). Threshold at ~0.85. LiveKit uses this pattern with an optional turn_detector plugin (model: livekit-agents/turn-detector-v1, 60M params, fine-tuned from SmolLM). Adds ~30 ms latency for inference but reduces false EoU by ~40 %.
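The classifier side of this pattern is easy to sketch with a Hugging Face text-classification pipeline. The checkpoint name below is a placeholder; any sequence classifier fine-tuned with {complete, incomplete} labels slots in the same way:

```python
from transformers import pipeline

# Hypothetical fine-tuned checkpoint; substitute your own EoU classifier.
clf = pipeline("text-classification", model="your-org/eou-distilbert")

def turn_complete(interim_transcript: str, threshold: float = 0.85) -> bool:
    scores = clf(interim_transcript, top_k=None)  # scores for every label
    p_complete = next(s["score"] for s in scores if s["label"] == "complete")
    return p_complete >= threshold

turn_complete("I want to book a flight to")          # → False: keep listening
turn_complete("I want to book a flight to Berlin.")  # → True: trigger EoU
```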
Deepgram’s speech_final: true flag and AssemblyAI’s session_information.punctuate both apply sentence-boundary models on the server side. These are effectively semantic EoU: the model has detected a sentence-final intonation pattern or punctuation. Enabling them adds ≈ 50–80 ms server-side processing but eliminates most mid-sentence false triggers.
```python
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import cartesia, deepgram, openai, silero, turn_detector

agent = VoicePipelineAgent(
    vad=silero.VAD.load(),
    stt=deepgram.STT(model="nova-3",
                     endpointing_ms=25),  # short silence; turn-detector controls EoU
    turn_detector=turn_detector.EOUModel(
        unlikely_threshold=0.3,  # below this: definitely not EoU
        likely_threshold=0.8,    # above this: definitely EoU
    ),
    llm=openai.LLM(model="gpt-4o-mini"),
    tts=cartesia.TTS(voice="a0e99841..."),
    min_endpointing_delay=0.2,  # 200 ms minimum silence regardless
    max_endpointing_delay=1.0,  # 1 s maximum wait; force EoU
)
```
Semantic EoU adds ~30–80 ms of latency but prevents a significantly worse user experience: a bot that interrupts mid-sentence, generates a partial-question response, then has to recover. The rule of thumb: use silence-only EoU for short-form interactions (yes/no, number entry) and semantic EoU for open-ended conversation. The LiveKit turn-detector v1 has an EoU error rate of ~4 % vs. ~18 % for 300 ms silence alone on the Santa Barbara Corpus of Spoken American English.
Interruption (barge-in) happens when the user speaks while the bot is playing audio. This is natural in human conversation — we interrupt to redirect, correct, or confirm. A bot that ignores barge-in feels robotic; a bot that is over-sensitive cancels unnecessarily. Getting this right requires coordination across three layers: VAD, pipeline state machine, and TTS playout.
Handling barge-in correctly involves four steps; a sketch of the cancellation sequence follows this list.

1. Cancel the LLM stream. Cancel the streaming HTTP request to the LLM provider: `response_task.cancel()` on the asyncio task wrapping the streaming call. Drop all buffered tokens; do not flush partial text to TTS.
2. Stop TTS playout. Stop streaming from the TTS WebSocket. In Pipecat: send a `TTSStoppedFrame`, which causes `OutputTransport` to stop writing PCM. The in-flight audio in the WebRTC jitter buffer will drain (typically 1–2 chunks, ≈ 200–500 ms). You cannot instantly silence already-transmitted audio; account for the drain latency in UX testing.
3. Mark the interruption in context. Inject a truncation marker into the LLM context so the model knows it was interrupted: `{"role":"assistant","content":"[interrupted]"}`. Without this, the model's next response may try to continue the cancelled sentence, producing incoherent speech.
4. Desensitise the VAD during playout. Set the VAD threshold higher during TTS playout (≈ 0.7 vs 0.5 at rest) to avoid triggering on background noise or the user clearing their throat. Pipecat exposes `vad.set_args(threshold=0.7)`, which you call in a `TTSStartedFrame` handler and revert on `TTSStoppedFrame`. LiveKit's `VoicePipelineAgent` does this automatically when `allow_interruptions=True`.
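A sketch of steps 1–3 as a single handler. `state` bundles illustrative handles (the LLM asyncio task, a token buffer, a TTS client, the chat context); none of these names come from a specific framework:

```python
async def on_barge_in(state) -> None:
    # 1. Kill the in-flight LLM stream and drop any buffered tokens.
    state.llm_task.cancel()
    state.token_buffer.clear()
    # 2. Stop TTS playout; already-transmitted audio drains from the
    #    jitter buffer on its own (~200–500 ms, per above).
    await state.tts.stop()
    # 3. Tell the model it was cut off so it doesn't resume the sentence.
    state.context.append({"role": "assistant", "content": "[interrupted]"})
    # (Step 4, the VAD threshold reset, lives in the TTSStoppedFrame handler.)
```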
The gap between LLM token generation and TTS audio output is filled by two buffers: the TTS input buffer (text waiting to be synthesised) and the TTS output buffer (PCM audio waiting to be played). Getting the buffer sizes right prevents both underruns (silence gaps) and overruns (delayed cancellation on barge-in).
Buffer LLM tokens until a sentence-final boundary: `.`, `!`, `?`, or `;` followed by whitespace. Send the complete sentence to TTS as a single call. This is the lowest-latency strategy that avoids prosody artefacts from mid-sentence splits. Implementation: a small FSM or the regex `r"(?<=[.!?;])\s"`.
If no sentence boundary arrives within 200 ms of the last token, flush the current buffer regardless. This prevents “silent wait” at the start of long LLM outputs where the model takes >200 ms to emit the first sentence boundary. Pipecat’s LLMFullResponseAggregator implements both strategies with sentence_aggregator_timeout_secs=0.2.
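A sketch combining both strategies over an async token stream; the function names are illustrative, and Pipecat's aggregator implements the equivalent logic internally:

```python
import asyncio
import re

BOUNDARY = re.compile(r"(?<=[.!?;])\s")  # sentence-final punctuation + whitespace
FLUSH_TIMEOUT_S = 0.2                    # timeout flush from the paragraph above

async def aggregate(tokens, send_to_tts):
    """Forward complete sentences to TTS; flush partials after 200 ms of quiet."""
    queue: asyncio.Queue = asyncio.Queue()

    async def pump():
        async for t in tokens:
            await queue.put(t)
        await queue.put(None)            # end-of-stream sentinel

    pump_task = asyncio.create_task(pump())
    buf = ""
    while True:
        try:
            token = await asyncio.wait_for(queue.get(), timeout=FLUSH_TIMEOUT_S)
        except asyncio.TimeoutError:
            if buf.strip():              # no boundary within 200 ms: flush anyway
                await send_to_tts(buf)
                buf = ""
            continue
        if token is None:
            break
        buf += token
        parts = BOUNDARY.split(buf)
        for sentence in parts[:-1]:      # complete sentences ship immediately
            await send_to_tts(sentence)
        buf = parts[-1]                  # keep the trailing partial sentence
    if buf.strip():
        await send_to_tts(buf)           # trailing fragment at end of response
    await pump_task
```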
| Buffer size | Duration (44.1 kHz stereo PCM) | Trade-off |
|---|---|---|
| 1 chunk (4096 samples) | ~93 ms | Minimum latency; any jitter causes audible gaps |
| 3 chunks (12288 samples) | ~278 ms | Recommended minimum for network streaming (WebSocket latency jitter ~50 ms p95) |
| 6 chunks (24576 samples) | ~557 ms | Safe for high-jitter PSTN paths; adds 280 ms to cancel latency on barge-in |
| Adaptive (Cartesia default) | ~150–350 ms | Grows during network congestion, shrinks during low jitter; best default |
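The duration column follows directly from sample count over sample rate:

```python
def buffer_ms(samples: int, sample_rate_hz: int = 44_100) -> float:
    """Playout duration of a PCM buffer; channel count does not change it."""
    return samples / sample_rate_hz * 1000

buffer_ms(4096)    # ≈ 92.9 ms  (the 1-chunk row)
buffer_ms(12288)   # ≈ 278.6 ms (3 chunks)
buffer_ms(24576)   # ≈ 557.2 ms (6 chunks)
```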
For multi-sentence responses, begin streaming the first sentence to TTS before the LLM has finished generating sentence two. The TTS is playing sentence one while sentence two is still being generated, hiding both TTS processing latency and inter-sentence LLM generation time behind audio playout. Pipecat does this automatically when you pipe LLM streaming output directly to a TTS processor without a FullResponseAggregator.
When the pipeline latency genuinely exceeds 500 ms, a purely technical fix is not always feasible. Filler words (“Let me check that…”, “Sure, one moment…”) are a social-engineering technique that masks perceived latency by giving the user something to hear immediately, shifting the clock from “silence start” to “content start”.
On EoU detection, immediately synthesise and begin playing a short filler phrase from a pre-rendered audio library (not through TTS — that adds latency!). Pre-render 5–10 phrases as PCM files at startup: “Let me think…”, “Sure…”, “Absolutely…”. Play one at random while the LLM generates. Total added latency: 0 ms (playback begins instantly from memory).
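A sketch of the pre-render-and-play pattern; `synthesize` and `play_pcm` are stand-ins for your TTS client and output transport, not library functions:

```python
import random

FILLER_PHRASES = ["Let me think…", "Sure…", "Absolutely…", "One moment…"]

async def prerender_fillers(synthesize) -> dict[str, bytes]:
    # Run once at startup: phrase -> raw PCM, kept in memory.
    return {phrase: await synthesize(phrase) for phrase in FILLER_PHRASES}

async def on_end_of_utterance(fillers: dict[str, bytes], play_pcm) -> None:
    # Playback starts instantly from memory: 0 ms added synthesis latency.
    await play_pcm(random.choice(list(fillers.values())))
    # ...the LLM generates concurrently while the filler plays.
```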
Prompt the LLM to begin every response with a one-word or one-phrase acknowledgement: “Always start your reply with a brief acknowledgement word or phrase like ‘Sure’, ‘Got it’, or ‘Of course’ before the substantive content.” This is synthesised as part of the normal TTS pipeline; no audio pre-rendering needed. The filler is typically 1–3 tokens, so TTS first-chunk arrives at ~90 ms (Cartesia Sonic) with a fully natural voice rather than a recorded clip.
| Scenario | Approach | Rationale |
|---|---|---|
| Tool call / database lookup (>1 s delay) | Pre-rendered filler + progress update | LLM cannot reply until tool result returns; pre-rendered plays instantly |
| General response (<1 s expected) | LLM-instructed filler | Natural voice continuity; no jarring transition from recorded to synthesised |
| High-emotion / empathetic context | LLM-instructed only | Pre-rendered clips sound robotic in emotional moments; LLM can vary filler by context |
| PSTN / telephony (legacy phones) | Comfort noise injection | Some legacy handsets interpret silence as call drop; DTMF comfort-noise burst keeps circuit alive |
Users exposed to the same filler phrase >3 times in a session rate it as “robotic” (internal A/B test data from Daily.co, 2024). Maintain a per-session filler history and weight against recently used phrases. A library of 8–12 phrases with weighted random selection reduces repetition to imperceptible levels.
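A sketch of the weighted selection, assuming a simple recency penalty; the class and parameter names are illustrative:

```python
import random

class FillerPicker:
    """Weighted random choice that down-weights recently used phrases."""

    def __init__(self, phrases: list[str], penalty: float = 0.1, memory: int = 3):
        self.phrases = phrases
        self.penalty = penalty            # weight multiplier for recent phrases
        self.memory = memory              # how many recent picks to remember
        self.recent: list[str] = []

    def pick(self) -> str:
        weights = [self.penalty if p in self.recent else 1.0 for p in self.phrases]
        choice = random.choices(self.phrases, weights=weights, k=1)[0]
        self.recent = (self.recent + [choice])[-self.memory:]
        return choice
```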
You cannot optimise what you do not measure. A production voice bot should emit structured telemetry for every session turn. The key metrics span three categories: latency, quality, and conversation health.
```python
import time

from opentelemetry import metrics, trace
from opentelemetry.trace import SpanKind
# Import paths vary slightly across Pipecat versions.
from pipecat.frames.frames import TranscriptionFrame, UserStartedSpeakingFrame
from pipecat.services.deepgram import DeepgramSTTService

tracer = trace.get_tracer("voice-bot")
meter = metrics.get_meter("voice-bot")
HISTOGRAM_ASR = meter.create_histogram("asr_final_latency_ms", unit="ms")

class InstrumentedSTTService(DeepgramSTTService):
    async def process_frame(self, frame, direction):
        if isinstance(frame, UserStartedSpeakingFrame):
            self._eou_start = time.monotonic()
        # Pipecat emits InterimTranscriptionFrame for partials;
        # TranscriptionFrame is the final transcript.
        if isinstance(frame, TranscriptionFrame):
            latency_ms = (time.monotonic() - self._eou_start) * 1000
            with tracer.start_as_current_span("asr_final_latency",
                                              kind=SpanKind.INTERNAL) as span:
                span.set_attribute("latency_ms", latency_ms)
                span.set_attribute("transcript_len", len(frame.text))
            HISTOGRAM_ASR.record(latency_ms)
        await super().process_frame(frame, direction)
```
| Metric | Target | Alert threshold | Instrumentation point |
|---|---|---|---|
| `eou_to_first_audio_ms` | p50 < 700 ms | p95 > 1200 ms | EoU frame → first `AudioRawFrame` out |
| `asr_final_latency_ms` | p50 < 200 ms | p95 > 500 ms | EoU frame → final `TranscriptionFrame` |
| `llm_ttft_ms` | p50 < 350 ms | p95 > 800 ms | Prompt sent to LLM → first token received |
| `tts_first_chunk_ms` | p50 < 150 ms | p95 > 400 ms | First sentence sent to TTS → first audio chunk received |
| `barge_in_rate` | 3–8 % | > 20 % (bot too slow); < 1 % (possible echo-cancel issue) | Count of `UserStartedSpeakingFrame` during GENERATING state |
| `false_eou_rate` | < 5 % | > 15 % | Manual annotation of recordings; or heuristic: LLM response < 5 tokens to a > 50-word utterance |
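The `false_eou_rate` heuristic from the last row is one line of logic:

```python
def looks_like_false_eou(user_word_count: int, llm_reply_tokens: int) -> bool:
    # A tiny reply to a long utterance usually means the bot cut the user off.
    return user_word_count > 50 and llm_reply_tokens < 5
```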
Store the full audio and transcript of a random 5 % sample of sessions (with user consent, per GDPR Art. 6 legitimate interest or explicit consent). Manual review of the worst 1 % by `eou_to_first_audio_ms` will surface the specific pipeline stage responsible for latency spikes far faster than any automated alert. Use DNSMOS P.835 (v3) to auto-score audio quality on the sample and surface MOS regressions after provider updates.
Key takeaways:

- On barge-in: cancel the LLM task, send `TTSStoppedFrame` to drain the playout buffer, and inject `[interrupted]` into the LLM context to prevent incoherent continuation.
- Track `eou_to_first_audio_ms` at p50 and p95. Alert on p95 > 1200 ms.
- Record 5 % of sessions; review the worst 1 % by latency to find the bottleneck.

These three decks cover the complete voice stack: Deck 01 gave you the ASR and TTS component choices and their quality/latency/cost trade-offs. Deck 02 showed you the orchestration frameworks that wire them together. Deck 03 (this one) explained how to measure, diagnose, and defend the 800 ms latency budget in production. The Voice & Real-Time Agents sub-hub links the full series index.