Where every millisecond goes: VAD, EoU detection, semantic turn-taking, interruption handling, buffering strategies, filler words, and the diagnostics that tell you what to fix.
The 800 ms figure comes from conversation science, not engineering convention. Stivers et al. (PNAS, 2009) analysed 10 languages and found the median inter-turn gap is 200 ms, with the distribution tailing off sharply above 600 ms. By 800 ms, listeners start generating “uh, are you there?” uncertainty signals. By 2000 ms they assume a breakdown.
How the tiers feel in practice:

- Matches the fastest human conversation. Achievable with the OpenAI Realtime API or a well-tuned Cartesia/Flash + Haiku pipeline. Users report the bot feels “quick-thinking”.
- The target zone for most production bots. A slight pause is noticeable but interpreted as natural processing. A Deepgram Nova-3 + gpt-4o-mini + Cartesia Sonic pipeline sits at ~550–650 ms.
- Past the budget, users interpret silence as failure, a network drop, or the bot not having heard them. Barge-in rates increase; satisfaction scores drop significantly. Whisper Large-v3 + GPT-4o + ElevenLabs v2 without streaming lands here.
The clock starts when the user finishes speaking (EoU event) and stops when the first audio byte exits the speaker. This includes: ASR final latency, any LLM prefill and TTFT, TTS first-chunk latency, and network round-trip. Every millisecond from every stage counts against the same budget.
A production voice pipeline at p50 latency breaks down roughly as shown below. Each row is a stage whose latency stacks toward the total, and the optimisation levers differ by stage.
| Stage | Slow baseline | Optimised | Fastest possible | Key lever |
|---|---|---|---|---|
| VAD + EoU silence buffer | 600 ms | 300 ms | 100 ms | Silence threshold: lower = faster but more false triggers |
| ASR final transcript | 500 ms (Whisper Large) | 180 ms (Deepgram Nova-3) | 0 ms (overlap with EoU) | Cloud streaming ASR with speech_final flag |
| LLM TTFT | 700 ms (GPT-4o full prompt) | 280 ms (Claude 3.5 Haiku) | 180 ms (Llama 3.1-8B local) | Short system prompt; prompt caching; smaller model |
| TTS first chunk | 400 ms (ElevenLabs v2) | 90 ms (Cartesia Sonic) | 75 ms (ElevenLabs Flash) | Latency-first TTS; stream on sentence boundary |
| Network (server → client) | 80 ms (cross-region) | 20 ms (same-region) | 5 ms (co-located) | Deploy agent close to TTS service; use CDN edge |
Three worked budgets:

- Slow baseline: EoU 400 + ASR 500 + LLM 700 + TTS 300 + net 50 ≈ 1950 ms. More than 2× over budget. Do not ship this without aggressive optimisation.
- Optimised pipeline: EoU 300 + ASR 180 + LLM 320 + TTS 90 + net 20 ≈ 910 ms. Near budget. Tune EoU silence to 200 ms to reach ~810 ms.
- Speech-to-speech model: EoU 300 + audio→audio TTFT 380 + net 20 ≈ 700 ms. Under budget at p50. At p95 (busy periods) it rises to ≈ 950 ms, still acceptable.
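To make this arithmetic repeatable, a small helper can sum stage p50s against the budget and rank the line items. A minimal sketch; the stage names and numbers are the optimised-pipeline estimates above, not measurements from a live system:

```python
# Sanity-check a proposed stack against the 800 ms budget before building it.
BUDGET_MS = 800

def check_budget(stages: dict[str, float]) -> None:
    total = sum(stages.values())
    verdict = "OK" if total <= BUDGET_MS else f"over by {total - BUDGET_MS:.0f} ms"
    print(f"{total:.0f} ms total ({verdict})")
    for name, ms in sorted(stages.items(), key=lambda kv: -kv[1]):
        print(f"  {name:<10} {ms:>4.0f} ms ({ms / total:.0%})")

check_budget({"eou": 300, "asr": 180, "llm_ttft": 320, "tts_first": 90, "net": 20})
# → 910 ms total (over by 110 ms); llm_ttft is the largest line item
```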
With streaming ASR (interim_results=true), the LLM can start receiving partial transcripts before speech_final is set. A common pattern is to begin LLM pre-fill with interim transcripts and only “commit” the generation after the final transcript arrives. This overlaps ≈ 200 ms of ASR wait with LLM prefill, effectively removing ASR from the serial budget.
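One way to sketch the pattern in asyncio: start a speculative LLM task on each new interim transcript, keep it if the final transcript matches the last interim, and restart otherwise. All names here (`llm_stream`, the handler methods) are illustrative, not any framework's API:

```python
import asyncio

class SpeculativePrefill:
    """Start the LLM on interim transcripts; commit only on speech_final."""

    def __init__(self, llm_stream):
        self.llm_stream = llm_stream          # async fn: transcript -> response
        self.task: asyncio.Task | None = None
        self.last_interim = ""

    def on_interim(self, transcript: str) -> None:
        if transcript == self.last_interim:
            return
        self.last_interim = transcript
        if self.task:
            self.task.cancel()                # transcript changed: restart speculation
        self.task = asyncio.create_task(self.llm_stream(transcript))

    async def on_final(self, transcript: str):
        if self.task and transcript == self.last_interim:
            return await self.task            # speculation was right: tokens already flowing
        if self.task:
            self.task.cancel()                # speculation was wrong: pay full TTFT
        return await self.llm_stream(transcript)
```

In practice you throttle the restarts (for example, only speculate on interims that have been stable for a few frames) so you are not paying for many cancelled generations.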
Voice Activity Detection (VAD) answers the binary question: is the user speaking right now? End-of-Utterance (EoU) detection answers: has the user finished speaking? They are related but distinct. A VAD that fires on every voiced frame will over-trigger; a silence-based EoU that waits 800 ms will feel sluggish.
Silero VAD (Silero AI, 2021; v5.1 as of 2025) is an 8-layer LSTM-based classifier trained on 6,000 h of noisy multilingual audio. Input: 512-sample (32 ms at 16 kHz) STFT-magnitude chunk. Output: speech probability 0.0–1.0. Inference on CPU: <1 ms / chunk; ONNX-exported for cross-platform use. Model size: 1.6 MB. Ships with Pipecat as SileroVADAnalyzer; available directly via silero_vad PyPI package.
Key parameters:

- `threshold` (default 0.5): speech/non-speech boundary. Lower = more sensitive, more false positives. 0.35 is common for noisy call-centre audio.
- `min_speech_duration_ms` (default 250): ignore bursts shorter than this. Filters coughs, clicks.
- `min_silence_duration_ms` (default 100): minimum silence before a speech segment is ended. Set higher to avoid splitting mid-sentence pauses.
- `speech_pad_ms` (default 30): padding added to both ends of a detected segment.

A minimal silence-based EoU loop on top of Silero VAD (`on_speech_started` and `on_end_of_utterance` are application-defined callbacks):

```python
import time

import numpy as np
import torch
from silero_vad import load_silero_vad

model = load_silero_vad()

EOU_SILENCE_MS = 300  # declare end-of-utterance after 300 ms of silence
last_speech: float = 0.0
in_utterance: bool = False

async def process_chunk(pcm_chunk: bytes) -> None:
    global last_speech, in_utterance
    # Model expects a float32 tensor of shape [512]; scale 16-bit PCM to [-1, 1].
    pcm_tensor = torch.from_numpy(
        np.frombuffer(pcm_chunk, dtype=np.int16).astype(np.float32) / 32768.0
    )
    prob = model(pcm_tensor, 16000).item()
    now = time.monotonic() * 1000
    if prob > 0.5:
        last_speech = now
        if not in_utterance:
            in_utterance = True
            await on_speech_started()
    elif in_utterance and (now - last_speech) > EOU_SILENCE_MS:
        in_utterance = False
        await on_end_of_utterance()  # → trigger ASR final + LLM
```
When the TTS is playing audio, that audio leaks back into the microphone. Without acoustic echo cancellation (AEC), the VAD will detect the bot’s own voice as user speech and enter an infinite loop. In WebRTC (LiveKit / Daily) AEC is handled by the browser’s WebRTC stack. In a telephony pipeline, you must gate the VAD with a TTSStartedFrame / TTSStoppedFrame signal and suppress VAD during TTS playout.
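For telephony, the gating can live in a small frame processor between the transport and the VAD. A sketch in the Pipecat style: `TTSStartedFrame` and `TTSStoppedFrame` are real Pipecat frames, but the `VADGate` processor is an illustration, and frame class names vary slightly across Pipecat versions:

```python
from pipecat.frames.frames import AudioRawFrame, Frame, TTSStartedFrame, TTSStoppedFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

class VADGate(FrameProcessor):
    """Drop inbound mic audio while the bot is speaking (no browser AEC)."""

    def __init__(self):
        super().__init__()
        self._tts_playing = False

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, TTSStartedFrame):
            self._tts_playing = True
        elif isinstance(frame, TTSStoppedFrame):
            self._tts_playing = False
        # Suppress mic audio during playout so the VAD never hears the bot's voice.
        if (self._tts_playing and isinstance(frame, AudioRawFrame)
                and direction == FrameDirection.DOWNSTREAM):
            return
        await self.push_frame(frame, direction)
```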
Silence-based EoU has a fundamental problem: speakers pause mid-sentence. A 300 ms silence after “I want to book a flight to —” should not trigger a bot response. Semantic turn-taking uses a small model to predict whether the user has finished a complete thought, not just gone quiet.
A small binary text classifier (e.g. a fine-tuned DistilBERT or a 3B-parameter SLM) receives the interim ASR transcript and outputs P(turn_complete). Threshold at ~0.85. LiveKit uses this pattern with an optional turn_detector plugin (model: livekit-agents/turn-detector-v1, 60M params, fine-tuned from SmolLM). Adds ~30 ms latency for inference but reduces false EoU by ~40 %.
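The classifier side of this pattern is easy to sketch with a Hugging Face text-classification pipeline. The checkpoint name below is a placeholder; any sequence classifier fine-tuned with {complete, incomplete} labels slots in the same way:

```python
from transformers import pipeline

# Hypothetical fine-tuned checkpoint; substitute your own EoU classifier.
clf = pipeline("text-classification", model="your-org/eou-distilbert")

def turn_complete(interim_transcript: str, threshold: float = 0.85) -> bool:
    scores = clf(interim_transcript, top_k=None)  # scores for every label
    p_complete = next(s["score"] for s in scores if s["label"] == "complete")
    return p_complete >= threshold

turn_complete("I want to book a flight to")          # → False: keep listening
turn_complete("I want to book a flight to Berlin.")  # → True: trigger EoU
```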
Deepgram’s speech_final: true flag and AssemblyAI’s session_information.punctuate both apply sentence-boundary models on the server side. These are effectively semantic EoU: the model has detected a sentence-final intonation pattern or punctuation. Enabling them adds ≈ 50–80 ms server-side processing but eliminates most mid-sentence false triggers.
```python
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import cartesia, deepgram, openai, silero, turn_detector

agent = VoicePipelineAgent(
    vad=silero.VAD.load(),
    stt=deepgram.STT(model="nova-3",
                     endpointing_ms=25),  # short silence; turn-detector controls EoU
    turn_detector=turn_detector.EOUModel(
        unlikely_threshold=0.3,  # below this: definitely not EoU
        likely_threshold=0.8,    # above this: definitely EoU
    ),
    llm=openai.LLM(model="gpt-4o-mini"),
    tts=cartesia.TTS(voice="a0e99841..."),
    min_endpointing_delay=0.2,  # 200 ms minimum silence regardless
    max_endpointing_delay=1.0,  # 1 s maximum wait; force EoU
)
```
Semantic EoU adds ~30–80 ms of latency but prevents a significantly worse user experience: a bot that interrupts mid-sentence, generates a partial-question response, then has to recover. The rule of thumb: use silence-only EoU for short-form interactions (yes/no, number entry) and semantic EoU for open-ended conversation. The LiveKit turn-detector v1 has an EoU error rate of ~4 % vs. ~18 % for 300 ms silence alone on the Santa Barbara Corpus of Spoken American English.
Interruption (barge-in) happens when the user speaks while the bot is playing audio. This is natural in human conversation — we interrupt to redirect, correct, or confirm. A bot that ignores barge-in feels robotic; a bot that is over-sensitive cancels unnecessarily. Getting this right requires coordination across three layers: VAD, pipeline state machine, and TTS playout.
Handling barge-in correctly involves four steps; a sketch of the cancellation sequence follows this list.

1. Cancel the LLM stream. Cancel the streaming HTTP request to the LLM provider: `response_task.cancel()` on the asyncio task wrapping the streaming call. Drop all buffered tokens; do not flush partial text to TTS.
2. Stop TTS playout. Stop streaming from the TTS WebSocket. In Pipecat: send a `TTSStoppedFrame`, which causes `OutputTransport` to stop writing PCM. The in-flight audio in the WebRTC jitter buffer will drain (typically 1–2 chunks, ≈ 200–500 ms). You cannot instantly silence already-transmitted audio; account for the drain latency in UX testing.
3. Mark the interruption in context. Inject a truncation marker into the LLM context so the model knows it was interrupted: `{"role":"assistant","content":"[interrupted]"}`. Without this, the model's next response may try to continue the cancelled sentence, producing incoherent speech.
4. Desensitise the VAD during playout. Set the VAD threshold higher during TTS playout (≈ 0.7 vs 0.5 at rest) to avoid triggering on background noise or the user clearing their throat. Pipecat exposes `vad.set_args(threshold=0.7)`, which you call in a `TTSStartedFrame` handler and revert on `TTSStoppedFrame`. LiveKit's `VoicePipelineAgent` does this automatically when `allow_interruptions=True`.
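A sketch of steps 1–3 as a single handler. `state` bundles illustrative handles (the LLM asyncio task, a token buffer, a TTS client, the chat context); none of these names come from a specific framework:

```python
async def on_barge_in(state) -> None:
    # 1. Kill the in-flight LLM stream and drop any buffered tokens.
    state.llm_task.cancel()
    state.token_buffer.clear()
    # 2. Stop TTS playout; already-transmitted audio drains from the
    #    jitter buffer on its own (~200–500 ms, per above).
    await state.tts.stop()
    # 3. Tell the model it was cut off so it doesn't resume the sentence.
    state.context.append({"role": "assistant", "content": "[interrupted]"})
    # (Step 4, the VAD threshold reset, lives in the TTSStoppedFrame handler.)
```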
The gap between LLM token generation and TTS audio output is filled by two buffers: the TTS input buffer (text waiting to be synthesised) and the TTS output buffer (PCM audio waiting to be played). Getting the buffer sizes right prevents both underruns (silence gaps) and overruns (delayed cancellation on barge-in).
Buffer LLM tokens until a sentence-final boundary: `.`, `!`, `?`, or `;` followed by whitespace. Send the complete sentence to TTS as a single call. This is the lowest-latency strategy that avoids prosody artefacts from mid-sentence splits. Implementation: a small FSM or the regex `r"(?<=[.!?;])\s"`.
If no sentence boundary arrives within 200 ms of the last token, flush the current buffer regardless. This prevents “silent wait” at the start of long LLM outputs where the model takes >200 ms to emit the first sentence boundary. Pipecat’s LLMFullResponseAggregator implements both strategies with sentence_aggregator_timeout_secs=0.2.
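A sketch combining both strategies over an async token stream; the function names are illustrative, and Pipecat's aggregator implements the equivalent logic internally:

```python
import asyncio
import re

BOUNDARY = re.compile(r"(?<=[.!?;])\s")  # sentence-final punctuation + whitespace
FLUSH_TIMEOUT_S = 0.2                    # timeout flush from the paragraph above

async def aggregate(tokens, send_to_tts):
    """Forward complete sentences to TTS; flush partials after 200 ms of quiet."""
    queue: asyncio.Queue = asyncio.Queue()

    async def pump():
        async for t in tokens:
            await queue.put(t)
        await queue.put(None)            # end-of-stream sentinel

    pump_task = asyncio.create_task(pump())
    buf = ""
    while True:
        try:
            token = await asyncio.wait_for(queue.get(), timeout=FLUSH_TIMEOUT_S)
        except asyncio.TimeoutError:
            if buf.strip():              # no boundary within 200 ms: flush anyway
                await send_to_tts(buf)
                buf = ""
            continue
        if token is None:
            break
        buf += token
        parts = BOUNDARY.split(buf)
        for sentence in parts[:-1]:      # complete sentences ship immediately
            await send_to_tts(sentence)
        buf = parts[-1]                  # keep the trailing partial sentence
    if buf.strip():
        await send_to_tts(buf)           # trailing fragment at end of response
    await pump_task
```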
| Buffer size | Duration (44.1 kHz stereo PCM) | Trade-off |
|---|---|---|
| 1 chunk (4096 samples) | ~93 ms | Minimum latency; any jitter causes audible gaps |
| 3 chunks (12288 samples) | ~278 ms | Recommended minimum for network streaming (WebSocket latency jitter ~50 ms p95) |
| 6 chunks (24576 samples) | ~557 ms | Safe for high-jitter PSTN paths; adds 280 ms to cancel latency on barge-in |
| Adaptive (Cartesia default) | ~150–350 ms | Grows during network congestion, shrinks during low jitter; best default |
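The duration column follows directly from sample count over sample rate:

```python
def buffer_ms(samples: int, sample_rate_hz: int = 44_100) -> float:
    """Playout duration of a PCM buffer; channel count does not change it."""
    return samples / sample_rate_hz * 1000

buffer_ms(4096)    # ≈ 92.9 ms  (the 1-chunk row)
buffer_ms(12288)   # ≈ 278.6 ms (3 chunks)
buffer_ms(24576)   # ≈ 557.2 ms (6 chunks)
```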
For multi-sentence responses, begin streaming the first sentence to TTS before the LLM has finished generating sentence two. The TTS is playing sentence one while sentence two is still being generated, hiding both TTS processing latency and inter-sentence LLM generation time behind audio playout. Pipecat does this automatically when you pipe LLM streaming output directly to a TTS processor without a FullResponseAggregator.
When the pipeline latency genuinely exceeds 500 ms, a purely technical fix is not always feasible. Filler words (“Let me check that…”, “Sure, one moment…”) are a social-engineering technique that masks perceived latency by giving the user something to hear immediately, shifting the clock from “silence start” to “content start”.
On EoU detection, immediately synthesise and begin playing a short filler phrase from a pre-rendered audio library (not through TTS — that adds latency!). Pre-render 5–10 phrases as PCM files at startup: “Let me think…”, “Sure…”, “Absolutely…”. Play one at random while the LLM generates. Total added latency: 0 ms (playback begins instantly from memory).
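A sketch of the pre-render-and-play pattern; `synthesize` and `play_pcm` are stand-ins for your TTS client and output transport, not library functions:

```python
import random

FILLER_PHRASES = ["Let me think…", "Sure…", "Absolutely…", "One moment…"]

async def prerender_fillers(synthesize) -> dict[str, bytes]:
    # Run once at startup: phrase -> raw PCM, kept in memory.
    return {phrase: await synthesize(phrase) for phrase in FILLER_PHRASES}

async def on_end_of_utterance(fillers: dict[str, bytes], play_pcm) -> None:
    # Playback starts instantly from memory: 0 ms added synthesis latency.
    await play_pcm(random.choice(list(fillers.values())))
    # ...the LLM generates concurrently while the filler plays.
```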
Prompt the LLM to begin every response with a one-word or one-phrase acknowledgement: “Always start your reply with a brief acknowledgement word or phrase like ‘Sure’, ‘Got it’, or ‘Of course’ before the substantive content.” This is synthesised as part of the normal TTS pipeline; no audio pre-rendering needed. The filler is typically 1–3 tokens, so TTS first-chunk arrives at ~90 ms (Cartesia Sonic) with a fully natural voice rather than a recorded clip.
| Scenario | Approach | Rationale |
|---|---|---|
| Tool call / database lookup (>1 s delay) | Pre-rendered filler + progress update | LLM cannot reply until tool result returns; pre-rendered plays instantly |
| General response (<1 s expected) | LLM-instructed filler | Natural voice continuity; no jarring transition from recorded to synthesised |
| High-emotion / empathetic context | LLM-instructed only | Pre-rendered clips sound robotic in emotional moments; LLM can vary filler by context |
| PSTN / telephony (legacy phones) | Comfort noise injection | Some legacy handsets interpret silence as call drop; DTMF comfort-noise burst keeps circuit alive |
Users exposed to the same filler phrase >3 times in a session rate it as “robotic” (internal A/B test data from Daily.co, 2024). Maintain a per-session filler history and weight against recently used phrases. A library of 8–12 phrases with weighted random selection reduces repetition to imperceptible levels.
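A sketch of the weighted selection, assuming a simple recency penalty; the class and parameter names are illustrative:

```python
import random

class FillerPicker:
    """Weighted random choice that down-weights recently used phrases."""

    def __init__(self, phrases: list[str], penalty: float = 0.1, memory: int = 3):
        self.phrases = phrases
        self.penalty = penalty            # weight multiplier for recent phrases
        self.memory = memory              # how many recent picks to remember
        self.recent: list[str] = []

    def pick(self) -> str:
        weights = [self.penalty if p in self.recent else 1.0 for p in self.phrases]
        choice = random.choices(self.phrases, weights=weights, k=1)[0]
        self.recent = (self.recent + [choice])[-self.memory:]
        return choice
```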
You cannot optimise what you do not measure. A production voice bot should emit structured telemetry for every session turn. The key metrics span three categories: latency, quality, and conversation health.
```python
import time

from opentelemetry import metrics, trace
from opentelemetry.trace import SpanKind
# Import paths vary slightly across Pipecat versions.
from pipecat.frames.frames import TranscriptionFrame, UserStartedSpeakingFrame
from pipecat.services.deepgram import DeepgramSTTService

tracer = trace.get_tracer("voice-bot")
meter = metrics.get_meter("voice-bot")
HISTOGRAM_ASR = meter.create_histogram("asr_final_latency_ms", unit="ms")

class InstrumentedSTTService(DeepgramSTTService):
    async def process_frame(self, frame, direction):
        if isinstance(frame, UserStartedSpeakingFrame):
            self._eou_start = time.monotonic()
        # Pipecat emits InterimTranscriptionFrame for partials;
        # TranscriptionFrame is the final transcript.
        if isinstance(frame, TranscriptionFrame):
            latency_ms = (time.monotonic() - self._eou_start) * 1000
            with tracer.start_as_current_span("asr_final_latency",
                                              kind=SpanKind.INTERNAL) as span:
                span.set_attribute("latency_ms", latency_ms)
                span.set_attribute("transcript_len", len(frame.text))
            HISTOGRAM_ASR.record(latency_ms)
        await super().process_frame(frame, direction)
```
| Metric | Target | Alert threshold | Instrumentation point |
|---|---|---|---|
| `eou_to_first_audio_ms` | p50 < 700 ms | p95 > 1200 ms | EoU frame → first `AudioRawFrame` out |
| `asr_final_latency_ms` | p50 < 200 ms | p95 > 500 ms | EoU frame → final `TranscriptionFrame` |
| `llm_ttft_ms` | p50 < 350 ms | p95 > 800 ms | Prompt sent to LLM → first token received |
| `tts_first_chunk_ms` | p50 < 150 ms | p95 > 400 ms | First sentence sent to TTS → first audio chunk received |
| `barge_in_rate` | 3–8 % | > 20 % (bot too slow); < 1 % (possible echo-cancel issue) | Count of `UserStartedSpeakingFrame` during GENERATING state |
| `false_eou_rate` | < 5 % | > 15 % | Manual annotation of recordings; or heuristic: LLM response < 5 tokens to a > 50-word utterance |
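The `false_eou_rate` heuristic from the last row is one line of logic:

```python
def looks_like_false_eou(user_word_count: int, llm_reply_tokens: int) -> bool:
    # A tiny reply to a long utterance usually means the bot cut the user off.
    return user_word_count > 50 and llm_reply_tokens < 5
```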
Store the full audio and transcript of a random 5 % sample of sessions (with user consent, per GDPR Art. 6 legitimate interest or explicit consent). Manual review of the worst 1 % by `eou_to_first_audio_ms` will surface the specific pipeline stage responsible for latency spikes far faster than any automated alert. Use DNSMOS P.835 (v3) to auto-score audio quality on the sample and surface MOS regressions after provider updates.
Key takeaways:

- On barge-in: cancel the LLM task, send `TTSStoppedFrame` to drain the playout buffer, and inject `[interrupted]` into the LLM context to prevent incoherent continuation.
- Track `eou_to_first_audio_ms` at p50 and p95. Alert on p95 > 1200 ms.
- Record 5 % of sessions; review the worst 1 % by latency to find the bottleneck.

These three decks cover the complete voice stack: Deck 01 gave you the ASR and TTS component choices and their quality/latency/cost trade-offs. Deck 02 showed you the orchestration frameworks that wire them together. Deck 03 (this one) explained how to measure, diagnose, and defend the 800 ms latency budget in production. The Voice & Real-Time Agents sub-hub links the full series index.