Key Takeaways
- Sub-300ms end-to-end latency is the human-conversation threshold for voice AI.
- The latency budget breaks into four layers: STT (80–120ms), LLM first-token (150–250ms), TTS first-chunk (60–100ms), and network transport (20–60ms). Missing the target in any one layer can push the total past 500ms.
- WebRTC with ICE Trickle is the correct transport for browser and mobile clients. SIP is the right choice for PSTN integration and legacy telephony.
- LiveKit SFU reduces media server complexity by forwarding encoded streams rather than decoding and re-mixing them, and its hosted tier removes the need to operate a media server fleet entirely.
Why Voice AI Fails in Production
Voice AI demos look deceptively easy. A GPT-4o API call, a TTS response, a microphone input — connected together in 200 lines of Python, the thing works. Then you put it in front of real users and it fails.
The failure is almost never the model. It is the architecture.
In production at 2000+ calls per day — the scale Prodinit operates for a healthcare scheduling platform — three classes of failure dominate: latency spikes that destroy conversational flow, audio glitches from unmanaged WebRTC sessions, and compliance gaps where customer PII surfaces in LLM provider logs. None of these appear in a notebook demo. All of them have architecture solutions.
This guide walks through the complete production stack: what latency target you are actually trying to hit, how the budget breaks across each layer, the transport architecture that achieves it, and the security and observability instrumentation that keeps it running without surprises.
What Latency Is Acceptable for Voice AI?
Sub-300ms end-to-end latency is the human-conversation threshold. Conversational linguistics research places the average human response gap at 200ms; gaps up to 500ms are within the natural range. Beyond 500ms, listeners register the pause. Beyond 1,500ms, they start to speak again — or hang up.
The practical production target is under 800ms at p95, with a p50 below 400ms. This is not a soft target — these numbers correlate directly with call completion rates and CSAT scores.
End-to-end latency in a voice AI agent is the sum of five contributors:
- Audio capture and VAD (voice activity detection) — 10–30ms
- STT transcription — 80–120ms with streaming
- LLM first-token latency — 150–250ms with low-latency models
- TTS first-audio-chunk — 60–100ms with streaming
- Network transport and jitter buffer — 20–60ms
Total target: 320–560ms. That is above the sub-300ms ideal but within the p95 target of 800ms, and it is achievable. The mistakes that push it over 1,000ms are predictable and avoidable.
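The budget arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the ranges are the ones quoted in the list, while the names `BUDGET_MS`, `total_budget`, and `over_budget` are illustrative and not part of any library.

```python
# Per-layer latency budget in milliseconds, as (best_case, worst_case).
# The figures are the ranges quoted in the list above.
BUDGET_MS = {
    "vad": (10, 30),
    "stt": (80, 120),
    "llm_first_token": (150, 250),
    "tts_first_chunk": (60, 100),
    "transport": (20, 60),
}


def total_budget(budget):
    """Return (best_case_total, worst_case_total) across all layers."""
    best = sum(low for low, _ in budget.values())
    worst = sum(high for _, high in budget.values())
    return best, worst


def over_budget(measured_ms, budget):
    """List the layers whose measured latency exceeds their worst-case allowance."""
    return [name for name, ms in measured_ms.items() if ms > budget[name][1]]


print(total_budget(BUDGET_MS))  # (320, 560)
```

Feeding `over_budget` the per-layer timings of a slow call shows which layer broke the budget, which is usually more useful than the total alone.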
Latency Budget by Layer
Voice Activity Detection (10–30ms)
VAD decides when the user has stopped speaking and the pipeline should fire. A misconfigured VAD is the single easiest way to add 500ms of latency without touching any model. Most implementations default to a trailing silence window of 500–800ms — that pause sits entirely in the user experience before a single API call fires.
In production, configure VAD with:
- Silence threshold: 300ms for call center contexts, 200ms for high-tempo applications
- Endpointing: fire on silence, not on a fixed timer
- Echo cancellation: required whenever the agent speaks; browser getUserMedia handles this with echoCancellation: true
Deepgram's streaming STT includes built-in VAD endpointing via endpointing=300 — use this rather than a separate VAD layer, as it eliminates an additional round-trip.
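For teams that do run their own VAD layer, silence-based endpointing can be sketched as follows. This is a simplified illustration, not Deepgram's implementation: the class name `SilenceEndpointer` and the 20ms frame size are assumptions, and the per-frame speech/non-speech decision is expected to come from whatever VAD model is in use.

```python
class SilenceEndpointer:
    """Fires once when trailing silence reaches the configured threshold."""

    def __init__(self, silence_threshold_ms=300, frame_ms=20):
        self.silence_threshold_ms = silence_threshold_ms
        self.frame_ms = frame_ms  # assumed frame duration
        self._heard_speech = False
        self._silence_ms = 0

    def push(self, is_speech):
        """Feed one frame's speech decision; return True at the endpoint."""
        if is_speech:
            self._heard_speech = True
            self._silence_ms = 0
            return False
        if not self._heard_speech:
            return False  # leading silence before the user starts talking
        self._silence_ms += self.frame_ms
        if self._silence_ms >= self.silence_threshold_ms:
            # Reset so the next utterance starts clean.
            self._heard_speech = False
            self._silence_ms = 0
            return True
        return False
```

Because the endpoint fires on accumulated trailing silence rather than a fixed timer, lowering `silence_threshold_ms` from 800 to 300 removes 500ms from every turn without touching any model.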
STT: Streaming Transcription (80–120ms)
Batch transcription — send audio, wait for full transcript — adds 600–1,200ms before your LLM call even starts. This alone makes sub-300ms unreachable. The solution is streaming STT with interim results.
Deepgram Nova-2 delivers streaming transcription with a first-word latency around 80ms over WebSocket. You do not wait for the complete transcript; you begin processing on is_final: true utterances:
    User audio → WebSocket → Deepgram Nova-2 (streaming)
                                  ↓
                      interim results (ignored)
                                  ↓
                   is_final: true → LLM pipeline fires
Critical configuration: punctuate=true, smart_format=true, and endpointing=300. Set endpointing explicitly: otherwise Deepgram falls back to its own default silence detection, which may not match your VAD window.
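A minimal sketch of that configuration, plus the filter that ignores interim results, might look like this. It assumes Deepgram's `wss://api.deepgram.com/v1/listen` streaming endpoint and its documented result shape (`is_final`, `channel.alternatives[0].transcript`). The `interim_results` flag is added here so interim messages are actually sent, and the function names are illustrative. Authentication and the WebSocket client itself are left out.

```python
import json
from urllib.parse import urlencode


def build_listen_url(endpointing_ms=300):
    """Build the streaming URL with the parameters discussed above."""
    params = {
        "model": "nova-2",
        "punctuate": "true",
        "smart_format": "true",
        "endpointing": str(endpointing_ms),
        "interim_results": "true",  # assumption: not listed in the text above
    }
    return "wss://api.deepgram.com/v1/listen?" + urlencode(params)


def final_transcript(raw_message):
    """Return the transcript for is_final results, or None for anything else."""
    msg = json.loads(raw_message)
    if not msg.get("is_final"):
        return None  # interim result or non-transcript message: ignore
    alternatives = msg.get("channel", {}).get("alternatives", [])
    text = alternatives[0].get("transcript", "") if alternatives else ""
    return text or None
```

The LLM pipeline would fire only when `final_transcript` returns a non-empty string, so empty finals produced by silence never trigger a model call.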
LLM Reasoning (150–250ms)
LLM first-token latency is the hardest constraint to optimize. GPT-4 in streaming mode cannot reliably hit sub-200ms first-token in typical network conditions. The model choices that achieve 150–250ms in practice:
- GPT-4o-mini — ~150ms first-token median; suitable for most voice turn completions
- GPT-4o — ~200–300ms first-token; higher quality for complex reasoning turns
- Claude Haiku 4.5 — ~120–180ms first-token; strong instruction-following, well-suited for structured voice turns
- Groq-hosted Llama — sub-100ms first-token via custom hardware; lower model quality ceiling
Stream the response. Pass tokens to TTS as they arrive — do not buffer the full LLM output before starting TTS synthesis. The overlap between LLM generation and TTS synthesis recovers 100–200ms of total latency.
Prompt engineering for voice: system prompts should be shorter than for text chatbots. Strip all markdown formatting instructions — the output goes to TTS and formatted text degrades audio. Keep total context under 2,000 tokens where possible; token count has a near-linear relationship with first-token latency.
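One way to pass tokens to TTS as they arrive is to flush the buffer at sentence boundaries, so synthesis starts on the first sentence while the model is still generating the rest. The sketch below is one possible chunking policy, not a prescribed one: `speakable_chunks` and its `min_chars` parameter are hypothetical names, and the token list stands in for a real streaming LLM response.

```python
import re

# A sentence boundary is end punctuation followed by whitespace. Requiring the
# whitespace avoids splitting "3.5" when "3." and "5" arrive as separate tokens.
_BOUNDARY = re.compile(r"[.!?]\s")


def speakable_chunks(tokens, min_chars=12):
    """Yield sentence-sized text chunks from a stream of LLM tokens."""
    buffer = ""
    for token in tokens:
        buffer += token
        match = None
        for match in _BOUNDARY.finditer(buffer):
            pass  # keep only the last boundary currently in the buffer
        if match and match.end() >= min_chars:
            yield buffer[:match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream ends


tokens = ["Sure", ",", " I", " can", " help", ".", " What", " day", " works", "?"]
print(list(speakable_chunks(tokens)))  # ['Sure, I can help.', 'What day works?']
```

Each yielded chunk would go straight to the streaming TTS call. A smaller `min_chars` starts audio sooner at the cost of choppier prosody.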
TTS: Streaming Synthesis (60–100ms)
ElevenLabs streaming delivers first-audio-chunk in 60–100ms on their Flash tier versus 200–400ms on standard. The difference is significant enough that choosing the wrong tier consumes your entire latency budget on TTS alone.
Use streaming TTS: do not wait for the complete audio file before playback. The client should begin playing as soon as the first audio chunk arrives. For browser clients, the Web Audio API handles chunked playback natively; for telephony, use RTP packetization.
The TTS configuration that matters for latency:
- Model: eleven_flash_v2_5 for minimum latency
- Streaming: set stream=true
- Output format: pcm_16000 for telephony, mp3_44100_128 for browser
- Streaming latency optimization: optimize_streaming_latency=4 (aggressive mode)
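As a rough illustration, the settings above could be assembled into a request like the one below. It assumes ElevenLabs' HTTP streaming endpoint (`/v1/text-to-speech/{voice_id}/stream`) and the `xi-api-key` header. The helper name `build_tts_request` is hypothetical, and the parameter names should be checked against the current API reference before use. The sketch only builds the request and does not send it.

```python
from urllib.parse import urlencode


def build_tts_request(voice_id, text, api_key, telephony=False):
    """Assemble URL, headers, and JSON body for a streaming synthesis call."""
    query = urlencode({
        # PCM for telephony paths, MP3 for browser playback (per the list above).
        "output_format": "pcm_16000" if telephony else "mp3_44100_128",
        "optimize_streaming_latency": 4,
    })
    return {
        "url": f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream?{query}",
        "headers": {"xi-api-key": api_key, "Content-Type": "application/json"},
        "json": {"text": text, "model_id": "eleven_flash_v2_5"},
    }
```

The returned dictionary can be handed to any HTTP client that supports streamed responses, with the audio chunks forwarded to the caller as they arrive.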
Network Transport (20–60ms)
With a well-configured WebRTC connection, transport adds 20–40ms round-trip. With a WebSocket-only approach through a distant cloud region, transport alone can add 200ms in the tail. This is where the transport choice has the most impact.
Full Stack Architecture
In the production architecture for a sub-300ms voice AI agent, the agent worker sits between the media plane and the model APIs. It receives raw audio frames from LiveKit, streams them to Deepgram, fires the LLM on final utterances, and pushes TTS audio frames back into the LiveKit room. The client never calls model APIs directly, which is essential for PII control and rate-limit management.
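The worker's role for a single turn can be sketched with stand-in stages. Everything here is illustrative: `run_turn`, `fake_llm`, and `fake_tts` are hypothetical names, and the fakes replace real provider calls so the sketch runs without any accounts.

```python
import asyncio


async def run_turn(transcript, llm_stream, tts_stream, publish_audio):
    """Handle one user turn: final transcript in, audio frames out."""
    async for text_chunk in llm_stream(transcript):
        # Synthesis starts on each chunk as it arrives, so the first audio
        # is published before the LLM has finished generating.
        async for audio in tts_stream(text_chunk):
            await publish_audio(audio)


# Stand-in stages (assumptions, not real provider clients).
async def fake_llm(transcript):
    for word in ("Booked", "for", "Tuesday."):
        await asyncio.sleep(0)
        yield word


async def fake_tts(text):
    yield text.encode("utf-8")


async def demo():
    published = []

    async def publish(frame):
        published.append(frame)

    await run_turn("book me tuesday", fake_llm, fake_tts, publish)
    return published
```

Because the provider calls are injected, the same `run_turn` can sit behind both a WebRTC room and a telephony bridge, and all model traffic originates from the worker rather than the client.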
ICE Trickle and the LiveKit SFU Pattern
Why ICE Trickle Matters
WebRTC connection establishment uses Interactive Connectivity Establishment (ICE) to find a network path between peers. In the naive implementation — wait for all ICE candidates before signaling — setup latency adds 500–2,000ms to every call start. This is invisible in demos and very visible in production.
ICE Trickle solves this: candidates are sent to the remote peer as they are gathered, and connectivity checks begin immediately. Call setup time drops to 100–400ms in most network conditions.
LiveKit implements ICE Trickle automatically. What you need to deploy:
- STUN servers: used for reflexive candidate discovery; stun.l.google.com:19302 works for most cases; deploy your own for HIPAA environments to keep traffic off third-party infrastructure
- TURN servers: required for clients behind symmetric NAT, common in enterprise networks; LiveKit's hosted tier includes TURN, or deploy coturn yourself
- Signaling: LiveKit's built-in signaling server handles offer/answer exchange; no separate WebSocket signaling server required
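For self-managed deployments, the STUN/TURN settings typically reach the client as a JSON document shaped like the browser's `RTCConfiguration`. The sketch below builds one on the server side. The helper name, the TURN host, and the credentials are placeholders, and ports 3478/5349 are the conventional TURN/TURNS defaults rather than anything specific to LiveKit, which normally supplies this configuration itself.

```python
def build_ice_servers(turn_host=None, turn_username=None, turn_credential=None,
                      stun_url="stun:stun.l.google.com:19302"):
    """Return a dict in the shape of the WebRTC RTCConfiguration dictionary."""
    servers = [{"urls": [stun_url]}]
    if turn_host:
        servers.append({
            "urls": [
                f"turn:{turn_host}:3478?transport=udp",
                f"turns:{turn_host}:5349?transport=tcp",  # TLS fallback for strict firewalls
            ],
            "username": turn_username,
            "credential": turn_credential,
        })
    return {"iceServers": servers}
```

In a HIPAA environment, `stun_url` would point at self-hosted infrastructure instead of the public default.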
LiveKit SFU Pattern
A Selective Forwarding Unit receives encoded media streams and forwards them to participants without decoding and re-encoding. For voice AI, this matters because:
- The agent worker receives RTP packets from the SFU rather than raw WebRTC — simpler to handle in server-side Python or Node.js code
- Multiple agents or observers can subscribe to the same audio stream without additional encoding cost
- The SFU handles DTLS/SRTP complexity; the agent sees plain RTP internally
The LiveKit room model maps cleanly to a voice call session:
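    from livekit import agents, rtc
    import asyncio

    async def entrypoint(ctx: agents.JobContext):
        await ctx.connect()

        # Room events are delivered through callbacks, not an async iterator.
        @ctx.room.on("track_subscribed")
        def on_track_subscribed(track, publication, participant):
            if track.kind == rtc.TrackKind.KIND_AUDIO:
                audio_stream = rtc.AudioStream(track)
                asyncio.create_task(process_audio(audio_stream, ctx.room))

    async def process_audio(stream: rtc.AudioStream, room: rtc.Room):
        # AudioStream yields events that wrap each decoded audio frame.
        async for event in stream:
            # `pipeline` is the application's own STT/LLM/TTS pipeline object.
            await pipeline.push_frame(event.frame)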
LiveKit's agent framework handles room lifecycle, track subscription, and RTP framing. Application code focuses on pipeline logic.
WebRTC vs SIP: Which Transport to Use
This is the question that trips up most teams evaluating voice AI infrastructure. They are not competing choices — they solve different integration problems.
| Dimension | WebRTC | SIP / PSTN |
|---|---|---|
| Client target | Browser, iOS, Android app | Phone numbers, legacy PBX, contact centers |
| Audio codec | Opus (wideband, adaptive bitrate) | G.711 μ-law / a-law (narrowband, 8kHz) |
| Setup latency | 100–400ms with ICE Trickle | 200–800ms SIP handshake |
| Infrastructure | STUN/TURN + SFU | SIP trunk + media gateway or SBC |
| NAT traversal | Built-in via ICE | Manual; requires SIP-aware NAT or Session Border Controller |
| PII surface | Controlled; traffic stays in your VPC via TURN relay | Carrier infrastructure; harder to scope a HIPAA BAA over |
| Telephony features | None native (DTMF via data channel) | Full: hold, transfer, DTMF, IVR integration |
Use WebRTC when you control the client — a web app, mobile app, or embedded SDK. It gives you wideband Opus audio (meaningfully better STT accuracy), lower setup latency, and direct control over the media path.
Use SIP when the caller is on a real phone number — inbound calls to a support line, outbound dialer campaigns, or integration with an existing contact center (Genesys, Five9, Twilio PSTN). Twilio's Media Streams provides a WebSocket bridge from PSTN to your agent worker, which avoids running a full SIP stack yourself.
The G.711 codec limitation of PSTN calls has an underappreciated consequence: STT accuracy on 8kHz narrowband audio is meaningfully lower than on 16kHz+ wideband. For healthcare or fintech agents where transcription accuracy directly affects outcomes, browser/mobile WebRTC with Opus gives a material accuracy advantage over telephone calls.
A production voice AI WebRTC architecture typically uses both: WebRTC for app callers and a SIP trunk or Twilio Media Streams for inbound phone calls, with the same agent worker behind both paths.
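Sharing one agent worker across both paths mostly comes down to telling the STT layer which audio format each transport delivers. A minimal sketch follows. It assumes the worker decodes WebRTC Opus to 16kHz linear PCM before streaming, and that the PSTN path delivers 8kHz μ-law as Twilio Media Streams does. The `encoding` and `sample_rate` values follow Deepgram's parameter naming, and the function name is illustrative.

```python
# Audio format each transport hands to the STT layer.
STT_AUDIO_PARAMS = {
    "webrtc": {"encoding": "linear16", "sample_rate": 16000},  # assumption: Opus decoded by the worker
    "pstn": {"encoding": "mulaw", "sample_rate": 8000},        # G.711 mu-law narrowband
}


def stt_params_for(transport):
    """Return STT audio parameters for a call arriving over the given transport."""
    try:
        return dict(STT_AUDIO_PARAMS[transport])
    except KeyError:
        raise ValueError(f"unknown transport: {transport!r}") from None
```

Tagging each call with its transport also makes it possible to compare transcription confidence between wideband and narrowband callers later.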
Observability: What to Instrument
Voice AI pipelines fail silently. A WebRTC ICE failure looks like a dropped call. A Deepgram WebSocket disconnect looks like the agent not hearing the user. A TTS timeout manifests as silence on the line. Without structured observability, every incident is a multi-hour debugging session across three services.
Instrument the following at minimum:
Per-call latency histogram — record wall-clock time from VAD endpoint event to first TTS audio chunk, broken down by component: stt_latency_ms, llm_first_token_ms, tts_first_chunk_ms. Alert on p95 > 800ms for any single component.
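The per-component alert can be sketched with a nearest-rank percentile over recorded samples. The component names are the ones given above. The helper names are illustrative, and a production system would more likely rely on its metrics backend's histogram support than on in-process lists.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def components_over_threshold(samples_by_component, pct=95, threshold_ms=800):
    """Return the components whose chosen percentile exceeds the threshold."""
    return sorted(
        name
        for name, samples in samples_by_component.items()
        if samples and percentile(samples, pct) > threshold_ms
    )
```

Running `components_over_threshold` over a window of recent calls yields the list of components to alert on.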
Per-call transcription confidence — Deepgram returns a confidence score per utterance. Log confidence distributions; a degradation in median confidence correlates with audio quality issues, codec mismatches, or background noise problems before callers start complaining.
WebRTC ICE connection state — log ICE state transitions (checking → connected → disconnected → failed). Track failed rates by client region. Elevated failure rates in a specific geography usually indicate TURN server coverage gaps.
STT WebSocket reconnections — Deepgram WebSocket connections drop under load or network events. Count reconnections per call. A call with 3+ reconnections will have visible transcription gaps; flag and review these separately.
LLM error rates — log 4xx/5xx rates from your LLM provider independently from total call failure. A 429 spike during peak hours needs a different response (add capacity, queue calls) than a 500 (inspect payloads, contact provider).
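Separating those failure modes starts with bucketing status codes before they are counted. A small sketch, with bucket names chosen for illustration:

```python
from collections import Counter


def classify_llm_error(status_code):
    """Map an HTTP status from the LLM provider to an alerting bucket."""
    if status_code == 429:
        return "rate_limited"    # add capacity or queue calls
    if 400 <= status_code < 500:
        return "client_error"    # inspect request payloads
    if 500 <= status_code < 600:
        return "provider_error"  # provider-side failure
    return "ok"


def error_counts(status_codes):
    """Count non-OK responses per bucket for a window of requests."""
    buckets = (classify_llm_error(code) for code in status_codes)
    return Counter(bucket for bucket in buckets if bucket != "ok")
```

Alerting on each bucket separately keeps a peak-hour 429 spike from being hidden inside a general failure rate.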
Use structured logging with a call_id field on every log event. Voice AI incidents always span Deepgram, your agent worker, and your SFU. Without a consistent call_id, joining those log lines across services is impossible.
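A minimal sketch of such a log line, using only the standard library. The field names other than `call_id` are illustrative, and a real deployment would route the line through its logging framework rather than `print`.

```python
import json
import time


def log_event(call_id, event, **fields):
    """Emit one JSON log line that always carries the call_id."""
    record = {"ts": time.time(), "call_id": call_id, "event": event, **fields}
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line
```

For example, `log_event("call-123", "stt_reconnect", attempt=2)` produces a line that can be joined with SFU and STT provider logs on `call_id`.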
Building This for Production
The full stack — streaming STT, low-latency LLM, streaming TTS, WebRTC via LiveKit SFU, structured observability, and pre-LLM PII redaction — is the minimum viable architecture for a voice AI agent that holds up in production. The demo that skips any of these layers will surface its gaps within the first 200 calls.
The latency numbers are achievable. Sub-400ms p50 is not theoretical — it is what Prodinit operates for a healthcare scheduling platform handling 2000+ calls per day in a HIPAA-covered environment. The architecture works because every layer is optimized independently and composed correctly.
Prodinit builds production voice AI platforms for healthcare, fintech, and B2B SaaS teams — from architecture design through production deployment. If you're evaluating the stack for your use case or need to move from prototype to production, explore our Custom AI Development service or book a 30-minute technical call.
Frequently Asked Questions
What latency is acceptable for a voice AI agent?
Sub-300ms end-to-end is the human-conversation threshold, the point where a response feels natural and immediate. In production, target a p50 below 400ms and a p95 below 800ms. Real-time voice AI latency above 1,500ms consistently degrades the conversational experience: callers interrupt, speak over the agent, or disengage entirely. The budget breaks down roughly as 80–120ms STT, 150–250ms LLM first-token, 60–100ms TTS first-chunk, and 20–60ms transport.
Should I use WebRTC or SIP for voice AI?
Use WebRTC when you control the client: a browser app, iOS app, or Android app. It gives you wideband Opus audio with better STT accuracy, ICE Trickle for fast call setup, and a controllable media path suitable for HIPAA environments. Use SIP when integrating with phone numbers, PSTN trunks, or existing contact center infrastructure. These are complementary: a production platform often uses WebRTC for app callers and a SIP trunk or Twilio Media Streams for inbound phone calls, with the same agent worker behind both.
What does the LiveKit SFU do in a voice AI stack?
LiveKit is a Selective Forwarding Unit, a media server that routes encoded audio streams between participants without decoding and re-encoding. For voice AI, the SFU handles DTLS/SRTP negotiation, ICE Trickle, and jitter buffering so your agent worker receives clean RTP packets rather than raw WebRTC frames. LiveKit's Python agent SDK integrates directly with the SFU room model, removing most WebRTC plumbing from application code and letting you focus on the pipeline logic instead.
How do I keep PII out of LLM provider logs?
Redact before the LLM call, not after. In a HIPAA or PCI context, the raw transcript should never reach an LLM API: redaction must happen in your agent worker before the text leaves your infrastructure. Use layered redaction: regex patterns for structured PII (SSN, credit card numbers, dates of birth) and an NER model for unstructured PHI (names, addresses). AWS Comprehend Medical is a managed option for healthcare; a self-hosted spaCy pipeline keeps data within your own VPC. Verify that your STUN, TURN, SFU, and model API calls are all covered under any HIPAA BAA you sign.
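The regex layer of that redaction can be sketched as follows. The patterns are deliberately simple illustrations (US-style SSNs, 13–16 digit card numbers, slash-separated dates) and would need hardening and testing before real PHI passes through them. The NER layer for names and addresses is not shown.

```python
import re

# Order matters: more specific patterns run first.
_PATTERNS = [
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("CARD", re.compile(r"\b\d(?:[ -]?\d){12,15}\b")),
    ("DOB", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
]


def redact(text):
    """Replace structured PII with placeholder tags before any LLM call."""
    for label, pattern in _PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


print(redact("My SSN is 123-45-6789, born 04/12/1985."))
# My SSN is [SSN], born [DOB].
```

Running `redact` on every final transcript inside the agent worker means only the tagged text ever leaves your infrastructure.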
How many concurrent calls can one agent worker handle?
A Python voice agent worker is IO-bound, not CPU-bound. With asyncio, a single process handles 20–50 concurrent calls before latency degrades, depending on LLM response times and network IO. Horizontal scaling is straightforward: add worker instances behind a load balancer, with LiveKit routing each room to the least-loaded worker. For a platform handling 2000+ calls per day, 3–5 worker instances with auto-scaling based on active call count provide adequate headroom for peak loads without over-provisioning at idle.
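A per-process concurrency cap can be sketched with an asyncio semaphore. The cap of 20 is the low end of the range above. `run_calls` is a hypothetical name, and the `asyncio.sleep` call stands in for awaiting STT, LLM, and TTS network IO.

```python
import asyncio


async def run_calls(call_count, max_concurrent=20):
    """Run simulated calls under a concurrency cap; return the peak active count."""
    limiter = asyncio.Semaphore(max_concurrent)
    active = 0
    peak = 0

    async def handle_call(call_number):
        nonlocal active, peak
        async with limiter:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # placeholder for provider network IO
            active -= 1

    await asyncio.gather(*(handle_call(n) for n in range(call_count)))
    return peak
```

Calls beyond the cap wait for a slot in this sketch. In a real deployment the worker would more likely report itself as full so new rooms are routed to another instance.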