How to Build Voice AI Under 500ms End-to-End
A detailed breakdown of the streaming pipeline: Deepgram Nova-3 for STT, LLM with first-token streaming, ElevenLabs Flash for TTS, and how to pipeline them so the caller hears a response before the LLM finishes generating.
500ms is the threshold for "natural" conversation. Below it, a voice AI feels responsive. Above it, it feels like a lagging phone call — and users hang up.
Most voice AI tutorials show you how to get audio in and audio out. Almost none explain how to do it fast enough that the caller can't tell they're talking to a machine. This post does.
The latency budget
Every voice AI exchange has three irreducible components:
| Component | Typical range | Optimized range |
|---|---|---|
| Speech-to-text (STT) | 200–800ms | 100–200ms |
| LLM first token | 300–1500ms | 150–400ms |
| Text-to-speech (TTS) | 200–600ms | 80–150ms |
| Total (naive sequential) | 700–2900ms | — |
| Total (streamed pipeline) | — | 350–600ms |
The key word is streamed. In a naive sequential pipeline, you wait for STT to finish, then send to LLM, then send LLM output to TTS. End-to-end latency is the sum of all three.
In a properly pipelined streaming architecture, you overlap all three stages. The caller hears audio starting within ~350ms of finishing their sentence.
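As a quick check on the table, the naive sequential total is just the sum of the three stages. Here is a minimal sketch; the dictionary and function names are illustrative, and the figures are the "typical range" column above.

```python
# Typical per-stage latency ranges in milliseconds, copied from the table above.
TYPICAL_MS = {
    "stt": (200, 800),
    "llm_first_token": (300, 1500),
    "tts": (200, 600),
}


def sequential_total(stages: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum best-case and worst-case latencies when stages run back to back."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


print(sequential_total(TYPICAL_MS))  # (700, 2900)
```

The streamed figure cannot be derived by a plain sum like this, because overlapping the stages hides part of each stage's latency behind the others.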
Component selection and benchmarks
STT: Deepgram Nova-3
Deepgram Nova-3 is currently the fastest production STT option for English and Arabic. Benchmarks from our production deployments:
| Metric | Value |
|---|---|
| Time-to-first-transcript (streaming) | 80–150ms |
| Word error rate (English) | ~3.5% |
| Word error rate (Arabic MSA) | ~8% |
| Latency at P95 | <200ms |
| Supported Arabic dialects | MSA, Gulf, Egyptian |
Key Deepgram feature: endpointing. With endpointing set to 300, Deepgram flags a transcript result with speech_final=true once the speaker has paused for roughly 300ms. This flag is your trigger to send the accumulated transcript to the LLM. You do not wait for the stream to close; you process on pause detection.
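The accumulate-then-trigger logic can be kept separate from the network code, which makes it easy to test. The sketch below assumes each Deepgram result has already been reduced to three values (transcript text, is_final, speech_final); the class name UtteranceAssembler is hypothetical and not part of any SDK.

```python
from typing import Optional


class UtteranceAssembler:
    """Collects finalized transcript segments until the speaker pauses."""

    def __init__(self) -> None:
        self._segments: list[str] = []

    def add(self, text: str, is_final: bool, speech_final: bool) -> Optional[str]:
        """Return the full utterance on speech_final, otherwise None."""
        # Interim results get revised later, so only keep finalized segments.
        if is_final and text.strip():
            self._segments.append(text.strip())
        if not speech_final:
            return None
        utterance = " ".join(self._segments)
        self._segments.clear()
        return utterance or None


assembler = UtteranceAssembler()
assembler.add("I'd like to", is_final=False, speech_final=False)  # interim, ignored
assembler.add("I'd like to book", is_final=True, speech_final=False)
print(assembler.add("an appointment.", is_final=True, speech_final=True))
# I'd like to book an appointment.
```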
LLM: Streaming first-token matters more than total speed
The LLM latency that matters for voice is time-to-first-token (TTFT), not the time to generate the full response. You start TTS synthesis as the first few words arrive.
| Model | TTFT (typical) | Notes |
|---|---|---|
| GPT-4o | 200–350ms | Best balance for voice |
| Claude 3 Haiku | 150–250ms | Fastest Anthropic model |
| Gemini 1.5 Flash | 150–300ms | Good multilingual |
| Local Qwen3.5 4B (6GB GPU) | 400–800ms | On-prem, acceptable for non-RT |
For sub-500ms voice, use GPT-4o or Claude Haiku with streaming enabled. Local models add 200–400ms to TTFT — viable for privacy-sensitive deployments with relaxed latency requirements.
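To compare models on the number that matters, measure TTFT yourself instead of relying on total generation time. The sketch below times the first non-empty token from any async token stream; fake_llm_stream is a stand-in so the example runs without an API key, and both function names are illustrative.

```python
import asyncio
import time
from typing import AsyncIterator, Optional


async def time_to_first_token(
    stream: AsyncIterator[str],
) -> tuple[Optional[float], str]:
    """Consume a token stream; return (seconds to first non-empty token, full text)."""
    start = time.perf_counter()
    ttft: Optional[float] = None
    parts: list[str] = []
    async for token in stream:
        if ttft is None and token:
            ttft = time.perf_counter() - start
        parts.append(token)
    return ttft, "".join(parts)


async def fake_llm_stream() -> AsyncIterator[str]:
    """Stand-in for a real streaming client: ~50ms to first token, then fast."""
    await asyncio.sleep(0.05)
    for token in ["Sure", ",", " one", " moment", "."]:
        yield token
        await asyncio.sleep(0.01)


ttft, text = asyncio.run(time_to_first_token(fake_llm_stream()))
print(f"TTFT: {ttft * 1000:.0f}ms, text: {text!r}")
```

With a real client, you would wrap the provider's streaming response in a small async generator that yields each text delta, then pass it to the same function.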
TTS: ElevenLabs Flash v2.5
ElevenLabs Flash v2.5 is purpose-built for realtime voice applications:
| Metric | Value |
|---|---|
| Time-to-first-audio-chunk | 75–120ms |
| Output latency (streaming) | ~80ms |
| Arabic voice quality | High (native voices available) |
| Supported output formats | PCM 16-bit, MP3, μ-law (Twilio) |
The key: use ElevenLabs' WebSocket streaming API, not the REST endpoint. The REST endpoint returns audio only after the full text is processed. The WebSocket streams audio chunks as the TTS model generates them, reducing time-to-first-audio from ~400ms to ~80ms.
The streaming pipeline architecture
Caller audio (Twilio/WebRTC)
    │
    ▼
[Deepgram WebSocket]
    │  Streams partial transcripts
    │  Sets speech_final on pause
    │
    ▼
[FastAPI WebSocket server]
    │  Accumulates transcript
    │  On speech_final: sends to LLM
    │
    ▼
[LLM (streaming)]
    │  First token arrives ~200ms
    │  Tokens stream as generated
    │
    ▼
[Token buffer] ──► [ElevenLabs WebSocket]
    │  Buffer until first sentence end  │  Streams audio chunks
    │  "." "!" "?" "..." triggers flush │
    ▼                                   │
[Audio buffer] ◄────────────────────────┘
    │
    ▼
Caller hears audio (~350-450ms after speaking)
The critical optimization: don't wait for the full LLM response to start TTS. Buffer LLM tokens until you have a complete sentence (detected by sentence-ending punctuation), then flush that sentence to ElevenLabs. The caller starts hearing the first sentence while the LLM is still generating the second.
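The sentence-boundary rule is small enough to isolate and test on its own. Here is a minimal sketch; the class name SentenceBuffer is hypothetical, and the token list stands in for LLM output.

```python
from typing import Optional


class SentenceBuffer:
    """Accumulates streamed tokens and releases one sentence at a time."""

    ENDINGS = (".", "!", "?")  # "..." also ends with ".", so it is covered

    def __init__(self) -> None:
        self._text = ""

    def push(self, token: str) -> Optional[str]:
        """Add a token; return the buffered sentence once it ends in punctuation."""
        self._text += token
        if self._text.rstrip().endswith(self.ENDINGS):
            sentence, self._text = self._text.strip(), ""
            return sentence
        return None

    def flush(self) -> Optional[str]:
        """Return any trailing text when the LLM stream ends."""
        rest, self._text = self._text.strip(), ""
        return rest or None


buf = SentenceBuffer()
tokens = ["Your", " appointment", " is", " confirmed", ".", " See", " you", " soon", "!", " Bye"]
for tok in tokens:
    sentence = buf.push(tok)
    if sentence:
        print("to TTS:", sentence)
print("to TTS:", buf.flush())
# to TTS: Your appointment is confirmed.
# to TTS: See you soon!
# to TTS: Bye
```

One caveat with this simple rule: if a token boundary falls right after the period in a decimal such as "3.5", the buffer flushes early. A stricter variant waits until whitespace follows the punctuation, at the cost of one extra token of delay.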
Implementation
FastAPI WebSocket server core
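import asyncio
import base64
import json
import os

import websockets
from deepgram import DeepgramClient, LiveOptions, LiveTranscriptionEvents
from fastapi import FastAPI, WebSocket
from openai import AsyncOpenAI

DEEPGRAM_API_KEY = os.environ["DEEPGRAM_API_KEY"]
ELEVENLABS_API_KEY = os.environ["ELEVENLABS_API_KEY"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
SYSTEM_PROMPT = "You are a concise voice assistant."  # placeholder prompt
VOICE_ID = "pNInz6obpgDQGcFmaJgB"  # your voice ID

app = FastAPI()
dg_client = DeepgramClient(api_key=DEEPGRAM_API_KEY)
oai_client = AsyncOpenAI(api_key=OPENAI_API_KEY)


@app.websocket("/voice")
async def voice_handler(ws: WebSocket):
    await ws.accept()
    transcript_buffer = []

    # Create the Deepgram streaming connection (deepgram-sdk v3 style)
    dg_connection = dg_client.listen.asynclive.v("1")

    # The SDK passes the connection itself as the first positional argument
    async def on_transcript(_conn, result, **kwargs):
        sentence = result.channel.alternatives[0].transcript
        if result.is_final:
            transcript_buffer.append(sentence)
        if result.speech_final:
            # Full utterance detected: process it
            full_transcript = " ".join(transcript_buffer)
            transcript_buffer.clear()
            if full_transcript.strip():
                asyncio.create_task(stream_response(ws, full_transcript))

    # Register the handler before starting the connection
    dg_connection.on(LiveTranscriptionEvents.Transcript, on_transcript)
    await dg_connection.start(
        LiveOptions(
            model="nova-3",
            language="en-US",  # or "ar" for Arabic
            smart_format=True,
            endpointing=300,  # set speech_final after a 300ms pause
            vad_events=True,
        )
    )

    # Pipe incoming binary audio frames to Deepgram
    try:
        async for audio_chunk in ws.iter_bytes():
            await dg_connection.send(audio_chunk)
    finally:
        await dg_connection.finish()


async def stream_response(ws: WebSocket, user_text: str):
    """LLM → sentence buffer → ElevenLabs → caller."""
    # ElevenLabs stream-input WebSocket endpoint, used directly via the
    # `websockets` package so sends and receives map onto the protocol.
    uri = (
        f"wss://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream-input"
        "?model_id=eleven_flash_v2_5&output_format=ulaw_8000"  # Twilio-compatible
    )
    async with websockets.connect(uri) as el_ws:
        # The first message carries the voice settings and the API key
        await el_ws.send(json.dumps({
            "text": " ",
            "voice_settings": {"stability": 0.5, "similarity_boost": 0.75},
            "xi_api_key": ELEVENLABS_API_KEY,
        }))

        async def flush_to_tts(text: str):
            # flush=True asks ElevenLabs to synthesize the buffered text now
            await el_ws.send(json.dumps({"text": text + " ", "flush": True}))

        async def forward_audio():
            # Stream TTS audio back to the caller as each chunk arrives
            async for message in el_ws:
                data = json.loads(message)
                if data.get("audio"):
                    await ws.send_bytes(base64.b64decode(data["audio"]))
                if data.get("isFinal"):
                    break

        # Forward audio concurrently, so the caller hears sentence one
        # while the LLM is still generating sentence two.
        audio_task = asyncio.create_task(forward_audio())

        # Stream from LLM
        stream = await oai_client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_text},
            ],
            stream=True,
            max_tokens=200,  # keep responses concise for voice
            temperature=0.3,
        )
        sentence_buffer = ""
        async for chunk in stream:
            if not chunk.choices:
                continue
            sentence_buffer += chunk.choices[0].delta.content or ""
            # Flush on sentence boundaries ("..." also ends with ".")
            if sentence_buffer.rstrip().endswith((".", "!", "?")):
                await flush_to_tts(sentence_buffer)
                sentence_buffer = ""

        # Flush any remaining text, then send the empty string that ends input
        if sentence_buffer.strip():
            await flush_to_tts(sentence_buffer)
        await el_ws.send(json.dumps({"text": ""}))
        await audio_task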
Twilio integration
For phone-based deployments, Twilio streams caller audio over WebSocket to your server via <Stream>:
<!-- TwiML response when call comes in -->
<Response>
  <Connect>
    <Stream url="wss://your-server.com/voice" />
  </Connect>
</Response>
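Twilio streams 8kHz μ-law audio, wrapped as base64 payloads inside JSON media messages. Deepgram accepts μ-law directly once the connection is opened with encoding="mulaw" and sample_rate=8000. ElevenLabs TTS output should be configured to ulaw_8000 to match. This avoids a transcoding step that would add 10–30ms.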
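Twilio Media Streams sends JSON text frames with a base64 audio payload, not raw binary frames, so a handler that reads binary audio needs a thin adapter in each direction. Here is a minimal sketch, assuming the message shape described in Twilio's Media Streams documentation; the function names and the "MZ123" stream SID are placeholders.

```python
import base64
import json
from typing import Optional


def decode_twilio_media(message: str) -> Optional[bytes]:
    """Return raw μ-law bytes from a Twilio 'media' message, else None."""
    data = json.loads(message)
    if data.get("event") != "media":
        return None  # 'connected', 'start', 'stop' and similar carry no audio
    return base64.b64decode(data["media"]["payload"])


def encode_twilio_media(stream_sid: str, audio: bytes) -> str:
    """Wrap raw μ-law bytes in the JSON envelope sent back to Twilio."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(audio).decode("ascii")},
    })


# Round trip: what we send out decodes back to the same bytes.
frame = encode_twilio_media("MZ123", b"\xff\x7f\x00")
print(decode_twilio_media(frame) == b"\xff\x7f\x00")  # True
```

In a FastAPI handler you would read text frames, pass each through decode_twilio_media before forwarding to Deepgram, and wrap TTS chunks with encode_twilio_media using the streamSid from Twilio's start message.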
Bilingual (Arabic + English) setup
For MENA deployments, the pipeline handles Arabic natively with minor config changes:
# Deepgram: Arabic language detection
LiveOptions(
    model="nova-3",
    language="multi",  # or "ar" for Arabic-only
    detect_language=True,  # auto-detect per utterance
)
# ElevenLabs: Arabic voice
voice_id = "arabic_voice_id" # use an Arabic-native voice model
# LLM: bilingual system prompt
SYSTEM_PROMPT = """You are a helpful assistant.
Respond in the same language the user speaks.
If the user speaks Arabic, respond in Arabic.
If the user speaks English, respond in English.
Keep responses concise — under 40 words for voice."""
Arabic transcription accuracy with Deepgram Nova-3 is significantly better on MSA (Modern Standard Arabic) than on dialects. For Gulf clinic deployments, we tune the LLM prompt to handle code-switching (users switching mid-sentence between Arabic and English), which is extremely common in Gulf business contexts.
Common latency killers to avoid
1. Using the REST API for Deepgram or ElevenLabs. The REST API processes audio synchronously and returns only when complete. Always use the WebSocket API for realtime applications.
2. Large LLM context windows. Every token in the system prompt adds to TTFT. Keep system prompts under 500 tokens. For voice, shorter is always better — the user can't read a long context.
3. Generating long responses. Voice responses should be 20–50 words maximum per turn. Cap max_tokens at 150–200. If the LLM generates a 400-word response, the caller hears dead air for 3+ seconds while TTS processes it all.
4. Sequential pipeline. If you're waiting for STT to finish → then LLM → then TTS, you're leaving 400–600ms on the table. Stream and overlap.
5. Audio encoding mismatch. Twilio uses 8kHz μ-law. If your TTS outputs 44kHz PCM, you're adding a transcoding step. Match your encoding to your telephony provider's native format.
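For item 3, capping max_tokens can cut a reply off mid-sentence. One complementary guard is to trim the text to whole sentences within a word budget before it reaches TTS. This is an illustrative sketch and not part of the pipeline above; trim_for_voice is a hypothetical helper.

```python
import re


def trim_for_voice(text: str, max_words: int = 50) -> str:
    """Keep whole sentences until the word budget is reached."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept: list[str] = []
    count = 0
    for sentence in sentences:
        words = len(sentence.split())
        # Always keep the first sentence, even if it alone exceeds the budget.
        if kept and count + words > max_words:
            break
        kept.append(sentence)
        count += words
    return " ".join(kept)


reply = "We open at nine. We close at five. Parking is free on weekends."
print(trim_for_voice(reply, max_words=8))
# We open at nine. We close at five.
```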
Frequently asked questions
What's the minimum server spec for a production voice AI? A single 4-core server with 8GB RAM handles 15–20 concurrent voice sessions with GPT-4o as the LLM. The bottleneck is network I/O and LLM API latency, not CPU. For on-prem LLM (Qwen, Mistral), add a GPU server — a single A10G handles 8–10 concurrent voice sessions at acceptable latency.
How do you handle interruptions (barge-in)? With vad_events enabled, Deepgram fires a SpeechStarted event when a new speech segment begins. Monitor for it while TTS is playing. When the caller speaks, cancel the current TTS stream and start a new LLM call with the new input. ElevenLabs WebSocket supports mid-stream cancellation.
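On the server side, barge-in comes down to keeping a handle on the in-flight response task and cancelling it when speech is detected. Here is a minimal asyncio sketch; BargeInController is a hypothetical name, and the speak coroutine stands in for the real LLM-to-TTS streaming task.

```python
import asyncio
from typing import Awaitable, Optional


class BargeInController:
    """Tracks the in-flight response task and cancels it when the caller speaks."""

    def __init__(self) -> None:
        self._current: Optional[asyncio.Future] = None

    def start_response(self, coro: Awaitable[None]) -> asyncio.Future:
        """Begin a response; any previous response is cancelled first."""
        self.interrupt()
        self._current = asyncio.ensure_future(coro)
        return self._current

    def interrupt(self) -> bool:
        """Call from the speech-detected handler. True if a response was cancelled."""
        task, self._current = self._current, None
        if task is not None and not task.done():
            task.cancel()
            return True
        return False


async def demo() -> list[str]:
    events: list[str] = []

    async def speak(label: str) -> None:
        try:
            events.append(f"{label}: started")
            await asyncio.sleep(1)  # stands in for streaming TTS audio
            events.append(f"{label}: finished")
        except asyncio.CancelledError:
            events.append(f"{label}: cancelled")
            raise

    controller = BargeInController()
    controller.start_response(speak("first"))
    await asyncio.sleep(0.05)  # caller starts talking mid-response
    controller.interrupt()
    await asyncio.sleep(0.01)  # give the cancellation time to propagate
    return events


print(asyncio.run(demo()))  # ['first: started', 'first: cancelled']
```

The cancelled coroutine is also the place to close the TTS connection, typically in a finally block or an async context manager, so no further audio is generated for the abandoned reply.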
Does this architecture scale? Yes. The WebSocket server is stateless per connection — scale horizontally behind a load balancer. Each server handles 15–20 concurrent sessions. At 100 concurrent calls, you need 5–7 server instances. Redis pub/sub can coordinate state if a call needs to be handed between servers.
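The instance count is a ceiling division of concurrent calls by per-server capacity. This quick sketch uses the 15–20 sessions-per-server figure above; the function name is illustrative.

```python
import math


def servers_needed(concurrent_calls: int, sessions_per_server: int) -> int:
    """Round up: a partially loaded server is still a whole server."""
    return math.ceil(concurrent_calls / sessions_per_server)


# 100 concurrent calls at the optimistic and conservative capacity figures
print(servers_needed(100, 20), servers_needed(100, 15))  # 5 7
```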
Can voice AI handle accented English? Deepgram Nova-3 has notably better accent robustness than competitor models, particularly for Gulf-accented English and Indian-accented English. Word error rates for Gulf-accented English average 5–8%, compared to 3–4% for native US English — acceptable for production use.