How to Build Voice AI Under 500ms End-to-End
Voice AI · 10 min · 2025-11-01

A detailed breakdown of the streaming pipeline: Deepgram Nova-3 for STT, LLM with first-token streaming, ElevenLabs Flash for TTS, and how to pipeline them so the caller hears a response before the LLM finishes generating.

500ms is the threshold for "natural" conversation. Below it, a voice AI feels responsive. Above it, it feels like a lagging phone call — and users hang up.

Most voice AI tutorials show you how to get audio in and audio out. Almost none explain how to do it fast enough that the caller can't tell they're talking to a machine. This post does.

The latency budget

Every voice AI exchange has three irreducible components:

| Component | Typical range | Optimized range |
|---|---|---|
| Speech-to-text (STT) | 200–800ms | 100–200ms |
| LLM first token | 300–1500ms | 150–400ms |
| Text-to-speech (TTS) | 200–600ms | 80–150ms |
| Total (naive sequential) | 700–2900ms | |
| Total (streamed pipeline) | | 350–600ms |

The key word is streamed. In a naive sequential pipeline, you wait for STT to finish, then send to LLM, then send LLM output to TTS. End-to-end latency is the sum of all three.

In a properly pipelined streaming architecture, you overlap all three stages. The caller hears audio starting within ~350ms of finishing their sentence.
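The arithmetic behind the budget can be checked in a few lines. This is a rough sketch, not a measurement: the ranges are copied from the table above, and the helper name `sequential_total` is made up for illustration. Summing the typical column reproduces the naive sequential total. Summing the optimized column (each stage contributing only its time to first output) gives a loose envelope that contains the 350–600ms streamed figure.

```python
# Latency ranges in milliseconds, copied from the latency budget table.
TYPICAL = {"stt": (200, 800), "llm_first_token": (300, 1500), "tts": (200, 600)}
OPTIMIZED = {"stt": (100, 200), "llm_first_token": (150, 400), "tts": (80, 150)}


def sequential_total(stages):
    """Add up the (low, high) latency range of every stage."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


print(sequential_total(TYPICAL))    # (700, 2900)
print(sequential_total(OPTIMIZED))  # (330, 750)
```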

Component selection and benchmarks

STT: Deepgram Nova-3

Deepgram Nova-3 is currently the fastest production STT option for English and Arabic. Benchmarks from our production deployments:

| Metric | Value |
|---|---|
| Time-to-first-transcript (streaming) | 80–150ms |
| Word error rate (English) | ~3.5% |
| Word error rate (Arabic MSA) | ~8% |
| Latency at P95 | <200ms |
| Supported Arabic dialects | MSA, Gulf, Egyptian |

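Key Deepgram feature: endpointing. When a speaker pauses for ≥300ms (the endpointing threshold configured on the connection), Deepgram marks the next transcript result with speech_final set to true; this is the SpeechFinal signal referred to throughout this post. It is your trigger to send the accumulated transcript to the LLM. You do not wait for the stream to close; you process on pause detection.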
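The accumulate-then-flush logic can be isolated from the Deepgram SDK and tested on its own. The sketch below is a minimal illustration: the class name `UtteranceAccumulator` and its `add` method are hypothetical, and it assumes each transcript result carries an `is_final` flag and a `speech_final` flag.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class UtteranceAccumulator:
    """Collects finalized transcript segments until the speaker pauses."""

    segments: List[str] = field(default_factory=list)

    def add(self, text: str, is_final: bool, speech_final: bool) -> Optional[str]:
        # Interim (non-final) results are ignored; they will be re-sent as final.
        if is_final and text.strip():
            self.segments.append(text.strip())
        if not speech_final:
            return None
        # Pause detected: hand back the full utterance and reset.
        utterance = " ".join(self.segments)
        self.segments.clear()
        return utterance or None
```

A caller would pass every transcript result through `add` and start the LLM request only when it returns a non-empty string.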

LLM: Streaming first-token matters more than total speed

The LLM latency that matters for voice is time-to-first-token (TTFT), not the time to generate the full response. You start TTS synthesis as the first few words arrive.

| Model | TTFT (typical) | Notes |
|---|---|---|
| GPT-4o | 200–350ms | Best balance for voice |
| Claude 3 Haiku | 150–250ms | Fastest Anthropic model |
| Gemini 1.5 Flash | 150–300ms | Good multilingual |
| Local Qwen3.5 4B (6GB GPU) | 400–800ms | On-prem, acceptable for non-RT |

For sub-500ms voice, use GPT-4o or Claude Haiku with streaming enabled. Local models add 200–400ms to TTFT — viable for privacy-sensitive deployments with relaxed latency requirements.
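TTFT is worth measuring in your own environment rather than taking from a table. The helper below is a provider-agnostic sketch: `time_to_first_item` and `fake_llm_stream` are illustrative names, and the fake stream stands in for a real streaming LLM response. It times how long any async stream takes to yield its first non-empty item.

```python
import asyncio
import time
from typing import AsyncIterator, Optional, Tuple


async def time_to_first_item(stream: AsyncIterator[str]) -> Tuple[Optional[str], float]:
    """Return the first non-empty item and the elapsed time in milliseconds."""
    start = time.perf_counter()
    async for item in stream:
        if item:  # skip empty deltas that some APIs send first
            return item, (time.perf_counter() - start) * 1000.0
    return None, (time.perf_counter() - start) * 1000.0


async def fake_llm_stream(delay_s: float = 0.05) -> AsyncIterator[str]:
    """Stand-in for a streaming LLM: waits, then yields tokens."""
    await asyncio.sleep(delay_s)
    yield ""
    yield "Hello"
    await asyncio.sleep(delay_s)
    yield " there"


if __name__ == "__main__":
    token, ms = asyncio.run(time_to_first_item(fake_llm_stream()))
    print(f"first token {token!r} after {ms:.0f}ms")
```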

TTS: ElevenLabs Flash v2.5

ElevenLabs Flash v2.5 is purpose-built for realtime voice applications:

| Metric | Value |
|---|---|
| Time-to-first-audio-chunk | 75–120ms |
| Output latency (streaming) | ~80ms |
| Arabic voice quality | High (native voices available) |
| Supported output formats | PCM 16-bit, MP3, μ-law (Twilio) |

The key: use ElevenLabs' WebSocket streaming API, not the REST endpoint. The REST endpoint returns audio only after the full text is processed. The WebSocket streams audio chunks as the TTS model generates them, reducing time-to-first-audio from ~400ms to ~80ms.

The streaming pipeline architecture

Caller audio (Twilio/WebRTC)
    │
    ▼
[Deepgram WebSocket]
    │  Streams partial transcripts
    │  Fires SpeechFinal on pause
    │
    ▼
[FastAPI WebSocket server]
    │  Accumulates transcript
    │  On SpeechFinal: sends to LLM
    │
    ▼
[LLM (streaming)]
    │  First token arrives ~200ms
    │  Tokens stream as generated
    │
    ▼
[Token buffer] ──► [ElevenLabs WebSocket]
    │  Buffer until first sentence end   │  Streams audio chunks
    │  "." "!" "?" "..." triggers flush  │
    ▼                                    │
[Audio buffer]  ◄────────────────────────┘
    │
    ▼
Caller hears audio (~350-450ms after speaking)

The critical optimization: don't wait for the full LLM response to start TTS. Buffer LLM tokens until you have a complete sentence (detected by sentence-ending punctuation), then flush that sentence to ElevenLabs. The caller starts hearing the first sentence while the LLM is still generating the second.
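The sentence buffering step can be sketched independently of any LLM or TTS client. This is a minimal version under simple assumptions: the function name `sentences_from_tokens` is illustrative, and sentence ends are detected purely by trailing punctuation, so abbreviations ("Dr.") and decimals split across tokens ("3." then "5") will flush early.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed tokens into sentences, yielding each as soon as it ends."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Whatever is left when the stream ends is sent as a final fragment.
    if buffer.strip():
        yield buffer.strip()


tokens = ["Sure", ".", " Your", " table", " is", " booked", "!", " Anything", " else"]
print(list(sentences_from_tokens(tokens)))
# ['Sure.', 'Your table is booked!', 'Anything else']
```

Because the function is a generator, the first sentence is available to the TTS stage while later tokens are still arriving.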

Implementation

FastAPI WebSocket server core

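import asyncio
import os

from fastapi import FastAPI, WebSocket
from deepgram import DeepgramClient, LiveOptions, LiveTranscriptionEvents
from elevenlabs import ElevenLabs, VoiceSettings
from openai import AsyncOpenAI

# API keys are read from the environment (the variable names are assumptions)
DEEPGRAM_API_KEY = os.environ["DEEPGRAM_API_KEY"]
ELEVENLABS_API_KEY = os.environ["ELEVENLABS_API_KEY"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

app = FastAPI()
dg_client = DeepgramClient(api_key=DEEPGRAM_API_KEY)
el_client = ElevenLabs(api_key=ELEVENLABS_API_KEY)
oai_client = AsyncOpenAI(api_key=OPENAI_API_KEY)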

@app.websocket("/voice")
async def voice_handler(ws: WebSocket):
    await ws.accept()
    
    transcript_buffer = []
    
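    # Create the Deepgram streaming connection (SDK v3 style; check the
    # accessor name against the SDK version you have installed)
    dg_connection = dg_client.listen.asynclive.v("1")

    async def on_transcript(_client, result, **kwargs):
        # The SDK passes the connection object as the first positional argument
        sentence = result.channel.alternatives[0].transcript
        if result.is_final:
            transcript_buffer.append(sentence)
        if result.speech_final:
            # Full utterance detected: process it
            full_transcript = " ".join(transcript_buffer)
            transcript_buffer.clear()
            if full_transcript.strip():
                asyncio.create_task(
                    stream_response(ws, full_transcript)
                )

    # Register the handler before starting so no early results are missed
    dg_connection.on(LiveTranscriptionEvents.Transcript, on_transcript)

    # start() opens the socket and returns a success flag, not the connection
    started = await dg_connection.start(
        LiveOptions(
            model="nova-3",
            language="en-US",          # or "ar" for Arabic
            smart_format=True,
            endpointing=300,           # set speech_final after a 300ms pause
            vad_events=True,
        )
    )
    if not started:
        await ws.close()
        return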
    
    # Pipe incoming audio to Deepgram
    async for audio_chunk in ws.iter_bytes():
        await dg_connection.send(audio_chunk)

async def stream_response(ws: WebSocket, user_text: str):
    """LLM → sentence buffer → ElevenLabs → caller."""

    # Open ElevenLabs WebSocket.
    # NOTE: el_ws is treated as a socket-like object exposing send(), close()
    # and audio_chunks(); confirm this interface against your SDK version.
    el_ws = await el_client.text_to_speech.convert_realtime(
        voice_id="pNInz6obpgDQGcFmaJgB",  # your voice ID
        model_id="eleven_flash_v2_5",
        voice_settings=VoiceSettings(stability=0.5, similarity_boost=0.75),
        output_format="ulaw_8000",  # Twilio-compatible
    )

    async def forward_audio():
        # Stream TTS audio back to the caller as soon as chunks arrive
        async for audio_chunk in el_ws.audio_chunks():
            await ws.send_bytes(audio_chunk)

    # Start forwarding before the LLM call, so the first sentence plays
    # while later sentences are still being generated
    forward_task = asyncio.create_task(forward_audio())

    sentence_buffer = ""

    # Stream from LLM
    stream = await oai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",   "content": user_text},
        ],
        stream=True,
        max_tokens=200,       # keep responses concise for voice
        temperature=0.3,
    )

    async for chunk in stream:
        if not chunk.choices:
            continue
        token = chunk.choices[0].delta.content or ""
        sentence_buffer += token

        # Flush on sentence boundaries ("..." also ends with ".")
        if sentence_buffer.rstrip().endswith((".", "!", "?")):
            await el_ws.send(sentence_buffer)
            sentence_buffer = ""

    # Flush any remaining text
    if sentence_buffer.strip():
        await el_ws.send(sentence_buffer)

    # Signal end of input, then wait for the remaining audio to drain
    await el_ws.close()
    await forward_task

Twilio integration

For phone-based deployments, Twilio streams caller audio over WebSocket to your server via <Stream>:

<!-- TwiML response when call comes in -->
<Response>
  <Connect>
    <Stream url="wss://your-server.com/voice" />
  </Connect>
</Response>

Twilio streams 8kHz μ-law audio. Deepgram accepts it directly, provided the connection is opened with the encoding set to mulaw and the sample rate set to 8000. ElevenLabs TTS output should be configured to ulaw_8000 to match. This avoids a transcoding step that would add 10–30ms.
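One detail worth sketching: Twilio Media Streams do not send raw audio frames. Each WebSocket message is a JSON text frame, and audio arrives in "media" events as a base64-encoded payload. The server core above reads binary frames with `ws.iter_bytes()`, which suits a raw-audio client; for Twilio the loop would read text frames and decode them first. The helper names below are illustrative.

```python
import base64
import json
from typing import Optional


def decode_twilio_media(message: str) -> Optional[bytes]:
    """Extract raw mu-law bytes from a Twilio 'media' event, else None."""
    event = json.loads(message)
    if event.get("event") != "media":
        return None  # 'connected', 'start', 'stop', 'mark' carry no audio
    return base64.b64decode(event["media"]["payload"])


def encode_twilio_media(stream_sid: str, audio: bytes) -> str:
    """Wrap outbound mu-law bytes in the JSON envelope Twilio expects."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(audio).decode("ascii")},
    })
```

The `streamSid` needed for outbound audio is delivered in the "start" event at the beginning of the call, so the handler should capture it there.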

Bilingual (Arabic + English) setup

For MENA deployments, the pipeline handles Arabic natively with minor config changes:

# Deepgram: multilingual mode
LiveOptions(
    model="nova-3",
    language="multi",           # or "ar" for Arabic-only
    detect_language=True,       # auto-detect per utterance; documented for pre-recorded audio, so verify it applies to live streams
)

# ElevenLabs: Arabic voice
voice_id = "arabic_voice_id"   # use an Arabic-native voice model

# LLM: bilingual system prompt
SYSTEM_PROMPT = """You are a helpful assistant. 
Respond in the same language the user speaks.
If the user speaks Arabic, respond in Arabic.
If the user speaks English, respond in English.
Keep responses concise — under 40 words for voice."""

Arabic recognition accuracy with Deepgram Nova-3 is significantly better on MSA (Modern Standard Arabic) than on dialects. For Gulf clinic deployments, we tune the LLM prompt to handle code-switching (users switching mid-sentence between Arabic and English), which is extremely common in Gulf business contexts.
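If separate Arabic and English voices are used, something has to decide which voice speaks each response. One simple option, offered here as an assumption rather than the method used in these deployments, is to look at the script of the LLM's reply. The function names and voice IDs below are placeholders.

```python
def is_mostly_arabic(text: str, threshold: float = 0.5) -> bool:
    """True when at least `threshold` of the letters are in the Arabic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return arabic / len(letters) >= threshold


def pick_voice(text: str, arabic_voice_id: str, english_voice_id: str) -> str:
    """Choose a TTS voice ID based on the dominant script of the reply."""
    return arabic_voice_id if is_mostly_arabic(text) else english_voice_id
```

For heavily code-switched replies, a per-sentence decision (or a single multilingual voice) may sound more natural than a per-response one.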

Common latency killers to avoid

1. Using the REST API for Deepgram or ElevenLabs. The REST API processes audio synchronously and returns only when complete. Always use the WebSocket API for realtime applications.

2. Large LLM context windows. Every token in the system prompt adds to TTFT. Keep system prompts under 500 tokens. For voice, shorter is always better — the user can't read a long context.

3. Generating long responses. Voice responses should be 20–50 words maximum per turn. Cap max_tokens at 150–200. If the LLM generates a 400-word response, the caller hears dead air for 3+ seconds while TTS processes it all.

4. Sequential pipeline. If you're waiting for STT to finish → then LLM → then TTS, you're leaving 400–600ms on the table. Stream and overlap.

5. Audio encoding mismatch. Twilio uses 8kHz μ-law. If your TTS outputs 44.1kHz PCM, you're adding a transcoding step. Match your encoding to your telephony provider's native format.

Frequently asked questions

What's the minimum server spec for a production voice AI? A single 4-core server with 8GB RAM handles 15–20 concurrent voice sessions with GPT-4o as the LLM. The bottleneck is network I/O and LLM API latency, not CPU. For on-prem LLM (Qwen, Mistral), add a GPU server — a single A10G handles 8–10 concurrent voice sessions at acceptable latency.

How do you handle interruptions (barge-in)? With VAD events enabled, Deepgram fires a SpeechStarted event when a new speech segment begins. Monitor for speech while TTS is playing. When the caller speaks, cancel the current TTS stream and start a new LLM call with the new input. ElevenLabs WebSocket supports mid-stream cancellation.
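On the server side, barge-in mostly comes down to task cancellation. The sketch below shows one way to structure it with asyncio; `TurnManager` and its method names are hypothetical, and `respond` stands in for whatever coroutine runs the LLM and TTS for a turn.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class TurnManager:
    """Keeps at most one agent response running; new speech cancels it."""

    def __init__(self, respond: Callable[[str], Awaitable[None]]):
        self._respond = respond
        self._current: Optional[asyncio.Task] = None

    async def on_speech_started(self) -> None:
        # Caller began talking: stop whatever the agent is saying.
        task, self._current = self._current, None
        if task is not None and not task.done():
            task.cancel()
            # Wait for the cancellation to finish without re-raising it.
            await asyncio.gather(task, return_exceptions=True)

    async def on_utterance(self, text: str) -> None:
        # A complete utterance arrived: replace any running response.
        await self.on_speech_started()
        self._current = asyncio.create_task(self._respond(text))
```

The response coroutine should release its TTS connection in a `finally` block so that cancellation also stops audio that is already queued.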

Does this architecture scale? Yes. The WebSocket server is stateless per connection — scale horizontally behind a load balancer. Each server handles 15–20 concurrent sessions. At 100 concurrent calls, you need 5–7 server instances. Redis pub/sub can coordinate state if a call needs to be handed between servers.
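The instance count follows directly from the per-server capacity. A small sketch of the arithmetic, using the 15–20 sessions per server quoted above (the function name is illustrative):

```python
import math


def instances_needed(concurrent_calls: int, sessions_per_server: int) -> int:
    """Smallest number of servers that covers the given call volume."""
    return math.ceil(concurrent_calls / sessions_per_server)


print(instances_needed(100, 20))  # 5
print(instances_needed(100, 15))  # 7
```

In practice you would add headroom on top of this so that losing one instance does not drop calls.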

Can voice AI handle accented English? Deepgram Nova-3 has notably better accent robustness than competitor models, particularly for Gulf-accented English and Indian-accented English. Word error rates for Gulf-accented English average 5–8%, compared to 3–4% for native US English — acceptable for production use.
