Scaling Voice AI to 1,000 Concurrent Calls: Integrating Deepgram Nova-3, ElevenLabs Flash, and WebRTC
Scaling real-time voice agents past a dozen concurrent calls causes massive latency spikes and audio jitter. Here is the production architecture to scale to 1,000 concurrent sessions using WebRTC, Deepgram Nova-3, and ElevenLabs Flash.
Most voice AI demos work fine when you are the only person testing them. You speak, wait 800ms, and a synthetic voice speaks back. But when you push that same architecture to 1,000 concurrent calls, the system collapses. Your server-side memory spikes, WebSocket connections drop, audio packets arrive out of order, and latency balloons from under a second to an unusable five seconds.
For SaaS founders and enterprise buyers, this failure represents a catastrophic business risk: lost customers, tarnished brand reputation, and wasted engineering capital spent on brittle pipelines that cannot handle actual market demand. While sub-500ms latency is solved for single users, scaling real-time voice pipelines to hundreds of concurrent enterprise sessions without audio jitter or WebSocket degradation is the primary engineering and cost bottleneck in 2026.
At Verel Systems, we rescue these failed systems. If you want to scale voice AI infrastructure to 1,000 concurrent calls without skyrocketing your cloud bills, you must abandon raw WebSockets, optimize your media pipeline, and select models designed for high-throughput, low-cost operations.
The Architectural Shift: WebRTC Peer Connections vs. Raw WebSockets
Raw WebSockets are a trap for real-time, high-concurrency audio. WebSockets run over TCP. TCP guarantees delivery but sacrifices timing; a single dropped packet stalls the entire stream while the network waits for retransmission. This is known as head-of-line blocking. In a live conversation, a 100ms delay to retransmit a packet of background noise causes audible stuttering, breaks the natural flow of the conversation, and risks immediate user abandonment.
WebRTC is the right choice for scaling real-time voice. It uses UDP, which prioritizes timely delivery over perfect packet retention. If a packet is lost, the codec interpolates the missing audio, and the stream continues without interruption.
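To make the difference concrete, here is a toy model of the two delivery disciplines. It is a sketch, not a network simulator: the 20 ms frame size is an assumption, the 100 ms retransmission delay is the figure used above, and both function names are hypothetical.

```python
from typing import List, Optional

FRAME_MS = 20        # assumed: one audio packet per 20 ms frame
RETRANSMIT_MS = 100  # retransmission delay from the example above


def tcp_delivery_times(n_packets: int, lost: int) -> List[int]:
    """In-order delivery: nothing after the lost packet is released until it arrives."""
    times: List[int] = []
    for seq in range(n_packets):
        arrival = seq * FRAME_MS
        if seq == lost:
            arrival += RETRANSMIT_MS
        # A packet cannot be handed to the app before every earlier packet has been.
        release = max(arrival, times[-1]) if times else arrival
        times.append(release)
    return times


def udp_delivery_times(n_packets: int, lost: int) -> List[Optional[int]]:
    """Unordered delivery: the lost packet is skipped (None), the rest arrive on time."""
    return [None if seq == lost else seq * FRAME_MS for seq in range(n_packets)]
```

With eight packets and packet 2 lost, the in-order model releases packets 2 through 7 together at 140 ms, which is the stall the listener hears. The unordered model delivers packet 3 at 60 ms as scheduled and leaves a single 20 ms gap for the codec to conceal.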
[WebRTC Client] --- (UDP / SRTP) ---> [Media Gateway (SFU)] ---> [FastAPI Worker]
                                                                        |
                                                    +-------------------+-------------------+
                                                    |                   |                   |
                                                    v                   v                   v
                                            [Deepgram Nova-3]     [LLM Engine]     [ElevenLabs Flash]
                                               (Streaming)        (vLLM/SGLang)        (Streaming)
By switching from WebSockets to WebRTC peer connections, we reduce server-side memory overhead by 40%. WebSocket servers must maintain in-memory TCP buffers for every active connection, which quickly exhausts RAM at 1,000 concurrent calls. WebRTC handles media packetization and jitter buffering inside the media stack at the gateway, offloading that state management from your application workers. For an enterprise handling thousands of customer support hours, this architectural shift isn't just about audio quality: it directly cuts your monthly cloud compute spend while shielding your brand from frustrating, dropped customer calls.
To scale WebRTC to 1,000 concurrent connections, you should deploy a dedicated media gateway like LiveKit or Janus as a Selective Forwarding Unit (SFU). This gateway terminates the WebRTC connections from your users and forwards the raw audio tracks to your AI worker pool via low-latency internal gRPC streams. This decouples the network-heavy connection management from the compute-heavy AI orchestration.
The Ultra-Low Latency Pipeline: Deepgram Nova-3 + ElevenLabs Flash
To keep end-to-end latency under 500ms while keeping operational costs sustainable, you must optimize every segment of the pipeline. In 2026, the optimal stack for high-concurrency voice agents pairs Deepgram Nova-3 for speech-to-text (STT) with ElevenLabs Flash for text-to-speech (TTS).
Deepgram Nova-3 achieves sub-50ms STT latency with 95% accuracy. It processes audio in real-time streaming chunks, returning transcripts as fast as the user speaks. ElevenLabs Flash solves the cost bottleneck of high-fidelity speech synthesis, reducing TTS generation costs to $0.015 per 1,000 characters, a fraction of the cost of legacy high-fidelity models. Assuming roughly 1,000 synthesized characters per minute of call time, that works out to about $15 per 1,000 minutes.
Moving from legacy TTS engines (costing ~$0.15 per 1,000 characters) or OpenAI's Realtime API to this integrated Nova-3 and ElevenLabs Flash stack reduces operational costs from $80+ per 1,000 minutes to just $20.70. For a SaaS platform running 500,000 minutes of voice calls per month, this optimization saves over $29,000 every single month in API fees alone, while preserving the sub-500ms response time critical for customer retention.
Here is how the latency budget and operational cost break down in a production-scale system:
| Pipeline Stage | Technology | Latency (ms) | Cost / 1,000 Min |
|---|---|---|---|
| Ingestion & VAD | WebRTC + Silero VAD | 30ms | $0.00 (Local) |
| Transcription (STT) | Deepgram Nova-3 | 45ms | $4.50 |
| Inference (LLM) | Qwen-2.5-7B-Instruct (SGLang) | 110ms | $1.20 (Self-hosted) |
| Synthesis (TTS) | ElevenLabs Flash | 130ms | $15.00 |
| Egress & Playback | WebRTC Jitter Buffer | 35ms | $0.00 (Local) |
| Total End-to-End | Integrated Stack | 350ms | $20.70 |
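The totals in the last row, and the monthly savings figure quoted earlier, can be re-derived from the per-stage numbers. The snippet below is only a sanity check on the table's arithmetic; the function names are hypothetical, and the $80 baseline is the low end of the "$80+" legacy figure.

```python
# Figures copied from the table: (stage, latency in ms, cost in USD per 1,000 minutes).
STAGES = [
    ("Ingestion & VAD", 30, 0.00),
    ("Transcription (STT)", 45, 4.50),
    ("Inference (LLM)", 110, 1.20),
    ("Synthesis (TTS)", 130, 15.00),
    ("Egress & Playback", 35, 0.00),
]

LATENCY_BUDGET_MS = 500
LEGACY_COST_PER_1K_MIN = 80.00  # low end of the "$80+" baseline


def total_latency_ms() -> int:
    return sum(latency for _, latency, _ in STAGES)


def cost_per_1k_minutes() -> float:
    return round(sum(cost for _, _, cost in STAGES), 2)


def monthly_savings(minutes_per_month: int) -> float:
    """Savings versus the legacy baseline at a given monthly call volume."""
    delta = LEGACY_COST_PER_1K_MIN - cost_per_1k_minutes()
    return round(delta * minutes_per_month / 1000, 2)
```

The stages sum to 350 ms, which leaves 150 ms of headroom under the 500 ms budget, and to $20.70 per 1,000 minutes. At 500,000 minutes per month the difference against the $80 baseline is $29,650.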
By running your LLM on an optimized engine like SGLang or vLLM, you can achieve a Time-to-First-Token (TTFT) of under 110ms for a 7B parameter model. When you combine this with the speed of Nova-3 and ElevenLabs Flash, the entire round-trip latency stays well below the 500ms threshold where human conversations begin to feel disjointed.
Always run Voice Activity Detection (VAD) on the client side or at the media gateway edge. Sending continuous silence or background noise to your STT engine wastes API credits and degrades server performance. Only stream audio to Deepgram when VAD detects active speech.
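A gate of this kind can be sketched as an async generator that sits between the audio source and the STT sender. Note the substitution: this example uses a plain RMS energy threshold rather than Silero VAD, purely so it runs without a model download. The threshold, the hangover length, and the function names are all assumptions to tune or replace.

```python
import math
import struct
from typing import AsyncIterator

RMS_THRESHOLD = 500.0  # assumed threshold for 16-bit PCM; tune per deployment
HANGOVER_FRAMES = 10   # assumed: keep streaming briefly after speech stops


def frame_rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a little-endian 16-bit mono PCM frame."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


async def gate_speech(frames: AsyncIterator[bytes]) -> AsyncIterator[bytes]:
    """Yield only frames that contain speech, plus a short hangover tail."""
    hangover = 0
    async for frame in frames:
        if frame_rms(frame) >= RMS_THRESHOLD:
            hangover = HANGOVER_FRAMES  # speech detected: re-arm the tail
            yield frame
        elif hangover > 0:
            hangover -= 1               # trailing quiet frames keep word endings intact
            yield frame
        # Otherwise the frame is silence and is never sent upstream.
```

The hangover keeps a few quiet frames after each burst of speech so that soft word endings are not clipped before they reach the transcriber. The output of `gate_speech(...)` can be passed wherever a raw audio stream would otherwise go.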
Production-Grade Python Code: Orchestrating the Media Pipeline
To handle 1,000 concurrent calls, your orchestration code must be fully asynchronous and non-blocking. Unoptimized orchestrators block threads, forcing you to over-provision servers and risk catastrophic system crashes during high-volume campaigns. The asynchronous pattern below ensures a single lightweight container can handle hundreds of concurrent streams, minimizing your baseline infrastructure cost while preventing audio degradation.
The following Python implementation uses asyncio to manage a real-time WebRTC audio stream, pipe the incoming audio to Deepgram Nova-3, stream the text tokens to an LLM, and stream the resulting audio chunks from ElevenLabs Flash back to the user.
import asyncio
import json
from typing import AsyncGenerator, Optional

import aiohttp

# Endpoints for the streaming pipeline
DEEPGRAM_URL = "wss://api.deepgram.com/v1/listen?encoding=linear16&sample_rate=16000&channels=1&model=nova-3"
LLM_API_URL = "http://localhost:8000/v1/chat/completions"  # Local SGLang instance


class VoiceAgentOrchestrator:
    def __init__(self, dg_api_key: str, el_api_key: str, voice_id: str = "21m00Tcm4TlvDq8ikWAM"):
        self.dg_api_key = dg_api_key
        self.el_api_key = el_api_key
        self.voice_id = voice_id
        self.session: Optional[aiohttp.ClientSession] = None

    async def get_session(self) -> aiohttp.ClientSession:
        # Lazily create one shared session so HTTP connections are pooled
        if self.session is None or self.session.closed:
            self.session = aiohttp.ClientSession()
        return self.session

    async def cleanup(self):
        if self.session and not self.session.closed:
            await self.session.close()

    async def stream_stt(self, audio_stream: AsyncGenerator[bytes, None]) -> AsyncGenerator[str, None]:
        """Streams raw audio chunks to Deepgram Nova-3 and yields final transcripts."""
        headers = {"Authorization": f"Token {self.dg_api_key}"}
        session = await self.get_session()
        async with session.ws_connect(DEEPGRAM_URL, headers=headers) as ws:

            async def sender():
                async for chunk in audio_stream:
                    await ws.send_bytes(chunk)
                # Signal end of audio so Deepgram flushes and closes the stream
                await ws.send_json({"type": "CloseStream"})

            # Send audio in the background while this generator reads transcripts
            sender_task = asyncio.create_task(sender())
            try:
                async for msg in ws:
                    if msg.type != aiohttp.WSMsgType.TEXT:
                        continue
                    data = json.loads(msg.data)
                    alternatives = data.get("channel", {}).get("alternatives") or [{}]
                    transcript = alternatives[0].get("transcript", "")
                    if transcript and data.get("is_final"):
                        yield transcript
                await sender_task
            finally:
                # Stop the sender if the consumer exits early or the socket drops
                sender_task.cancel()

    async def generate_llm_response(self, prompt: str) -> AsyncGenerator[str, None]:
        """Streams text tokens from a local high-throughput LLM engine."""
        payload = {
            "model": "qwen2.5-7b-instruct",
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,
        }
        session = await self.get_session()
        async with session.post(LLM_API_URL, json=payload) as response:
            response.raise_for_status()
            # The engine replies with server-sent events, one "data: {...}" line per token
            async for line in response.content:
                decoded = line.decode("utf-8").strip()
                if not decoded.startswith("data: "):
                    continue
                data_str = decoded[6:]
                if data_str == "[DONE]":
                    break
                try:
                    data = json.loads(data_str)
                except json.JSONDecodeError:
                    continue
                token = data["choices"][0]["delta"].get("content", "")
                if token:
                    yield token

    async def stream_tts(self, text_stream: AsyncGenerator[str, None]) -> AsyncGenerator[bytes, None]:
        """Streams text tokens to ElevenLabs Flash and yields MP3 audio bytes."""
        headers = {
            "xi-api-key": self.el_api_key,
            "Content-Type": "application/json",
            "accept": "audio/mpeg",
        }
        url = f"https://api.elevenlabs.io/v1/text-to-speech/{self.voice_id}/stream"
        session = await self.get_session()

        async def synthesize(sentence: str) -> AsyncGenerator[bytes, None]:
            payload = {
                "text": sentence,
                "model_id": "eleven_flash_v2_5",
                "voice_settings": {"stability": 0.5, "similarity_boost": 0.75},
            }
            async with session.post(url, json=payload, headers=headers) as response:
                if response.status == 200:
                    async for chunk in response.content.iter_chunked(4096):
                        yield chunk

        # The endpoint takes complete text, so buffer tokens into short sentences
        buffer = []
        async for token in text_stream:
            buffer.append(token)
            if token.endswith((".", "?", "!", "\n")) or len(buffer) > 10:
                sentence = "".join(buffer).strip()
                buffer.clear()
                if sentence:
                    async for chunk in synthesize(sentence):
                        yield chunk
        # Flush trailing tokens that never reached a sentence boundary
        sentence = "".join(buffer).strip()
        if sentence:
            async for chunk in synthesize(sentence):
                yield chunk
This orchestrator handles the critical data-flow transitions. It uses a chunking strategy to feed text to ElevenLabs Flash. Instead of waiting for the entire LLM response to complete, it groups tokens into short sentences or clauses and dispatches them immediately. This keeps the pipeline fluid and prevents the user from experiencing long silences during generation.
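The grouping rule is easier to reason about, and to unit-test, when pulled out of the network code. The following is a minimal standalone sketch of the same idea, with a hypothetical `chunk_tokens` helper. It uses the same boundary characters and ten-token cap described above, and it also emits any leftover text when the token stream ends without closing punctuation.

```python
from typing import Iterable, Iterator, List

SENTENCE_ENDINGS = (".", "?", "!", "\n")
MAX_TOKENS_PER_CHUNK = 10  # same cap as the orchestrator's buffer check


def chunk_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Group LLM tokens into short, speakable chunks for TTS."""
    buffer: List[str] = []
    for token in tokens:
        buffer.append(token)
        if token.endswith(SENTENCE_ENDINGS) or len(buffer) > MAX_TOKENS_PER_CHUNK:
            chunk = "".join(buffer).strip()
            buffer.clear()
            if chunk:
                yield chunk
    # Emit whatever is left when the stream ends mid-sentence.
    tail = "".join(buffer).strip()
    if tail:
        yield tail
```

For example, the tokens `["Hello", ",", " how", " can", " I", " help", "?", " Sure", " thing"]` produce two chunks: `"Hello, how can I help?"` as soon as the question mark arrives, and `"Sure thing"` when the stream ends.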
Mitigating State Synchronization and Jitter at Scale
When 1,000 users are talking to your agents simultaneously, the biggest challenges are network jitter and handling user interruptions. If a user interrupts the agent while it is speaking, you must immediately stop the outgoing audio stream.
In a naive architecture, the client-side player continues to play the buffered audio even after the user starts speaking again. This makes the agent feel slow and robotic. Beyond the poor user experience, it also wastes expensive API tokens generating speech the user will never listen to.
[WebRTC Data Channel] ---> (Interruption Signal) ---> [Media Gateway] ---> [Cancel Active TTS Task]
WebRTC data channels are perfect for this. When the client-side VAD detects that the user has started speaking:
- The client immediately stops local audio playback.
- The client sends a lightweight interruption signal (e.g., {"event": "user_interrupted"}) over the WebRTC data channel.
- The orchestration server receives this signal, cancels the active LLM generation task, and clears the ElevenLabs audio queue.
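On the server, the last step maps naturally onto asyncio task cancellation. The sketch below assumes the LLM and TTS work for the current reply runs inside one asyncio task and that outbound audio waits in an `asyncio.Queue`. `CallSession` and its method names are hypothetical, and the data channel transport itself is left out.

```python
import asyncio
import json
from typing import Coroutine, Optional


class CallSession:
    """Per-call state: the task producing speech and the audio not yet sent."""

    def __init__(self) -> None:
        self.audio_queue: asyncio.Queue = asyncio.Queue()
        self.speaking_task: Optional[asyncio.Task] = None

    def start_speaking(self, coro: Coroutine) -> None:
        self.speaking_task = asyncio.create_task(coro)

    async def handle_data_channel_message(self, raw: str) -> bool:
        """Cancel speech on an interruption signal. Returns True if one was handled."""
        try:
            message = json.loads(raw)
        except json.JSONDecodeError:
            return False
        if not isinstance(message, dict) or message.get("event") != "user_interrupted":
            return False

        task = self.speaking_task
        if task is not None and not task.done():
            task.cancel()
            # Wait for the task to unwind so open HTTP streams are closed first.
            await asyncio.gather(task, return_exceptions=True)

        # Drop audio that was generated but not yet sent to the client.
        while not self.audio_queue.empty():
            self.audio_queue.get_nowait()
        return True
```

Cancelling the task raises `asyncio.CancelledError` at its current `await`, so any `async with` blocks inside it close their connections on the way out. That is what stops further TTS requests from being made for speech the user will never hear.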
To prevent state synchronization from bottlenecking your database, avoid writing every single audio chunk or transcript to a persistent database like PostgreSQL during the call. Instead, maintain the active call state in an in-memory Redis cache, then flush the final transcript and call metrics to your database asynchronously once the call ends. This protects your storage layer from write-heavy performance degradation.
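The write pattern looks roughly like this. To keep the example runnable without external services, both stores are in-memory stand-ins: `InMemoryCallStore` plays the role of a Redis list per call, and `FakeDatabase` plays the role of PostgreSQL. Every class and function name here is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class InMemoryCallStore:
    """Stand-in for a per-call Redis list (append during the call, read once at the end)."""

    def __init__(self) -> None:
        self._lists: Dict[str, List[str]] = defaultdict(list)

    async def append(self, call_id: str, line: str) -> None:
        self._lists[call_id].append(line)

    async def pop_all(self, call_id: str) -> List[str]:
        return self._lists.pop(call_id, [])


class FakeDatabase:
    """Stand-in for PostgreSQL; records one row per finished call."""

    def __init__(self) -> None:
        self.rows: List[Tuple[str, str]] = []

    async def insert_transcript(self, call_id: str, transcript: str) -> None:
        self.rows.append((call_id, transcript))


async def on_transcript(store: InMemoryCallStore, call_id: str, text: str) -> None:
    # Hot path: touch only the cache while the call is live.
    await store.append(call_id, text)


async def on_hangup(store: InMemoryCallStore, db: FakeDatabase, call_id: str) -> None:
    # Cold path: a single database write per call, after the audio has stopped.
    lines = await store.pop_all(call_id)
    await db.insert_transcript(call_id, "\n".join(lines))
```

However many utterances a call contains, the database sees exactly one insert for it, and that insert happens off the latency-critical path.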
FAQ: Scaling Voice Infrastructure in Production
Q: Why not use OpenAI's Realtime API for high-concurrency systems?
A: Cost and control. The OpenAI Realtime API is excellent for fast prototyping, but at scale, it is prohibitively expensive. It charges high rates for both input and output audio tokens. By decoupling your pipeline into Deepgram Nova-3, a self-hosted LLM, and ElevenLabs Flash, you reduce your production costs by up to 75% while retaining full control over your model parameters, system prompts, and custom voice configurations.
Q: What is the ROI of building a custom WebRTC pipeline versus using a commercial out-of-the-box voice platform?
A: While commercial SaaS voice wrappers get you to market in days, they typically charge a 100% to 300% markup on top of raw API costs (often costing $0.15 to $0.30 per minute). By building a custom, owned pipeline using WebRTC, Deepgram, and ElevenLabs Flash, your cost drops to ~$0.02 per minute. If your platform processes 100,000 minutes a month, that difference of $0.13 to $0.28 per minute saves you roughly $13,000 to $28,000 monthly, meaning the development cost is fully amortized in just a few months.
Q: How do we prevent audio jitter when the server is under heavy load?
A: You must isolate your audio streaming workers from your heavy compute tasks. Never run LLM inference on the same virtual machine that handles WebRTC audio routing. Use a message broker or gRPC to offload inference to dedicated GPU instances (such as Modal or specialized runtimes), leaving your CPU-bound WebRTC workers free to stream audio packets without context-switching delays.
Q: Can WebRTC handle poor network conditions on the user's end?
A: Yes. WebRTC pairs the Opus audio codec with bandwidth estimation (BWE) and dynamic bitrate adaptation. If a user's cellular network connection degrades, WebRTC automatically lowers the audio bitrate, and Opus conceals isolated packet losses, so the call stays continuous and real-time instead of suffering the abrupt stalls and disconnects common with standard WebSocket connections.
Q: How do you handle session state if an orchestrator node crashes?
A: By keeping our orchestrator workers completely stateless. All session metadata, conversation history, and call states are mirrored in a highly available Redis cluster. If a specific worker node fails, the media gateway automatically reroutes the WebRTC stream to a healthy worker, which pulls the active session state from Redis and resumes the call in less than 100ms.
If your current voice AI pilot is struggling with latency spikes, audio crackling, or high infrastructure costs, you have hit the limits of demo-grade engineering. We take these fragile setups and rebuild them into hardened, production-grade systems that scale.