RAG · 10 min · 2026-04-15

On-Prem LLM Speed: How to Get 3× More Throughput Without Buying New Hardware

If your self-hosted LLM feels slow, the bottleneck is almost never the model. It's the serving stack around it. The right inference engine alone can triple your throughput. Here's the hierarchy of levers, with real benchmark numbers.

The conversation about on-premise AI usually focuses on privacy and control. Rarely on what it takes to make it fast enough that users don't complain.

Here's the uncomfortable truth about most on-prem LLM deployments: the hardware is not the bottleneck. The serving stack is. Teams spend $20,000 on a GPU server, run Ollama on it, and get throughput numbers that would embarrass a 2022 API call. Then they conclude that on-prem AI "isn't production-ready."

It is. They just picked the wrong engine.

The lever that matters most: inference engine selection

This is the biggest single variable in on-prem LLM performance, and it's often the last thing teams consider.

Benchmarks on identical hardware with the same model tell the real story. On an H200 server:

Engine	Throughput	Relative performance
SGLang	2,688 tok/s	Fastest
vLLM	2,021 tok/s	25% slower than SGLang (SGLang is 33% faster)
Ollama	~400–600 tok/s	Convenience layer, not production
llama.cpp	~200–400 tok/s	CPU-optimized, not server-grade

Same hardware. Same model. The engine choice alone is a roughly 4.5–6.7× difference between Ollama (what most teams start with) and SGLang.
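The ratios can be checked directly from the table's throughput figures. The sketch below is plain arithmetic on those numbers; the function names are illustrative and not part of any benchmark tooling.

```python
# Throughput figures (tokens/second) from the table above; ranges are (low, high).
throughput = {
    "SGLang": (2688, 2688),
    "vLLM": (2021, 2021),
    "Ollama": (400, 600),
    "llama.cpp": (200, 400),
}


def speedup_range(fast: str, slow: str) -> tuple[float, float]:
    """Return the (min, max) speedup of `fast` over `slow`."""
    f_lo, f_hi = throughput[fast]
    s_lo, s_hi = throughput[slow]
    return f_lo / s_hi, f_hi / s_lo


def percent_slower(slow: str, fast: str) -> float:
    """How far `slow` falls below `fast`, as a percentage of `fast` (low-end values)."""
    return (1 - throughput[slow][0] / throughput[fast][0]) * 100


if __name__ == "__main__":
    lo, hi = speedup_range("SGLang", "Ollama")
    print(f"SGLang vs Ollama: {lo:.1f}x to {hi:.1f}x")
    print(f"SGLang vs vLLM: {speedup_range('SGLang', 'vLLM')[0]:.2f}x")
    print(f"vLLM trails SGLang by {percent_slower('vLLM', 'SGLang'):.0f}%")
```

Note that "X% faster" and "X% slower" are not symmetric: a 1.33× speedup means the slower engine delivers about 75% of the faster one's throughput.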

On a more accessible 4×RTX 3090 setup running Mistral-Large 123B at AWQ Q4 quantization, SGLang with NVLink and torch.compile reaches 37 tokens/second sustained — 3.1× faster than TabbyAPI on identical hardware, and 15–18% faster than vLLM.

The practical rule: Ollama is excellent for local development and personal use. For production serving with more than 5 concurrent users, use TensorRT-LLM (NVIDIA hardware, maximum speed), SGLang (multi-turn conversations, RAG pipelines, structured output), or vLLM (general-purpose, best ecosystem). Never Ollama.

Why bare transformer inference collapses under concurrency

When you load a model with the standard transformers library and call model.generate() once per request, requests run one at a time. Request 2 waits for request 1 to complete. At 15 concurrent users, the last request in the queue waits for the other 14 to finish.

Production inference engines solve this with continuous batching: instead of waiting for a request to finish before starting the next, they fill idle GPU cycles with other requests' token generation. The result is throughput that scales with load rather than against it.
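To make the scheduling difference concrete, here is a toy simulation that counts decode iterations under both policies. It is a sketch, not how vLLM or SGLang are implemented: it assumes every iteration costs the same regardless of batch size (real step time grows somewhat with batch size) and ignores prefill entirely. The function names and the max_batch parameter are illustrative.

```python
from collections import deque


def sequential_steps(lengths: list[int]) -> int:
    """Decode iterations when requests run strictly one after another."""
    return sum(lengths)


def continuous_batching_steps(lengths: list[int], max_batch: int) -> int:
    """Decode iterations when freed batch slots are refilled at every iteration."""
    waiting = deque(lengths)
    running: list[int] = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free batch slots.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode iteration: every in-flight request emits one token,
        # and requests that just finished leave the batch immediately.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


if __name__ == "__main__":
    requests = [100] * 15  # 15 users, 100 output tokens each
    print("sequential:", sequential_steps(requests))
    print("continuous:", continuous_batching_steps(requests, max_batch=16))
```

Under these simplifying assumptions, 15 requests of 100 tokens take 1,500 iterations sequentially and 100 with a batch of 16. Real gains are smaller because batched iterations are not free, but the shape of the effect is the same.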

Combined with PagedAttention (paging the KV cache like virtual memory, so GPU memory is used efficiently across requests), a properly configured vLLM or SGLang instance handles 10–20× more concurrent load than naive model serving on the same hardware.
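The paging idea can be sketched as a small block allocator. This is a toy model of the bookkeeping only (no tensors, no attention kernel), and the class name, method names, and block size are assumptions made for illustration rather than any engine's actual API.

```python
class PagedKVCache:
    """Toy block allocator: each sequence owns a list of fixed-size KV blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq_id -> physical block ids
        self.lengths: dict[str, int] = {}             # seq_id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, allocating a block only when needed."""
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # current block is full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; request must wait or be preempted")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because memory is handed out one block at a time, a short request never reserves space for the maximum context length, which is what lets many more requests share the same GPU.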

This is the architecture difference between "it works for 5 internal users" and "it handles 200 concurrent users without degrading."

Quantization: the cost-to-quality tradeoff in production

Quantization reduces model weight precision, shrinking VRAM requirements and increasing inference speed at the cost of some quality. In 2026, MXFP4 and AWQ Q4 are the production sweet spots.

Quantization	VRAM (14B model)	Speed	Quality loss
BF16 (full)	~28 GB	Baseline	None
FP8	~14 GB	1.3–1.5×	Minimal
AWQ Q4	~8 GB	1.8–2.2×	Low
MXFP4	~7 GB	2.0–2.5×	Low
GGUF Q4_K_M	~8 GB	Variable	Low
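The VRAM column follows from simple arithmetic on parameter count and bits per weight. The sketch below covers weights only. It ignores KV cache, activations, and the per-group scale metadata that 4-bit formats carry, which is presumably why the table's 4-bit rows sit slightly above the raw 7 GB figure (that reading is an assumption, not something the benchmarks state).

```python
def weight_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    for name, bits in [("BF16", 16), ("FP8", 8), ("4-bit", 4)]:
        print(f"{name}: {weight_vram_gb(14, bits):.0f} GB")
```

Whatever VRAM the weights no longer occupy becomes available for KV cache, which is what raises the number of concurrent requests a card can hold.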

For RAG deployments specifically, quantization quality loss is less impactful than in open-ended generation, because the model's job is primarily to synthesize retrieved context rather than generate from training knowledge. A quantized model performing well on retrieval-grounded answering is the norm, not the exception. We've run AWQ Q4 in production Gulf enterprise deployments with no complaints about answer quality.

The business implication: a server that runs a 14B BF16 model for 5 concurrent users can, with AWQ Q4, run the same model for 12–15 concurrent users. No new hardware. Just a quantization format change.

Prompt size: the optimization that beats everything else

This is the counterintuitive one. Before you tune any serving parameter or buy any hardware, look at your average prompt length.

One production deployment with 100,000+ users reduced median query latency from 9.87 seconds to 0.71 seconds (a reduction of roughly 93%) by switching from full-context RAG (passing 15 retrieved chunks totaling ~12,000 tokens) to selective retrieval (passing the 3 most relevant chunks, ~2,000 tokens).

No model change. No hardware upgrade. No serving configuration change. Just passing less text to the LLM.

Smaller prompts mean faster prefill, lower TTFT (time to first token), and dramatically reduced cost. For cloud LLM APIs, this also directly reduces your per-call spend.

The practical implementation: instead of retrieving the top-10 chunks and passing all of them, retrieve the top-20 and use a reranker to select the top-3 or top-4 with high confidence scores. Pass only those. Your LLM gets cleaner context, users get faster responses, and your token bill drops.
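A minimal sketch of that retrieve-wide, pass-narrow step might look like the following. The score_fn parameter stands in for whatever reranker you use (typically a cross-encoder), and overlap_score is a deliberately crude placeholder so the example runs without a model. The function names and the 0.5 threshold are illustrative assumptions.

```python
from typing import Callable


def select_context(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 3,
    min_score: float = 0.5,
) -> list[str]:
    """Rerank retrieved chunks and keep only the best few above a score floor."""
    scored = [(score_fn(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in scored[:top_k] if score >= min_score]


def overlap_score(query: str, chunk: str) -> float:
    """Placeholder scorer: fraction of query words that appear in the chunk."""
    q_words = set(query.lower().split())
    return len(q_words & set(chunk.lower().split())) / max(len(q_words), 1)
```

The score floor matters as much as top_k: if only two chunks are actually relevant, passing two is better than padding the prompt with a weak third.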

Speculative decoding: the least-known speed multiplier

Speculative decoding is one of the less-discussed speed improvements with real production numbers behind it. The idea: use a small, fast "draft" model to generate candidate tokens, then verify them in parallel with your main model. Accepted tokens are essentially free.
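The draft-then-verify loop can be sketched with two toy "models" represented as greedy next-token functions. This illustrates the greedy variant only and is not how SGLang or vLLM implement it: a real engine verifies all draft positions in one batched forward pass of the main model, whereas the loop below only mimics the outcome. All names here are illustrative.

```python
from typing import Callable

NextToken = Callable[[list[int]], int]  # greedy next-token function


def speculative_step(context: list[int], draft: NextToken, target: NextToken, k: int = 4) -> list[int]:
    """Return the tokens accepted in one draft-then-verify round."""
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposed: list[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))
    # 2. The target checks each position in order.
    accepted: list[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token, stop
            return accepted
        accepted.append(tok)
    accepted.append(target(context + accepted))  # all matched: one bonus token
    return accepted


def generate(context: list[int], draft: NextToken, target: NextToken,
             max_new_tokens: int, k: int = 4) -> list[int]:
    """Generate with repeated speculative rounds; output matches target-only greedy decoding."""
    out = list(context)
    while len(out) - len(context) < max_new_tokens:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(context) + max_new_tokens]
```

Every round yields at least one token that the main model would have produced anyway, so output quality is unchanged. The speedup depends entirely on how often the draft's guesses match.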

On an RTX 3090 running Qwen2.5-Coder-32B, speculative decoding using a 7B draft model raised throughput from 34.78 tok/s to 51.31 tok/s, a 47% improvement on the same hardware. On a P40 (an older datacenter card), throughput rose from 10.54 tok/s to 17.11 tok/s, a 62% improvement.

EAGLE3 in SGLang is currently the fastest implementation, but requires a trained draft model. The draft model training is a one-time cost that pays back quickly on high-throughput deployments.

For most teams: implement speculative decoding when you're running a fixed model in production with stable query patterns (i.e., you can verify that acceptance rates are high enough to net positive throughput). It's not worth the complexity for infrequently used models or highly variable query types.
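Whether acceptance rates are "high enough" can be estimated before committing to the work. The sketch below uses the standard back-of-envelope model from the speculative decoding literature. It assumes each draft token is accepted independently with the same probability, and that one draft forward pass costs a fixed fraction of one main-model pass. Both are simplifications, and the parameter names are illustrative.

```python
def expected_speedup(acceptance_rate: float, draft_tokens: int, draft_cost_ratio: float) -> float:
    """Estimated speedup over plain decoding for one draft-then-verify round.

    acceptance_rate:  probability that a single draft token is accepted (0..1)
    draft_tokens:     tokens the draft model proposes per round
    draft_cost_ratio: cost of one draft pass relative to one main-model pass
    """
    a, g, c = acceptance_rate, draft_tokens, draft_cost_ratio
    # Expected tokens produced per round, including the target's own token.
    tokens_per_round = (g + 1) if a == 1.0 else (1 - a ** (g + 1)) / (1 - a)
    # Cost per round in units of one main-model forward pass.
    cost_per_round = g * c + 1
    return tokens_per_round / cost_per_round


if __name__ == "__main__":
    for rate in (0.3, 0.6, 0.8):
        print(f"acceptance {rate:.0%}: {expected_speedup(rate, 4, 0.2):.2f}x")
```

A result below 1.0 means speculative decoding would slow you down under these assumptions, which is the failure mode for highly variable query types.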

Parallelism: how to scale across multiple GPUs correctly

If a model fits on one GPU, replicate it (data parallelism). If it doesn't fit, shard it (tensor parallelism).

This sounds obvious, but the nuance matters. One benchmark: SGLang with --dp 2 (two replicas) served 150% more requests than vLLM with --tensor-parallel-size 2 (tensor parallel across 2 GPUs) on the same 2-GPU box. Replication, when the model fits, beats sharding for throughput.

For models that don't fit on one GPU, NVLink between GPUs is worth the premium. On a 4×RTX 3090 setup, NVLink connectivity improved throughput by 12.5% over PCIe Gen4 alone — meaningful at scale.
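The replicate-if-it-fits, shard-if-it-doesn't rule can be written down as a small planning helper. This is a sketch under stated assumptions: identical GPUs, power-of-two tensor-parallel sizes, and a headroom fraction (0.8 here, an arbitrary illustrative value) reserving VRAM for KV cache and activations. The function name and return format are hypothetical, not an engine API.

```python
def plan_parallelism(model_gb: float, gpu_gb: float, num_gpus: int, headroom: float = 0.8) -> dict[str, int]:
    """Pick tensor-parallel size (tp) and replica count (dp) for a homogeneous GPU box."""
    usable = gpu_gb * headroom  # leave room for KV cache and activations
    tp = 1
    while model_gb / tp > usable:  # shard further until each slice fits
        tp *= 2
        if tp > num_gpus:
            raise ValueError("model does not fit on this machine")
    return {"tp": tp, "dp": num_gpus // tp}


if __name__ == "__main__":
    print(plan_parallelism(model_gb=8, gpu_gb=24, num_gpus=2))   # small model: replicate
    print(plan_parallelism(model_gb=70, gpu_gb=24, num_gpus=4))  # large model: shard
```

The helper shards only as far as needed and spends the remaining GPUs on replicas, which matches the benchmark's finding that replication wins when the model fits.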

The full optimization hierarchy

When diagnosing a slow on-prem LLM deployment, work through these in order:

▸Inference engine: Are you using a production serving engine (vLLM, SGLang, TensorRT-LLM)? If not, this is the fix. Expected impact: 3–7× throughput.
▸Prompt size: What's your average input token count? Reduce it by tightening retrieval or truncating context. Expected impact: latency reductions of more than 90% in the best case.
▸Quantization: Are you running the most efficient quantization format for your use case? Expected impact: 1.3–2.5× throughput increase.
▸Continuous batching + paged KV cache: Is your serving engine configured correctly? Is prefix caching enabled? Expected impact: 3–5× concurrency improvement.
▸Speculative decoding: Do you have a draft model available? Expected impact: 1.25–1.6×.
▸Parallelism strategy: Replication (data parallel) or tensor parallel? Are you on NVLink? Expected impact: 10–150% depending on setup.

WARNING

These levers compound rather than working in isolation. A well-tuned engine combined with good quantization and continuous batching delivers gains that stack on top of one another. An RTX 4090 running SGLang + AWQ Q4 + continuous batching + prefix caching handles more production load than an A100 running llama.cpp with default settings.

The business case for getting this right

On-prem AI has a fundamentally different cost structure from cloud APIs: the hardware is a fixed cost, and the marginal cost of additional inference is nearly zero (power, maintenance). The ROI calculation depends entirely on utilization.

A $15,000 GPU server that serves 200 users with 30-second average response times has a very different ROI than the same server serving 200 users with 800ms average response times. User adoption — and therefore business value extracted from the system — correlates directly with perceived responsiveness.

Teams that get the serving stack right report that their AI deployments actually get used, rather than being tried by a few early adopters and then quietly abandoned.

Frequently asked questions

What's the minimum GPU for a production on-prem deployment?

An RTX 4090 (24GB VRAM) runs 7B–14B models at Q4 quantization with good concurrency. For 14B–34B models, an A10G (24GB) or RTX 3090 (24GB) is the minimum. For 70B+ models, you need either a 48GB+ card (A6000, RTX 6000 Ada) or multi-GPU with NVLink. A100 80GB and H100 80GB are the enterprise standards for large-scale deployment.

Is TensorRT-LLM worth the compilation complexity?

For a fixed model that won't change, yes: it's 30–70% faster than llama.cpp on the same hardware and 20–40% faster than vLLM in many configurations. The compilation step (2–4 hours per model variant) is the cost. For models you iterate on frequently, SGLang or vLLM (no compilation required) is more practical.

How does SGLang handle structured output (JSON) differently from vLLM?

SGLang has native RadixAttention for prefix caching, which is especially valuable in RAG pipelines where system prompts are shared across many requests. For structured output generation (JSON schemas), SGLang's constraint-based sampling is also faster than vLLM's. This is why SGLang has become the preferred engine for RAG-heavy deployments.

Can we use multiple inference engines in the same deployment?

Yes. A common production pattern: SGLang for the primary user-facing RAG queries, TensorRT-LLM for a secondary model used in batch document processing. Route different request types to the engine best suited for that workload.
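The routing pattern from this answer can be as simple as a lookup table in front of the engines. The sketch below is illustrative only: the request type names, backend names, and internal URLs are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    base_url: str  # hypothetical internal endpoint


# Map each request type to the engine best suited for it.
ROUTES = {
    "rag_query": Backend("sglang", "http://sglang.internal:30000"),
    "batch_document": Backend("tensorrt-llm", "http://trtllm.internal:8000"),
}
DEFAULT = ROUTES["rag_query"]


def pick_backend(request_type: str) -> Backend:
    """Return the backend for a request type, falling back to the default engine."""
    return ROUTES.get(request_type, DEFAULT)
```

Keeping the mapping in one place means a new engine can be trialed on a single request type without touching the rest of the deployment.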
