Production RAG on 6GB VRAM: Qwen3.5 4B + nomic-embed
Running a production-capable local RAG stack on a single 6GB VRAM GPU. Qwen3.5 4B at Q4_K_M quantization delivers 25–40 tok/s. nomic-embed-text at 274MB handles embeddings. Full setup, benchmarks, and caveats.
Most enterprise RAG tutorials assume you're using OpenAI's API — unlimited compute, no hardware constraints, data going to a third-party server. But for organizations with data privacy requirements (legal firms, clinics, financial services), that's not an option.
This post is the complete guide to running a production-capable RAG system entirely on a single 6GB VRAM GPU. No data leaves your infrastructure. All inference is local.
The stack: Qwen3.5 4B at Q4_K_M quantization for generation, nomic-embed-text for embeddings, Qdrant for vector storage. We've deployed this stack for Gulf financial and medical clients.
"Production-capable" here means: handles 10–50 concurrent users, sub-800ms P95 response latency, citation-backed responses, and stays within a 6GB VRAM budget. It's not the same as a 100-concurrent-user cloud deployment, but it's real production for many enterprise teams.
Hardware requirements
Minimum viable stack
| Component | Minimum | Recommended |
|---|---|---|
| GPU VRAM | 6GB | 8GB |
| GPU | RTX 3060 12GB / RTX 4060 | RTX 4080 / A10G |
| System RAM | 16GB | 32GB |
| Storage | 100GB SSD | 500GB NVMe |
| CPU | 8 cores | 16 cores |
Note that the RTX 3060 12GB listed above has 12GB of VRAM, not 6GB. The 6GB constraint describes the smallest cards this stack targets: RTX 3060 (6GB variant), RTX 4060 (8GB), or an older P4000 (8GB). Here's the practical VRAM breakdown:
| Model | Quantization | VRAM usage | Speed (RTX 4060) |
|---|---|---|---|
| Qwen3.5 4B | Q4_K_M | ~3.0 GB | 30–45 tok/s |
| Qwen3.5 4B | Q5_K_M | ~3.5 GB | 25–35 tok/s |
| Qwen3.5 4B | Q8_0 | ~4.8 GB | 18–25 tok/s |
| nomic-embed-text | — | 274 MB | 500 embed/s |
| Qdrant overhead | — | ~200 MB | — |
| Total (Q4_K_M) | — | ~3.5 GB | — |
On a 6GB card, Q4_K_M leaves 2.5GB headroom for context. Comfortable for production RAG with 4K–8K context windows.
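As a quick sanity check on that headroom figure, the budget can be written out as arithmetic. This is a minimal sketch using the approximate figures from the table above. The `vram_headroom_gb` helper is a hypothetical name, and the sketch does not model KV-cache growth, which depends on context length and should be measured on your own hardware.

```python
# Approximate VRAM figures taken from the table above (GB).
COMPONENTS_GB = {
    "qwen3.5-4b-q4_k_m": 3.0,
    "nomic-embed-text": 0.274,
    "qdrant-overhead": 0.2,
}


def vram_headroom_gb(card_gb: float, components: dict[str, float]) -> float:
    """Return the VRAM left after loading every component (hypothetical helper)."""
    return card_gb - sum(components.values())


total = sum(COMPONENTS_GB.values())
headroom = vram_headroom_gb(6.0, COMPONENTS_GB)
print(f"Total: {total:.2f} GB, headroom on a 6 GB card: {headroom:.2f} GB")
```

The result rounds to the ~3.5 GB total and ~2.5 GB headroom quoted above. Swapping in the Q5_K_M or Q8_0 figures shows how quickly the margin for context shrinks.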
Setting up Ollama
Ollama is the simplest way to run quantized models locally with an OpenAI-compatible API.
# Install Ollama (Linux)
curl -fsSL https://ollama.ai/install.sh | sh
# Pull Qwen3.5 4B Q4_K_M — this is the GGUF quantized version
ollama pull qwen3.5:4b-instruct-q4_K_M
# Pull nomic-embed-text for embeddings
ollama pull nomic-embed-text
# Verify GPU detection
ollama run qwen3.5:4b-instruct-q4_K_M "Hello"
ollama ps
# The PROCESSOR column should read "100% GPU" while the model is loaded
Ollama model configuration
For production, override Ollama's defaults to reduce hallucination risk in RAG:
# Create a Modelfile for production RAG settings
cat > Modelfile << 'EOF'
FROM qwen3.5:4b-instruct-q4_K_M
# System prompt for constrained RAG responses
SYSTEM """You are a precise knowledge assistant. Answer questions ONLY based on the provided context.
If the answer is not in the context, say exactly: "This information is not in the available documents."
Always cite the document name and section when quoting or paraphrasing.
Be concise. Avoid speculation."""
# Production parameters (Modelfile comments must sit on their own line)
# Low temperature for factual tasks
PARAMETER temperature 0.1
PARAMETER top_p 0.9
# Context window
PARAMETER num_ctx 8192
# Max response tokens
PARAMETER num_predict 512
# Reduce repetition
PARAMETER repeat_penalty 1.1
EOF
ollama create verel-rag -f Modelfile
Vector database: Qdrant self-hosted
Qdrant is our default for on-prem RAG. It's a Rust-based vector database with excellent performance on modest hardware and clean Python and REST APIs.
# Run Qdrant with Docker (persistent storage)
docker run -d \
  --name qdrant \
  -p 6333:6333 \
  -p 6334:6334 \
  -v "$(pwd)/qdrant_storage:/qdrant/storage:z" \
  qdrant/qdrant:latest
# Verify it's running
curl http://localhost:6333/healthz
# healthz check passed
The ingestion pipeline
Document processing strategy
Chunking strategy is the most impactful decision in a RAG pipeline. Wrong chunking degrades retrieval quality more than model selection does.
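To make the chunk size and overlap trade-off concrete, here is a minimal sketch of fixed-window chunking with overlap. It is a simplification for illustration, not the splitter used in this pipeline: it cuts on raw character offsets and ignores sentence and paragraph boundaries. The function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each window repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk. The pipeline below uses LangChain's separator-aware splitter instead, which prefers to cut at paragraph and sentence breaks before falling back to raw offsets.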
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import ollama
import uuid
# ── Configuration ──────────────────────────────────────────────
CHUNK_SIZE = 512 # characters per chunk (this splitter counts characters, not tokens, by default)
CHUNK_OVERLAP = 64 # characters of overlap to preserve cross-boundary context
EMBED_MODEL = "nomic-embed-text"
COLLECTION_NAME = "enterprise_docs"
EMBED_DIM = 768 # nomic-embed-text output dimension
# ── Initialize Qdrant ──────────────────────────────────────────
qdrant = QdrantClient(host="localhost", port=6333)
# Create collection if it doesn't exist
if COLLECTION_NAME not in [c.name for c in qdrant.get_collections().collections]:
    qdrant.create_collection(
        collection_name=COLLECTION_NAME,
        vectors_config=VectorParams(size=EMBED_DIM, distance=Distance.COSINE),
    )
# ── Text splitter ──────────────────────────────────────────────
splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    separators=["\n\n", "\n", ". ", "! ", "? ", " ", ""],
)
# ── Embedding function ─────────────────────────────────────────
def embed_texts(texts: list[str]) -> list[list[float]]:
    """Batch embed with nomic-embed-text via Ollama."""
    response = ollama.embed(model=EMBED_MODEL, input=texts)
    return response["embeddings"]
# ── Ingestion ──────────────────────────────────────────────────
def ingest_directory(path: str, namespace: str = "default"):
    """
    Ingest all PDFs from a directory.
    namespace: use for tenant isolation in multi-client deployments
    """
    loader = DirectoryLoader(path, glob="**/*.pdf", loader_cls=PyPDFLoader)
    docs = loader.load()  # PyPDFLoader yields one Document per PDF page
    chunks = splitter.split_documents(docs)
    print(f"Ingesting {len(chunks)} chunks from {len(docs)} pages...")
    # Embed and upsert in batches of 32 so memory use stays flat on large corpora
    BATCH_SIZE = 32
    total = 0
    for i in range(0, len(chunks), BATCH_SIZE):
        batch = chunks[i : i + BATCH_SIZE]
        texts = [c.page_content for c in batch]
        vecs = embed_texts(texts)
        points = [
            PointStruct(
                id=str(uuid.uuid4()),
                vector=vec,
                payload={
                    "text": chunk.page_content,
                    "source": chunk.metadata.get("source", "unknown"),
                    "page": chunk.metadata.get("page", 0),
                    "namespace": namespace,
                },
            )
            for chunk, vec in zip(batch, vecs)
        ]
        qdrant.upsert(collection_name=COLLECTION_NAME, points=points)
        total += len(points)
    print(f"✓ Ingested {total} vectors")
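One detail worth checking before ingesting at scale: the nomic-embed-text model card describes task prefixes (`search_document: ` for stored passages, `search_query: ` for queries) that the model was trained with. The code above embeds raw text, so the helper below is an optional sketch, not part of the pipeline as written. `with_task_prefix` and `TASK_PREFIXES` are hypothetical names, and whether prefixes improve retrieval on your corpus is something to verify by testing.

```python
# Task prefixes as described in the nomic-embed-text model card.
TASK_PREFIXES = {
    "document": "search_document: ",
    "query": "search_query: ",
}


def with_task_prefix(texts: list[str], task: str) -> list[str]:
    """Prepend the prefix for `task` to each text, leaving already-prefixed texts alone."""
    prefix = TASK_PREFIXES[task]
    return [t if t.startswith(prefix) else prefix + t for t in texts]


doc_inputs = with_task_prefix(["Annual leave accrues monthly."], "document")
query_inputs = with_task_prefix(["How does leave accrue?"], "query")
```

If you adopt this, apply the document prefix at ingestion and the query prefix at retrieval, and re-embed any vectors stored without it, since mixing prefixed and unprefixed embeddings in one collection makes scores hard to compare.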
Ingest speed benchmarks
Measured on an RTX 4060 (16GB system RAM):
| Corpus size | Chunk count | nomic-embed time | Qdrant write time | Total |
|---|---|---|---|---|
| 100 PDFs (~500 pages) | ~2,500 | 12s | 2s | ~15s |
| 1,000 PDFs (~5,000 pages) | ~25,000 | 90s | 18s | ~2 min |
| 10,000 PDFs (~50,000 pages) | ~250,000 | 15 min | 3 min | ~18 min |
Initial ingestion of a 10,000-document corpus completes in under 20 minutes on modest hardware. Incremental updates (new documents only) complete in seconds.
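Incremental updates only stay cheap if re-ingesting a file does not create duplicate points. The ingestion code above assigns random `uuid.uuid4()` IDs, so re-running it on the same file adds a second copy. One possible approach, sketched here as an assumption and not as the deployed behavior, is to derive each point ID from the chunk's tenant, source, page, and position so that an upsert overwrites the earlier copy. `NAMESPACE_RAG` and `chunk_point_id` are hypothetical names.

```python
import uuid

# Hypothetical fixed namespace for this application; any constant UUID works.
NAMESPACE_RAG = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/rag-chunks")


def chunk_point_id(source: str, page: int, chunk_index: int, namespace: str = "default") -> str:
    """Derive a stable UUID string from a chunk's tenant, file, page, and position."""
    key = f"{namespace}|{source}|{page}|{chunk_index}"
    return str(uuid.uuid5(NAMESPACE_RAG, key))


first = chunk_point_id("policies/leave.pdf", page=3, chunk_index=0)
again = chunk_point_id("policies/leave.pdf", page=3, chunk_index=0)
other = chunk_point_id("policies/leave.pdf", page=3, chunk_index=1)
print(first == again, first == other)  # True False
```

Stable IDs make upserts idempotent, but they do not remove chunks that disappear when a document shrinks. Deleting a source's existing points before re-ingesting it covers that case.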
The retrieval and generation pipeline
from openai import OpenAI # Ollama uses OpenAI-compatible API
# Connect to local Ollama via OpenAI SDK
llm_client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama", # Ollama doesn't require a real API key
)
def retrieve(query: str, namespace: str = "default", top_k: int = 6) -> list[dict]:
    """Retrieve top-k relevant chunks with namespace filtering."""
    query_vec = embed_texts([query])[0]
    # Note: recent qdrant-client releases deprecate search() in favor of query_points()
    results = qdrant.search(
        collection_name=COLLECTION_NAME,
        query_vector=query_vec,
        limit=top_k,
        query_filter={
            "must": [{"key": "namespace", "match": {"value": namespace}}]
        },
        with_payload=True,
        score_threshold=0.6, # discard low-relevance chunks
    )
    return [
        {
            "text": r.payload["text"],
            "source": r.payload["source"],
            "page": r.payload["page"],
            "score": r.score,
        }
        for r in results
    ]
def generate(query: str, namespace: str = "default") -> dict:
    """Full RAG pipeline: retrieve → format context → generate → cite."""
    chunks = retrieve(query, namespace)
    if not chunks:
        return {
            "answer": "This information is not in the available documents.",
            "sources": [],
            "chunks_used": 0,
        }
    # Format context with source attribution
    context_parts = []
    for i, c in enumerate(chunks, 1):
        context_parts.append(
            f"[Source {i}: {c['source']}, p.{c['page']}]\n{c['text']}"
        )
    context = "\n\n---\n\n".join(context_parts)
    prompt = f"""Answer the following question using ONLY the provided context.
Cite sources using [Source N] notation when referencing specific information.
Context:
{context}
Question: {query}
Answer:"""
    response = llm_client.chat.completions.create(
        model="verel-rag", # our custom Modelfile model
        messages=[{"role": "user", "content": prompt}],
        stream=False,
    )
    answer = response.choices[0].message.content
    # Extract referenced sources from the answer
    cited_sources = []
    for i, c in enumerate(chunks, 1):
        if f"[Source {i}]" in answer:
            cited_sources.append({
                "index": i,
                "file": c["source"],
                "page": c["page"],
                "score": round(c["score"], 3),
            })
    return {
        "answer": answer,
        "sources": cited_sources,
        "chunks_used": len(chunks),
    }
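The citation step above checks for each `[Source N]` tag with a substring test. A slightly stricter variant, shown here as an illustrative sketch and not as the deployed code, parses every tag with a regular expression, keeps the order of first use, and drops numbers that do not correspond to a retrieved chunk (which can happen when the model invents a citation). `cited_indices` is a hypothetical helper name and the sample answer is made up.

```python
import re

SOURCE_TAG = re.compile(r"\[Source (\d+)\]")


def cited_indices(answer: str, num_chunks: int) -> list[int]:
    """Return distinct, in-range source numbers cited in `answer`, in order of first use."""
    seen: list[int] = []
    for match in SOURCE_TAG.finditer(answer):
        n = int(match.group(1))
        if 1 <= n <= num_chunks and n not in seen:
            seen.append(n)
    return seen


# Made-up answer text for illustration.
answer = (
    "Leave accrues monthly [Source 2]. Carry-over is capped [Source 1][Source 2]. "
    "See also [Source 9]."
)
print(cited_indices(answer, num_chunks=6))  # [2, 1]
```

An answer that cites nothing, or cites only out-of-range numbers, returns an empty list, which is a useful signal to log or to treat as a low-confidence response.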
Performance benchmarks (RTX 4060, Q4_K_M)
Measured on a production deployment with a 10,000-document corpus:
| Metric | P50 | P95 | P99 |
|---|---|---|---|
| Retrieval latency (Qdrant) | 15ms | 35ms | 60ms |
| Embedding latency (nomic) | 40ms | 80ms | 120ms |
| LLM generation (200 tokens) | 380ms | 520ms | 650ms |
| Total end-to-end | 430ms | 620ms | 800ms |
| Concurrent users @ P95 <1s | 8–12 | — | — |
P95 latency of 620ms (800ms at P99) is well within the acceptable range for document Q&A use cases. Users interacting with a knowledge base tolerate slightly higher latency than they would in a chat application. The 430ms P50 is excellent for on-prem hardware.
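To reproduce P50/P95/P99 figures like these from your own request logs, a percentile function is all that is needed. The sketch below uses the nearest-rank definition, which is one of several common conventions and not necessarily the one used for the table above. The sample latencies are synthetic and for illustration only.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Synthetic latencies in ms (1, 2, ..., 100); not measurements.
latencies_ms = [float(ms) for ms in range(1, 101)]
summary = {f"P{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
print(summary)  # {'P50': 50.0, 'P95': 95.0, 'P99': 99.0}
```

Tail percentiles need enough samples to mean anything: with fewer than a few hundred requests, P99 is effectively the single slowest request.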
Production considerations
Multi-tenant isolation
For deployments where multiple clients or departments share infrastructure, use Qdrant's payload filtering for namespace isolation:
# Each client gets a unique namespace
# Data is stored in the same collection but isolated by filter
result = qdrant.search(
    collection_name=COLLECTION_NAME,
    query_vector=query_vec,
    query_filter={"must": [{"key": "namespace", "match": {"value": client_id}}]},
    limit=6,
)
This is simpler than separate collections per tenant and scales to millions of vectors, provided the namespace field has a payload index so that filtered searches stay fast.
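Because isolation in a shared collection depends on every query carrying the namespace filter, it can help to build that filter in one place and refuse to build it for an empty tenant ID. The helper below is a minimal sketch of that idea. `namespace_filter` is a hypothetical name, and the dict it returns has the same shape as the filter used in the snippets above.

```python
def namespace_filter(client_id: str) -> dict:
    """Build the payload filter for one tenant, rejecting empty identifiers."""
    if not isinstance(client_id, str) or not client_id.strip():
        raise ValueError("client_id must be a non-empty string")
    return {"must": [{"key": "namespace", "match": {"value": client_id}}]}


tenant_filter = namespace_filter("clinic_a")
print(tenant_filter)
```

Routing every search through a helper like this means a missing or blank tenant ID fails loudly instead of silently searching with a filter that matches nothing, or with no filter at all.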
Hybrid search (dense + sparse)
For production deployments where recall matters (you can't afford to miss relevant documents), add BM25 sparse search alongside vector search:
from qdrant_client.models import (
    FieldCondition, Filter, Fusion, FusionQuery, MatchValue, Prefetch, SparseVector,
)
from fastembed import SparseTextEmbedding
sparse_model = SparseTextEmbedding(model_name="Qdrant/bm25")
def hybrid_retrieve(query: str, namespace: str, top_k: int = 6) -> list[dict]:
    # Assumes the collection also defines a sparse vector named "text-sparse"
    # and that sparse vectors were written alongside the dense ones at ingestion.
    # Dense vector (semantic)
    dense_vec = embed_texts([query])[0]
    # Sparse vector (BM25 keyword)
    sparse_result = next(iter(sparse_model.query_embed(query)))
    sparse_vec = SparseVector(
        indices=sparse_result.indices.tolist(),
        values=sparse_result.values.tolist(),
    )
    # Filter each branch so fusion only ever sees this tenant's points
    ns_filter = Filter(
        must=[FieldCondition(key="namespace", match=MatchValue(value=namespace))]
    )
    results = qdrant.query_points(
        collection_name=COLLECTION_NAME,
        prefetch=[
            Prefetch(query=dense_vec, filter=ns_filter, limit=20),
            Prefetch(query=sparse_vec, using="text-sparse", filter=ns_filter, limit=20),
        ],
        query=FusionQuery(fusion=Fusion.RRF), # Reciprocal Rank Fusion
        limit=top_k,
        query_filter=ns_filter,
        with_payload=True,
    )
    return [
        {
            "text": r.payload["text"],
            "source": r.payload["source"],
            # ... remaining fields as in retrieve()
        }
        for r in results.points
    ]
Hybrid search typically improves recall by 10–20% over pure vector search, especially for queries with specific technical terms (model names, product codes) that semantic search can miss.
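For readers unfamiliar with Reciprocal Rank Fusion, the idea fits in a few lines: each ranked list contributes 1 / (k + rank) to a document's score, and documents are re-sorted by the sum. The sketch below is a plain-Python illustration of the technique, not Qdrant's internal implementation. The constant k = 60 is the value commonly used in the literature and is an assumption here, and the document IDs are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: each list adds 1 / (k + rank) per item, with rank starting at 1."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense = ["doc_a", "doc_b", "doc_c"]   # hypothetical semantic ranking
sparse = ["doc_c", "doc_a", "doc_d"]  # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([dense, sparse])
print([doc_id for doc_id, _ in fused])  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that appear in both lists (`doc_a`, `doc_c`) rise above those found by only one retriever. Because RRF uses ranks, not raw scores, it needs no calibration between cosine similarity and BM25 values.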
Deployment checklist
- GPU driver ≥ CUDA 12.1 (required by Ollama)
- Qdrant running with persistent volume mount (data survives restarts)
- Ollama systemd service (auto-restart on crash)
- GPU persistence mode: nvidia-smi -pm 1 (keeps the NVIDIA driver loaded while the GPU is idle, avoiding re-initialization delays)
- Monitoring: track GPU utilization, VRAM usage, request queue depth
- Backup: Qdrant collection snapshots daily (POST /collections/{name}/snapshots)
- Rate limiting on the API layer (prevent one user from saturating the GPU)
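For the rate-limiting item, a per-user token bucket is one common choice. The sketch below is a minimal in-process version under stated assumptions: `TokenBucket`, `allow_request`, and the default rate and capacity values are all hypothetical, and a multi-worker deployment would need shared state (for example in a reverse proxy or a shared store) instead of a module-level dict.

```python
import time


class TokenBucket:
    """Per-user token bucket: `rate` requests per second, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self) -> bool:
        """Refill according to elapsed time, then spend one token if available."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


buckets: dict[str, TokenBucket] = {}


def allow_request(user_id: str, rate: float = 1.0, capacity: float = 3.0) -> bool:
    """Look up (or create) the caller's bucket and check whether the request may proceed."""
    if user_id not in buckets:
        buckets[user_id] = TokenBucket(rate, capacity)
    return buckets[user_id].allow()
```

A request handler would call `allow_request(user_id)` before invoking the RAG pipeline and return HTTP 429 when it yields `False`. The injectable `clock` argument exists so the limiter can be tested without sleeping.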
Frequently asked questions
Can I run this on a CPU-only server?
Yes, with a significant latency tradeoff. On a modern 16-core CPU, Qwen3.5 4B Q4_K_M generates at 4–8 tok/s vs 30–45 tok/s on an RTX 4060. End-to-end latency increases to 2–5 seconds, which is acceptable for asynchronous batch processing but not for real-time Q&A.
What if my documents are in Arabic?
Qwen3.5 was pre-trained on multilingual data including Arabic. Performance on Arabic RAG is good for Modern Standard Arabic (MSA). For mixed Arabic/English document corpora, we recommend Multilingual-E5-large as the embedding model instead of nomic-embed-text, as it has stronger Arabic embedding quality. It produces 1024-dimensional vectors, so EMBED_DIM must be updated and the collection re-created and re-ingested.
Can Qdrant handle 1 million+ vectors on a single instance?
Yes. Qdrant comfortably handles 10M+ vectors on a single server with 32GB RAM. At 1M vectors with 768-dim embeddings, memory usage is approximately 3–4GB. Qdrant uses HNSW indexing with configurable m and ef_construct parameters to balance recall and memory.
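The memory figure follows from simple arithmetic: each float32 value takes 4 bytes, so the raw vectors alone account for most of it. The sketch below works that out. The `overhead_factor` parameter is an assumption standing in for HNSW index and payload overhead, which varies with the m setting, payload size, and any quantization, and both function names are hypothetical.

```python
def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold the vectors themselves (float32 is 4 bytes per value)."""
    return num_vectors * dim * bytes_per_value


def estimated_ram_gb(num_vectors: int, dim: int, overhead_factor: float) -> float:
    """Raw vector size times an assumed overhead factor, in decimal GB."""
    return raw_vector_bytes(num_vectors, dim) * overhead_factor / 1e9


raw_gb = raw_vector_bytes(1_000_000, 768) / 1e9
print(f"Raw vectors: {raw_gb:.2f} GB")  # Raw vectors: 3.07 GB
print(f"With an assumed 1.3x overhead: {estimated_ram_gb(1_000_000, 768, 1.3):.2f} GB")
```

At 1M vectors the raw data is about 3.07 GB, so the 3–4GB figure corresponds to a modest overhead on top. Measure actual usage on your own collection before sizing a server, since large payloads can dominate.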
Is Q4_K_M quantization safe for production?
For RAG (where the model is grounding responses in retrieved context), Q4_K_M quality loss is negligible compared to full precision. The factual accuracy is gated by retrieval quality, not model precision. We run Q4_K_M in production for Gulf enterprise clients with no complaints about answer quality.