Production RAG on 6GB VRAM: Qwen3.5 4B + nomic-embed
RAG 15 min2025-10-20

Production RAG on 6GB VRAM: Qwen3.5 4B + nomic-embed

Running a production-capable local RAG stack on a single 6GB VRAM GPU. Qwen3.5 4B at Q4_K_M quantization delivers 25–40 tok/s. nomic-embed-text at 274MB handles embeddings. Full setup, benchmarks, and caveats.

Most enterprise RAG tutorials assume you're using OpenAI's API — unlimited compute, no hardware constraints, data going to a third-party server. But for organizations with data privacy requirements (legal firms, clinics, financial services), that's not an option.

This post is the complete guide to running a production-capable RAG system entirely on a single 6GB VRAM GPU. No data leaves your infrastructure. All inference is local.

The stack: Qwen3.5 4B at Q4_K_M quantization for generation, nomic-embed-text for embeddings, Qdrant for vector storage. We've deployed this stack for Gulf financial and medical clients.

NOTE

"Production-capable" here means: handles 10–50 concurrent users, sub-800ms P95 response latency, citation-backed responses, and stays within a 6GB VRAM budget. It's not the same as a 100-concurrent-user cloud deployment, but it's real production for many enterprise teams.

Hardware requirements

Minimum viable stack

ComponentMinimumRecommended
GPU VRAM6GB8GB
GPURTX 3060 12GB / RTX 4060RTX 4080 / A10G
System RAM16GB32GB
Storage100GB SSD500GB NVMe
CPU8 cores16 cores

Wait — RTX 3060 12GB has 12GB VRAM, not 6GB. The 6GB constraint is for the actual minimal option: RTX 3060 (6GB variant), RTX 4060 (8GB), or an older P4000 (8GB). Here's the practical VRAM breakdown:

ModelQuantizationVRAM usageSpeed (RTX 4060)
Qwen3.5 4BQ4_K_M~3.0 GB30–45 tok/s
Qwen3.5 4BQ5_K_M~3.5 GB25–35 tok/s
Qwen3.5 4BQ8_0~4.8 GB18–25 tok/s
nomic-embed-text274 MB500 embed/s
Qdrant overhead~200 MB
Total (Q4_K_M)~3.5 GB

On a 6GB card, Q4_K_M leaves 2.5GB headroom for context. Comfortable for production RAG with 4K–8K context windows.

Setting up Ollama

Ollama is the simplest way to run quantized models locally with an OpenAI-compatible API.

# Install Ollama (Linux)
curl -fsSL https://ollama.ai/install.sh | sh

# Pull Qwen3.5 4B Q4_K_M — this is the GGUF quantized version
ollama pull qwen3.5:4b-instruct-q4_K_M

# Pull nomic-embed-text for embeddings
ollama pull nomic-embed-text

# Verify GPU detection
ollama run qwen3.5:4b-instruct-q4_K_M "Hello" 2>&1 | grep -i gpu
# Should show: GPU: NVIDIA ...

Ollama model configuration

For production, override Ollama's defaults to reduce hallucination risk in RAG:

# Create a Modelfile for production RAG settings
cat > Modelfile << 'EOF'
FROM qwen3.5:4b-instruct-q4_K_M

# System prompt for constrained RAG responses
SYSTEM """You are a precise knowledge assistant. Answer questions ONLY based on the provided context.
If the answer is not in the context, say exactly: "This information is not in the available documents."
Always cite the document name and section when quoting or paraphrasing.
Be concise. Avoid speculation."""

# Production parameters
PARAMETER temperature 0.1      # low temp for factual tasks
PARAMETER top_p 0.9
PARAMETER num_ctx 8192          # context window
PARAMETER num_predict 512       # max response tokens
PARAMETER repeat_penalty 1.1   # reduce repetition
EOF

ollama create verel-rag -f Modelfile

Vector database: Qdrant self-hosted

Qdrant is our default for on-prem RAG. It's a Rust-based vector database with excellent performance on modest hardware and clean Python and REST APIs.

# Run Qdrant with Docker (persistent storage)
docker run -d \
  --name qdrant \
  -p 6333:6333 \
  -p 6334:6334 \
  -v $(pwd)/qdrant_storage:/qdrant/storage:z \
  qdrant/qdrant:latest

# Verify it's running
curl http://localhost:6333/healthz
# {"status":"ok","time":0.000...}

The ingestion pipeline

Document processing strategy

Chunking strategy is the most impactful decision in a RAG pipeline. Wrong chunking degrades retrieval quality more than model selection does.

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import ollama
import uuid

# ── Configuration ──────────────────────────────────────────────
CHUNK_SIZE       = 512     # tokens per chunk
CHUNK_OVERLAP    = 64      # overlap to preserve cross-boundary context
EMBED_MODEL      = "nomic-embed-text"
COLLECTION_NAME  = "enterprise_docs"
EMBED_DIM        = 768     # nomic-embed-text output dimension

# ── Initialize Qdrant ──────────────────────────────────────────
qdrant = QdrantClient(host="localhost", port=6333)

# Create collection if it doesn't exist
if COLLECTION_NAME not in [c.name for c in qdrant.get_collections().collections]:
    qdrant.create_collection(
        collection_name=COLLECTION_NAME,
        vectors_config=VectorParams(size=EMBED_DIM, distance=Distance.COSINE),
    )

# ── Text splitter ──────────────────────────────────────────────
splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    separators=["\n\n", "\n", ". ", "! ", "? ", " ", ""],
)

# ── Embedding function ─────────────────────────────────────────
def embed_texts(texts: list[str]) -> list[list[float]]:
    """Batch embed with nomic-embed-text via Ollama."""
    response = ollama.embed(model=EMBED_MODEL, input=texts)
    return response["embeddings"]

# ── Ingestion ──────────────────────────────────────────────────
def ingest_directory(path: str, namespace: str = "default"):
    """
    Ingest all PDFs from a directory.
    namespace: use for tenant isolation in multi-client deployments
    """
    loader = DirectoryLoader(path, glob="**/*.pdf", loader_cls=PyPDFLoader)
    docs   = loader.load()
    chunks = splitter.split_documents(docs)
    
    print(f"Ingesting {len(chunks)} chunks from {len(docs)} documents...")
    
    # Process in batches of 32 for embedding
    BATCH_SIZE = 32
    points = []
    
    for i in range(0, len(chunks), BATCH_SIZE):
        batch = chunks[i : i + BATCH_SIZE]
        texts = [c.page_content for c in batch]
        vecs  = embed_texts(texts)
        
        for chunk, vec in zip(batch, vecs):
            points.append(PointStruct(
                id=str(uuid.uuid4()),
                vector=vec,
                payload={
                    "text":      chunk.page_content,
                    "source":    chunk.metadata.get("source", "unknown"),
                    "page":      chunk.metadata.get("page", 0),
                    "namespace": namespace,
                }
            ))
    
    qdrant.upsert(collection_name=COLLECTION_NAME, points=points)
    print(f"✓ Ingested {len(points)} vectors")

Ingest speed benchmarks

Measured on an RTX 4060 (16GB system RAM):

Corpus sizeChunk countnomic-embed timeQdrant write timeTotal
100 PDFs (~500 pages)~2,50012s2s~15s
1,000 PDFs (~5,000 pages)~25,00090s18s~2 min
10,000 PDFs (~50,000 pages)~250,00015 min3 min~18 min

Initial ingestion of a 10,000-document corpus completes in under 20 minutes on modest hardware. Incremental updates (new documents only) complete in seconds.

The retrieval and generation pipeline

from openai import OpenAI  # Ollama uses OpenAI-compatible API

# Connect to local Ollama via OpenAI SDK
llm_client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Ollama doesn't require a real API key
)

def retrieve(query: str, namespace: str = "default", top_k: int = 6) -> list[dict]:
    """Retrieve top-k relevant chunks with namespace filtering."""
    query_vec = embed_texts([query])[0]
    
    results = qdrant.search(
        collection_name=COLLECTION_NAME,
        query_vector=query_vec,
        limit=top_k,
        query_filter={
            "must": [{"key": "namespace", "match": {"value": namespace}}]
        },
        with_payload=True,
        score_threshold=0.6,   # discard low-relevance chunks
    )
    
    return [
        {
            "text":   r.payload["text"],
            "source": r.payload["source"],
            "page":   r.payload["page"],
            "score":  r.score,
        }
        for r in results
    ]

def generate(query: str, namespace: str = "default") -> dict:
    """Full RAG pipeline: retrieve → format context → generate → cite."""
    
    chunks = retrieve(query, namespace)
    
    if not chunks:
        return {
            "answer":   "This information is not in the available documents.",
            "sources":  [],
            "chunks_used": 0,
        }
    
    # Format context with source attribution
    context_parts = []
    for i, c in enumerate(chunks, 1):
        context_parts.append(
            f"[Source {i}: {c['source']}, p.{c['page']}]\n{c['text']}"
        )
    context = "\n\n---\n\n".join(context_parts)
    
    prompt = f"""Answer the following question using ONLY the provided context.
Cite sources using [Source N] notation when referencing specific information.

Context:
{context}

Question: {query}

Answer:"""
    
    response = llm_client.chat.completions.create(
        model="verel-rag",  # our custom Modelfile model
        messages=[{"role": "user", "content": prompt}],
        stream=False,
    )
    
    answer = response.choices[0].message.content
    
    # Extract referenced sources from the answer
    cited_sources = []
    for i, c in enumerate(chunks, 1):
        if f"[Source {i}]" in answer:
            cited_sources.append({
                "index":  i,
                "file":   c["source"],
                "page":   c["page"],
                "score":  round(c["score"], 3),
            })
    
    return {
        "answer":       answer,
        "sources":      cited_sources,
        "chunks_used":  len(chunks),
    }

Performance benchmarks (RTX 4060, Q4_K_M)

Measured on a production deployment with a 10,000-document corpus:

MetricP50P95P99
Retrieval latency (Qdrant)15ms35ms60ms
Embedding latency (nomic)40ms80ms120ms
LLM generation (200 tokens)380ms520ms650ms
Total end-to-end430ms620ms800ms
Concurrent users @ P95 <1s8–12
TIP

P95 latency at 800ms is well within acceptable range for document Q&A use cases. Users interacting with a knowledge base expect slightly higher latency than a chat application. The 630ms P50 is excellent for on-prem hardware.

Production considerations

Multi-tenant isolation

For deployments where multiple clients or departments share infrastructure, use Qdrant's payload filtering for namespace isolation:

# Each client gets a unique namespace
# Data is stored in the same collection but isolated by filter
result = qdrant.search(
    collection_name=COLLECTION_NAME,
    query_vector=query_vec,
    query_filter={"must": [{"key": "namespace", "match": {"value": client_id}}]},
    limit=6,
)

This is simpler than separate collections per tenant and performs identically at scale up to millions of vectors.

Hybrid search (dense + sparse)

For production deployments where recall matters (you can't afford to miss relevant documents), add BM25 sparse search alongside vector search:

from qdrant_client.models import SparseVector, NamedSparseVector
from fastembed import SparseTextEmbedding

sparse_model = SparseTextEmbedding(model_name="Qdrant/bm25")

def hybrid_retrieve(query: str, namespace: str, top_k: int = 6) -> list[dict]:
    # Dense vector (semantic)
    dense_vec = embed_texts([query])[0]
    
    # Sparse vector (BM25 keyword)
    sparse_result = list(sparse_model.query_embed(query))[0]
    sparse_vec = SparseVector(
        indices=sparse_result.indices.tolist(),
        values=sparse_result.values.tolist(),
    )
    
    results = qdrant.query_points(
        collection_name=COLLECTION_NAME,
        prefetch=[
            {"query": dense_vec,  "limit": 20},
            {"query": NamedSparseVector(name="text-sparse", vector=sparse_vec), "limit": 20},
        ],
        query=SparseVector(indices=[], values=[]),  # fusion
        using="rrf",    # Reciprocal Rank Fusion
        limit=top_k,
        query_filter={"must": [{"key": "namespace", "match": {"value": namespace}}]},
    )
    
    return [{"text": r.payload["text"], "source": r.payload["source"], ...} for r in results.points]

Hybrid search typically improves recall by 10–20% over pure vector search, especially for queries with specific technical terms (model names, product codes) that semantic search can miss.

Enterprise RAG Engines
We build private, citation-backed RAG systems for data-sensitive industries. On-prem deployment, zero data leaves your building. $8K–$30K.

Deployment checklist

  • GPU driver ≥ CUDA 12.1 (required by Ollama)
  • Qdrant running with persistent volume mount (data survives restarts)
  • Ollama systemd service (auto-restart on crash)
  • GPU memory lock: nvidia-smi -pm 1 (prevent driver power management from evicting model)
  • Monitoring: track GPU utilization, VRAM usage, request queue depth
  • Backup: Qdrant collection snapshots daily (POST /collections/{name}/snapshots)
  • Rate limiting on the API layer (prevent one user from saturating the GPU)

Frequently asked questions

Can I run this on a CPU-only server? Yes, with significant latency tradeoff. On a modern 16-core CPU, Qwen3.5 4B Q4_K_M generates at 4–8 tok/s vs 30–45 tok/s on an RTX 4060. End-to-end latency increases to 2–5 seconds. Acceptable for asynchronous batch processing but not for realtime Q&A.

What if my documents are in Arabic? Qwen3.5 was pre-trained on multilingual data including Arabic. Performance on Arabic RAG is good for MSA. For mixed Arabic/English document corpora, we recommend Multilingual-E5-large as the embedding model instead of nomic-embed-text, as it has stronger Arabic embedding quality.

Can Qdrant handle 1 million+ vectors on a single instance? Yes. Qdrant comfortably handles 10M+ vectors on a single server with 32GB RAM. At 1M vectors with 768-dim embeddings, memory usage is approximately 3–4GB. Qdrant uses HNSW indexing with configurable m and ef_construct parameters to balance recall and memory.

Is Q4_K_M quantization safe for production? For RAG (where the model is grounding responses in retrieved context), Q4_K_M quality loss is negligible compared to full precision. The factual accuracy is gated by retrieval quality, not model precision. We run Q4_K_M in production for Gulf enterprise clients with no complaints about answer quality.

RAG vs Fine-tuning: The Right Tool for Enterprise Knowledge

Related services