AI Engineering

Private RAG with Local Llama-3 and Qdrant [Deep Dive]

Dillip Chowdary
Tech Entrepreneur & Innovator · May 05, 2026 · 8 min read

Bottom Line

The clean pattern is simple: use a dedicated local embedding model for retrieval, store vectors in Qdrant, and reserve Llama 3 for final answer synthesis. That gives you a private RAG loop that runs on localhost after the initial model download.

Key Takeaways

  • Use llama3 for answer generation and embeddinggemma for retrieval, not one model for both jobs.
  • Qdrant local setup is a single Docker run on ports 6333 and 6334.
  • Collection vector size must match your embedding model output exactly, or inserts and queries will fail.
  • A minimal Python RAG loop only needs ollama and qdrant-client packages.

Private RAG is attractive for one reason: you can search internal notes, runbooks, and product docs without shipping them to a hosted LLM provider. The reliable pattern is to split the job in two. Use a local embedding model to turn text into vectors, let Qdrant handle nearest-neighbor retrieval, and use llama3 only for the final grounded answer. This walkthrough builds that stack end to end on localhost with Python.

What You'll Build


For a private RAG stack, keep retrieval and generation separate: store local embeddings in Qdrant, then pass only the top matching chunks into llama3. It is simpler, more accurate, and easier to debug than trying to make one model do everything.

Prerequisites

  • Python 3.11+
  • Docker installed and running for local Qdrant
  • Ollama installed from ollama.com/download
  • Enough local RAM or VRAM to run llama3 comfortably on your machine
  • A small private document set to index, preferably sanitized before demos

If you want to demo realistic internal content without exposing names, emails, or secrets, run the sample text through TechBytes’ Data Masking Tool first.

Architecture

  • Ollama runs local models and exposes APIs on http://localhost:11434/api.
  • embeddinggemma generates vectors for both document chunks and user questions.
  • Qdrant stores vectors plus payload such as source name and chunk text.
  • llama3 answers with retrieved context instead of raw memory.

This matches current Ollama guidance: the embeddings docs recommend dedicated embedding models such as embeddinggemma, qwen3-embedding, and all-minilm, and Qdrant’s local quickstart uses a Docker container exposed on 6333 and 6334.
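The four components above form a thin pipeline: embed the question, search Qdrant, synthesize with the chat model. The sketch below shows that shape with injected stand-in functions instead of live Ollama and Qdrant calls, so the control flow is visible on its own; the real implementations appear in Steps 2 and 3, and the `Hit` type and function names here are illustrative only.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hit:
    source: str
    text: str
    score: float


def rag_answer(
    question: str,
    embed: Callable[[str], List[float]],        # embeddinggemma via Ollama
    search: Callable[[List[float]], List[Hit]],  # Qdrant nearest-neighbor query
    chat: Callable[[str, str], str],             # llama3 via Ollama
) -> str:
    # 1. Vectorize the question with the same model used at indexing time.
    vector = embed(question)
    # 2. Retrieve only the top matching chunks, never the whole corpus.
    hits = search(vector)
    context = "\n\n".join(f"[{h.source}] {h.text}" for h in hits)
    # 3. Let the chat model answer strictly from the retrieved context.
    return chat(question, context)
```

Because the three dependencies are passed in, you can unit-test the loop with fakes before any model is downloaded.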

Step 1: Start the Local Stack

  1. Pull the local models

    Ollama’s CLI supports both ollama pull and ollama run. Pull the chat model and the embedding model up front so your first query does not stall on downloads.

    ollama pull llama3
    ollama pull embeddinggemma
  2. Start Qdrant locally

    Qdrant’s official quickstart uses the container below. The volume mount persists your vectors between runs.

    docker pull qdrant/qdrant
    
    docker run -p 6333:6333 -p 6334:6334 \
      -v "$(pwd)/qdrant_storage:/qdrant/storage:z" \
      qdrant/qdrant
  3. Create a Python environment

    For this tutorial, you only need the official Ollama package and Qdrant’s Python client.

    python3 -m venv .venv
    source .venv/bin/activate
    pip install ollama qdrant-client
Pro tip: Use one embedding model for both indexing and querying. Ollama’s embeddings docs call this out explicitly, and mixing models is one of the fastest ways to degrade retrieval quality.

Step 2: Index Private Docs

Now build a small ingestion script. The core idea is straightforward: split documents into chunks, generate vectors with ollama.embed(), create a Qdrant collection sized to that embedding length, then upsert() the points.

from uuid import uuid4

import ollama
from qdrant_client import QdrantClient, models

COLLECTION = "private_docs"
EMBED_MODEL = "embeddinggemma"
CHAT_MODEL = "llama3"

DOCUMENTS = [
    {
        "source": "payments-runbook",
        "text": """Restart the payments worker only after draining the queue.

Confirm there are no stuck jobs older than 10 minutes.

If retries spike after restart, roll back the last worker config change first.""",
    },
    {
        "source": "incident-handbook",
        "text": """Customer-visible incidents require a status page update within 15 minutes.

Log the mitigation step before changing infra so the next responder can follow the timeline.""",
    },
]


def chunk_documents(documents):
    chunks = []
    for doc in documents:
        paragraphs = [p.strip() for p in doc["text"].split("\n\n") if p.strip()]
        for index, paragraph in enumerate(paragraphs):
            chunks.append(
                {
                    "id": str(uuid4()),
                    "source": doc["source"],
                    "chunk": index,
                    "text": paragraph,
                }
            )
    return chunks


def embed_texts(texts):
    response = ollama.embed(model=EMBED_MODEL, input=texts)
    return response["embeddings"]


client = QdrantClient(url="http://localhost:6333")
chunks = chunk_documents(DOCUMENTS)
vectors = embed_texts([chunk["text"] for chunk in chunks])
vector_size = len(vectors[0])

if not client.collection_exists(collection_name=COLLECTION):
    client.create_collection(
        collection_name=COLLECTION,
        vectors_config=models.VectorParams(
            size=vector_size,
            distance=models.Distance.COSINE,
        ),
    )

points = [
    models.PointStruct(
        id=chunk["id"],
        vector=vector,
        payload={
            "source": chunk["source"],
            "chunk": chunk["chunk"],
            "text": chunk["text"],
        },
    )
    for chunk, vector in zip(chunks, vectors)
]

client.upsert(collection_name=COLLECTION, points=points)
print(f"Indexed {len(points)} chunks into {COLLECTION}")

Three implementation details matter here:

  • Paragraph chunking is enough to prove the pattern before you add token-aware chunking.
  • Cosine distance is the right default because Ollama’s /api/embed returns L2-normalized vectors.
  • Collection size comes from the live embedding output, so you do not hardcode a dimension and get it wrong later.
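A quick way to convince yourself of the cosine point: for unit-length vectors, cosine similarity reduces to a plain dot product, so normalized embeddings make cosine distance cheap to evaluate. A stdlib-only check:

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [1.0, 2.0]
na, nb = l2_normalize(a), l2_normalize(b)
# For normalized vectors, cosine similarity and dot product coincide.
assert abs(cosine(a, b) - dot(na, nb)) < 1e-12
```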

Step 3: Query and Answer

Once the vectors are in Qdrant, retrieval is a dense-vector nearest-neighbor query. The answer stage is just a grounded chat call that includes the retrieved chunks as context.

def answer_question(question, limit=3):
    query_vector = ollama.embed(model=EMBED_MODEL, input=question)["embeddings"][0]

    hits = client.query_points(
        collection_name=COLLECTION,
        query=query_vector,
        limit=limit,
    ).points

    context = "\n\n".join(
        f"[{hit.payload['source']}#{hit.payload['chunk']}] {hit.payload['text']}"
        for hit in hits
    )

    response = ollama.chat(
        model=CHAT_MODEL,
        messages=[
            {
                "role": "system",
                "content": "Answer only from the supplied context. If the answer is missing, say that the context does not contain it.",
            },
            {
                "role": "user",
                "content": f"Question: {question}\n\nContext:\n{context}",
            },
        ],
    )

    return hits, response["message"]["content"]


question = "What should an on-call engineer do before restarting the payments worker?"
hits, answer = answer_question(question)

print("Top matches:")
for hit in hits:
    print(f"- {hit.payload['source']}#{hit.payload['chunk']} score={hit.score:.3f}")

print("\nAnswer:\n")
print(answer)

The main design choice is intentional: llama3 never sees the whole corpus. It only sees the top retrieved chunks. That gives you smaller prompts, lower latency, and much cleaner failure analysis when answers drift.
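One practical consequence: because the prompt is just the question plus the top-k chunks, you can log exactly what the model saw. A minimal trace record makes drift debugging concrete; the record shape and `build_trace` helper below are illustrative, not part of the tutorial scripts.

```python
import json


def build_trace(question, hits, answer):
    """Capture the question, the exact retrieved context, and the answer.

    hits is a list of (source, chunk_index, score, text) tuples,
    mirroring the payload fields stored in Qdrant.
    """
    return {
        "question": question,
        "context": [
            {"source": s, "chunk": i, "score": round(score, 3), "text": t}
            for s, i, score, t in hits
        ],
        "answer": answer,
    }


trace = build_trace(
    "What should happen before a worker restart?",
    [("payments-runbook", 0, 0.812,
      "Restart the payments worker only after draining the queue.")],
    "Drain the queue first.",
)
print(json.dumps(trace, indent=2))
```

When an answer drifts, the trace tells you immediately whether retrieval fetched the wrong chunks or the model ignored the right ones.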

Verification / Expected Output

Run the indexing block first, then run the query block. A healthy local setup should show all of the following:

  • The script prints that chunks were indexed into private_docs.
  • The query returns a short list of top matches from the correct source document.
  • The final answer references draining the queue before restart, which is the relevant chunk in the sample corpus.

Representative output looks like this:

Indexed 5 chunks into private_docs
Top matches:
- payments-runbook#0 score=0.812
- payments-runbook#1 score=0.774
- incident-handbook#1 score=0.521

Answer:
Before restarting the payments worker, the on-call engineer should drain the queue first.
The context also says to confirm there are no stuck jobs older than 10 minutes.

If you want a faster sanity check, hit Qdrant directly with curl http://localhost:6333/collections. A valid response should include your private_docs collection.

Troubleshooting and What's Next

Troubleshooting Top 3

  1. Connection refused on port 11434
    Ollama is not running. Start the desktop app or run ollama serve, then retry the Python script.
  2. Vector size mismatch in Qdrant
    You created the collection with one embedding model and queried with another. Recreate the collection or keep the same embedding model for both indexing and search.
  3. Answers sound plausible but miss the exact policy
    Your chunks are too broad or your prompt is too soft. Reduce chunk size, increase retrieval limit slightly, and keep the system instruction explicit: answer only from supplied context.
Watch out: A private RAG pipeline is only as private as the data you feed it. Model inference can stay local, but raw source files can still leak through logs, copied prompts, or demo datasets if you do not treat ingestion as a security boundary.

What’s next

  • Add metadata filters in Qdrant so teams only search the docs they are allowed to see.
  • Replace paragraph splitting with token-aware chunking and overlap.
  • Store document IDs, timestamps, and owners in payload for auditability.
  • Add a reranking stage if your corpus grows and top-k retrieval gets noisy.
  • Snapshot the Qdrant volume so your local index is recoverable.
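To make the chunking upgrade concrete, here is a sliding-window sketch of chunking with overlap. It counts whitespace words rather than model tokens, which is a simplification; a real token-aware version would swap in an actual tokenizer, but the window-and-step logic is the same.

```python
def chunk_with_overlap(text, max_words=40, overlap=10):
    """Split text into word windows where consecutive chunks share
    `overlap` words, so no sentence is stranded at a hard cut."""
    words = text.split()
    if len(words) <= max_words:
        return [" ".join(words)]
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        window = words[start:start + max_words]
        chunks.append(" ".join(window))
        if start + max_words >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(100))
chunks = chunk_with_overlap(sample, max_words=40, overlap=10)
# Each chunk starts with the last `overlap` words of the previous one.
assert chunks[0].split()[-10:] == chunks[1].split()[:10]
```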

That is the practical baseline for private RAG in 2026: local embeddings, local vector search, local answer synthesis, and a thin Python layer you can actually inspect. Start here, then harden chunking, authorization, and observability once the retrieval quality is stable.

Frequently Asked Questions

Can I use Llama 3 for embeddings in this RAG stack?
You can force a chat model into embedding-like work, but you should not. Ollama’s current embeddings guidance recommends dedicated models such as embeddinggemma, and using the same embedding model for both indexing and querying keeps vector space behavior consistent.
Why does Qdrant throw a vector size mismatch error?
A Qdrant collection is created with a fixed vector dimension. If you indexed with one embedding model and later query or upsert with another model that returns a different length, Qdrant will reject the request until you recreate the collection or switch back to the original model.
Does this private RAG setup keep data on my machine?
For local models such as llama3 and embeddinggemma, inference runs through Ollama on localhost after the initial model download. Qdrant is also self-hosted here, so vectors and payload stay in your local container volume unless you explicitly export or sync them elsewhere.
Do I need Docker to use Qdrant for local development?
Not strictly. Qdrant’s docs also mention a Python client :memory: mode for lightweight experiments, but Docker is the better tutorial baseline because it behaves more like a real persisted service and keeps your vectors between runs.
