Private RAG with Local Llama-3 and Qdrant [Deep Dive]
Bottom Line
The clean pattern is simple: use a dedicated local embedding model for retrieval, store vectors in Qdrant, and reserve Llama 3 for final answer synthesis. That gives you a private RAG loop that runs on localhost after the initial model download.
Key Takeaways
- Use llama3 for answer generation and embeddinggemma for retrieval, not one model for both jobs.
- Qdrant local setup is a single Docker run on ports 6333 and 6334.
- Collection vector size must match your embedding model output exactly, or inserts and queries will fail.
- A minimal Python RAG loop only needs the ollama and qdrant-client packages.
Private RAG is attractive for one reason: you can search internal notes, runbooks, and product docs without shipping them to a hosted LLM provider. The reliable pattern is to split the job in two. Use a local embedding model to turn text into vectors, let Qdrant handle nearest-neighbor retrieval, and use llama3 only for the final grounded answer. This walkthrough builds that stack end to end on localhost with Python.
What You'll Build
Bottom Line
For a private RAG stack, keep retrieval and generation separate: store local embeddings in Qdrant, then pass only the top matching chunks into llama3. It is simpler, more accurate, and easier to debug than trying to make one model do everything.
Prerequisites
- Python 3.11+
- Docker installed and running for local Qdrant
- Ollama installed from ollama.com/download
- Enough local RAM or VRAM to run llama3 comfortably on your machine
- A small private document set to index, preferably sanitized before demos
If you want to demo realistic internal content without exposing names, emails, or secrets, run the sample text through TechBytes’ Data Masking Tool first.
Architecture
- Ollama runs local models and exposes APIs on http://localhost:11434/api.
- embeddinggemma generates vectors for both document chunks and user questions.
- Qdrant stores vectors plus payload such as source name and chunk text.
- llama3 answers with retrieved context instead of raw memory.
This matches current Ollama guidance: the embeddings docs recommend dedicated embedding models such as embeddinggemma, qwen3-embedding, and all-minilm, and Qdrant’s local quickstart uses a Docker container exposed on 6333 and 6334.
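Once that stack is running (Step 1 below), you can poke both services with nothing but the standard library. A minimal reachability sketch, assuming the default ports and endpoints listed above:

import json
from urllib.request import urlopen

# Ollama lists pulled models at /api/tags; Qdrant lists collections at /collections.
for name, url in [
    ("Ollama", "http://localhost:11434/api/tags"),
    ("Qdrant", "http://localhost:6333/collections"),
]:
    with urlopen(url, timeout=5) as response:
        print(name, "responded:", json.load(response))

Both calls should return JSON immediately; a connection error here means the corresponding service is not up yet.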
Step 1: Start the Local Stack
Pull the local models
Ollama’s CLI supports both ollama pull and ollama run. Pull the chat model and the embedding model up front so your first query does not stall on downloads.
ollama pull llama3
ollama pull embeddinggemma

Start Qdrant locally
Qdrant’s official quickstart uses the container below. The volume mount persists your vectors between runs.
docker pull qdrant/qdrant
docker run -p 6333:6333 -p 6334:6334 \
  -v "$(pwd)/qdrant_storage:/qdrant/storage:z" \
  qdrant/qdrant

Create a Python environment
For this tutorial, you only need the official Ollama package and Qdrant’s Python client.
python3 -m venv .venv
source .venv/bin/activate
pip install ollama qdrant-client
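Before writing the ingestion script, a quick smoke test with the two installed packages catches most environment problems early. A minimal sketch, assuming embeddinggemma is already pulled and Qdrant is on its default port:

import ollama
from qdrant_client import QdrantClient

# Confirm the embedding model responds and note its output dimension;
# this is the size the Qdrant collection will need to match.
vector = ollama.embed(model="embeddinggemma", input="smoke test")["embeddings"][0]
print(f"embedding dimension: {len(vector)}")

# Confirm Qdrant is reachable and list any existing collections.
client = QdrantClient(url="http://localhost:6333")
print(client.get_collections())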
Step 2: Index Private Docs
Now build a small ingestion script. The core idea is straightforward: split documents into chunks, generate vectors with ollama.embed(), create a Qdrant collection sized to that embedding length, then upsert() the points.
from uuid import uuid4
import ollama
from qdrant_client import QdrantClient, models
COLLECTION = "private_docs"
EMBED_MODEL = "embeddinggemma"
CHAT_MODEL = "llama3"
DOCUMENTS = [
    {
        "source": "payments-runbook",
        "text": """Restart the payments worker only after draining the queue.

Confirm there are no stuck jobs older than 10 minutes.

If retries spike after restart, roll back the last worker config change first.""",
    },
    {
        "source": "incident-handbook",
        "text": """Customer-visible incidents require a status page update within 15 minutes.

Log the mitigation step before changing infra so the next responder can follow the timeline.""",
    },
]
def chunk_documents(documents):
    chunks = []
    for doc in documents:
        paragraphs = [p.strip() for p in doc["text"].split("\n\n") if p.strip()]
        for index, paragraph in enumerate(paragraphs):
            chunks.append(
                {
                    "id": str(uuid4()),
                    "source": doc["source"],
                    "chunk": index,
                    "text": paragraph,
                }
            )
    return chunks

def embed_texts(texts):
    response = ollama.embed(model=EMBED_MODEL, input=texts)
    return response["embeddings"]
client = QdrantClient(url="http://localhost:6333")
chunks = chunk_documents(DOCUMENTS)
vectors = embed_texts([chunk["text"] for chunk in chunks])
vector_size = len(vectors[0])
if not client.collection_exists(collection_name=COLLECTION):
    client.create_collection(
        collection_name=COLLECTION,
        vectors_config=models.VectorParams(
            size=vector_size,
            distance=models.Distance.COSINE,
        ),
    )

points = [
    models.PointStruct(
        id=chunk["id"],
        vector=vector,
        payload={
            "source": chunk["source"],
            "chunk": chunk["chunk"],
            "text": chunk["text"],
        },
    )
    for chunk, vector in zip(chunks, vectors)
]

client.upsert(collection_name=COLLECTION, points=points)
print(f"Indexed {len(points)} chunks into {COLLECTION}")

Three implementation details matter here:
- Paragraph chunking is enough to prove the pattern before you add token-aware chunking (see the sketch after this list).
- Cosine distance is the right default because Ollama’s /api/embed returns L2-normalized vectors.
- Collection size comes from the live embedding output, so you do not hardcode a dimension and get it wrong later.
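When you do move past paragraph splitting, a sliding window with overlap is the usual next step. A minimal sketch that uses whitespace-separated words as a rough stand-in for tokens; the window and overlap sizes are illustrative, not tuned:

def window_chunks(text, window=120, overlap=20):
    # Whitespace words approximate tokens; swap in a real tokenizer
    # if you need exact token budgets.
    words = text.split()
    if not words:
        return []
    step = window - overlap
    return [
        " ".join(words[start:start + window])
        for start in range(0, max(len(words) - overlap, 1), step)
    ]

Each chunk shares overlap words with its neighbor, so a policy sentence that straddles a window boundary still lands intact in at least one chunk.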
Step 3: Query and Answer
Once the vectors are in Qdrant, retrieval is a dense-vector nearest-neighbor query. The answer stage is just a grounded chat call that includes the retrieved chunks as context.
def answer_question(question, limit=3):
    query_vector = ollama.embed(model=EMBED_MODEL, input=question)["embeddings"][0]
    hits = client.query_points(
        collection_name=COLLECTION,
        query=query_vector,
        limit=limit,
    ).points
    context = "\n\n".join(
        f"[{hit.payload['source']}#{hit.payload['chunk']}] {hit.payload['text']}"
        for hit in hits
    )
    response = ollama.chat(
        model=CHAT_MODEL,
        messages=[
            {
                "role": "system",
                "content": "Answer only from the supplied context. If the answer is missing, say that the context does not contain it.",
            },
            {
                "role": "user",
                "content": f"Question: {question}\n\nContext:\n{context}",
            },
        ],
    )
    return hits, response["message"]["content"]
question = "What should an on-call engineer do before restarting the payments worker?"
hits, answer = answer_question(question)
print("Top matches:")
for hit in hits:
    print(f"- {hit.payload['source']}#{hit.payload['chunk']} score={hit.score:.3f}")

print("\nAnswer:\n")
print(answer)

The main design choice is intentional: llama3 never sees the whole corpus. It only sees the top retrieved chunks. That gives you smaller prompts, lower latency, and much cleaner failure analysis when answers drift.
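One way to make that drift visible instead of silent is to refuse when retrieval confidence is low. A minimal sketch, assuming a cosine-score threshold you would tune against your own corpus (0.4 here is purely illustrative):

MIN_SCORE = 0.4  # illustrative; tune against your own corpus

def answer_with_guard(question, limit=3):
    query_vector = ollama.embed(model=EMBED_MODEL, input=question)["embeddings"][0]
    hits = client.query_points(
        collection_name=COLLECTION,
        query=query_vector,
        limit=limit,
    ).points
    # A weak top score usually means the corpus has nothing relevant;
    # refusing here is cheaper than letting llama3 improvise.
    if not hits or hits[0].score < MIN_SCORE:
        return hits, "The indexed documents do not cover this question."
    return hits, answer_question(question, limit=limit)[1]

Note this sketch re-runs the embedding inside answer_question; fold the two together once you settle on a threshold.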
Verification / Expected Output
Run the indexing block first, then run the query block. A healthy local setup should show all of the following:
- The script prints that chunks were indexed into private_docs.
- The query returns a short list of top matches from the correct source document.
- The final answer references draining the queue before restart, which is the relevant chunk in the sample corpus.
Representative output looks like this:
Indexed 5 chunks into private_docs
Top matches:
- payments-runbook#0 score=0.812
- payments-runbook#1 score=0.774
- incident-handbook#1 score=0.521
Answer:
Before restarting the payments worker, the on-call engineer should drain the queue first.
The context also says to confirm there are no stuck jobs older than 10 minutes.

If you want a faster sanity check, hit Qdrant directly with curl http://localhost:6333/collections. A valid response should include your private_docs collection.
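If you would rather stay in Python, the client can run the same check; a minimal sketch, assuming the indexing script above has already run:

# The exact count should match the "Indexed N chunks" line from ingestion.
result = client.count(collection_name=COLLECTION, exact=True)
print(f"{COLLECTION} holds {result.count} points")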
Troubleshooting and What's Next
Troubleshooting Top 3
- Connection refused on port 11434: Ollama is not running. Start the desktop app or run ollama serve, then retry the Python script.
- Vector size mismatch in Qdrant: you created the collection with one embedding model and queried with another. Recreate the collection or keep the same embedding model for both indexing and search (see the dimension check after this list).
- Answers sound plausible but miss the exact policy: your chunks are too broad or your prompt is too soft. Reduce chunk size, increase the retrieval limit slightly, and keep the system instruction explicit: answer only from supplied context.
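For the vector size mismatch in particular, you can compare the live collection config against the embedding model before querying. A minimal sketch, assuming the single unnamed vector configured in Step 2:

# Read the dimension Qdrant was created with and compare it to the model's.
info = client.get_collection(collection_name=COLLECTION)
collection_size = info.config.params.vectors.size
model_size = len(ollama.embed(model=EMBED_MODEL, input="probe")["embeddings"][0])
if collection_size != model_size:
    raise SystemExit(f"Dimension mismatch: collection={collection_size}, model={model_size}")
print(f"Dimensions match: {collection_size}")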
What’s next
- Add metadata filters in Qdrant so teams only search the docs they are allowed to see (see the filter sketch after this list).
- Replace paragraph splitting with token-aware chunking and overlap.
- Store document IDs, timestamps, and owners in payload for auditability.
- Add a reranking stage if your corpus grows and top-k retrieval gets noisy.
- Snapshot the Qdrant volume so your local index is recoverable.
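As an example of the first item, payload filters drop straight into the same query call. A minimal sketch that restricts search to a single source, reusing query_vector from Step 3; the allowed_source value is illustrative and would come from your authorization layer:

from qdrant_client import models

allowed_source = "payments-runbook"  # illustrative; derive from the user's permissions

hits = client.query_points(
    collection_name=COLLECTION,
    query=query_vector,
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="source",
                match=models.MatchValue(value=allowed_source),
            )
        ]
    ),
    limit=3,
).points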
That is the practical baseline for private RAG in 2026: local embeddings, local vector search, local answer synthesis, and a thin Python layer you can actually inspect. Start here, then harden chunking, authorization, and observability once the retrieval quality is stable.
Frequently Asked Questions
Can I use Llama 3 for embeddings in this RAG stack?
This stack deliberately avoids it. Ollama's embeddings docs recommend dedicated embedding models such as embeddinggemma, qwen3-embedding, and all-minilm, so llama3 is reserved for answer synthesis while embeddinggemma handles retrieval.

Why does Qdrant throw a vector size mismatch error?
The collection was created with one vector dimension and you inserted or queried vectors of a different length, usually because the embedding model changed between indexing and search. Recreate the collection, or keep the same embedding model for both jobs.

Does this private RAG setup keep data on my machine?
Yes. After the initial model and container downloads, Ollama, Qdrant, and the Python loop all run on localhost, so your documents and queries are never sent to a hosted provider.

Do I need Docker to use Qdrant for local development?
Not strictly. Qdrant's Python client also supports :memory: mode for lightweight experiments, but Docker is the better tutorial baseline because it behaves more like a real persisted service and keeps your vectors between runs.
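If you do skip Docker for a quick experiment, the in-memory client is a one-line swap; a minimal sketch (nothing persists after the process exits):

from qdrant_client import QdrantClient

# Everything lives in process memory; the index disappears when the script exits.
client = QdrantClient(":memory:")

The rest of the tutorial code works unchanged against this client.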