Vector Sharding with Qdrant on Kubernetes: A 2026 Guide
Bottom Line
On self-hosted Kubernetes, Qdrant scales semantic search well, but shard planning is not automatic. If you choose the right shard count up front and verify distribution continuously, you avoid the two most expensive mistakes: hot nodes and forced collection rebuilds.
Key Takeaways
- Use a 3-node StatefulSet for resilient self-hosted Qdrant clusters.
- Set shard_number as a multiple of node count; Qdrant recommends at least 2 shards per node.
- Enable distributed mode with config.cluster.enabled: true; peer traffic uses 6335.
- Self-hosted scale-out does not rebalance shards automatically; use the cluster API to move shards.
Vector search usually fails to scale for boring reasons: one node gets too hot, shard counts are chosen once and regretted forever, and cluster expansion happens without verifying where data actually lives. Qdrant gives you the building blocks to avoid that, and Kubernetes gives you stable identities and storage through StatefulSets. The trick is understanding that distributed mode, replication, and shard movement are related but separate controls. This walkthrough shows the setup that holds up under real growth.
Prerequisites and shard plan
Prerequisites
- A working Kubernetes cluster with dynamic persistent volumes.
- kubectl and helm installed locally.
- Network access between Qdrant peers on 6335 and client/API access on 6333.
- A namespace where you can run a StatefulSet with at least three replicas.
- A shard plan before first ingest, because shard_number cannot be changed later in self-hosted open source deployments without recreating the collection.
Bottom Line
Treat shard count as a capacity decision, not a tuning afterthought. In self-hosted Qdrant on Kubernetes, scale-out is straightforward, but even distribution still depends on you.
As of May 01, 2026, the public Qdrant Helm chart site lists chart 1.17.1, which deploys Qdrant v1.17.1. Qdrant’s distributed deployment guide also recommends using a shard count that is a multiple of your node count, and at least 2 shards per node if you want headroom for future expansion.
- For a 3-node cluster, start with 6 or 12 shards.
- Use replication_factor: 2 when you want one extra copy without tripling storage.
- Use replication_factor: 3 when read availability matters more than storage cost.
- Remember that scaling the StatefulSet is not the same thing as redistributing existing shards.
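The arithmetic behind these recommendations is simple enough to script before first ingest. A minimal sketch; the plan_shards helper is illustrative, not part of any Qdrant API:

```python
# Back-of-the-envelope shard planning for a self-hosted Qdrant cluster.
# Follows the guideline above: shard_number is a multiple of node count,
# with at least 2 shards per node for expansion headroom.

def plan_shards(nodes: int, shards_per_node: int = 2, replication_factor: int = 2):
    shard_number = nodes * shards_per_node            # logical partitions
    total_copies = shard_number * replication_factor  # physical shard replicas
    copies_per_node = total_copies // nodes           # even-distribution target
    return shard_number, total_copies, copies_per_node

shard_number, total_copies, per_node = plan_shards(nodes=3)
print(shard_number, total_copies, per_node)  # 6 12 4
```

For a 3-node cluster with replication_factor 2, this yields 6 logical shards and 12 physical copies, 4 per node, which is the layout the rest of this guide assumes.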
Step 1: Deploy a distributed cluster
The official Helm chart already uses a StatefulSet, which is exactly what you want for stable pod names and persistent storage. The minimum changes for a distributed cluster are increasing replicaCount and enabling config.cluster.enabled.
Create the values file
replicaCount: 3
config:
cluster:
enabled: true
persistence:
size: 100Gi
If you want to clean up the YAML before committing it, run it through TechBytes’ Code Formatter. That is especially useful once you start layering storage classes, resource requests, and affinity rules into the same values file.
Install the chart
helm repo add qdrant https://qdrant.github.io/qdrant-helm
helm repo update
helm upgrade -i qdrant qdrant/qdrant \
--namespace vector-search \
--create-namespace \
-f values.yaml
Wait until all pods are ready:
kubectl get pods -n vector-search -w
Expected state:
- Three Qdrant pods become Running and READY 1/1.
- The workload type is a StatefulSet, not a stateless Deployment.
- Persistent volume claims are bound for each replica.
Step 2: Create a sharded collection
Now create a collection with shard math that matches the cluster you just deployed. For a 3-node cluster, 6 shards is a practical starting point because it spreads work evenly and still leaves room for moderate growth.
Forward the API locally
kubectl port-forward -n vector-search svc/qdrant 6333:6333
Create the collection
curl -X PUT http://localhost:6333/collections/docs \
-H 'Content-Type: application/json' \
-d '{
"vectors": {
"size": 4,
"distance": "Cosine"
},
"shard_number": 6,
"replication_factor": 2
}'
This example uses a tiny 4-dimensional vector just to verify the cluster mechanics quickly. In production, your vector size must match the embedding model you actually use. The important part here is the relationship between shard_number, node count, and replication.
- shard_number controls how many logical partitions the collection has.
- replication_factor controls how many copies of each shard exist.
- More shards improve distribution flexibility, but too many add overhead.
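If you prefer scripted collection creation over raw curl, the same request can be issued from Python's standard library. A sketch assuming the port-forward from the previous step is active; collection_body and create_collection are hypothetical helper names, but the endpoint and body match the curl call above:

```python
import json
import urllib.request

# Collection layout mirroring the curl example: 6 shards, 2 copies of each.
# The 4-dimensional vector size is a toy value; match your embedding model
# in production.
def collection_body(shard_number: int = 6, replication_factor: int = 2) -> dict:
    return {
        "vectors": {"size": 4, "distance": "Cosine"},
        "shard_number": shard_number,
        "replication_factor": replication_factor,
    }

def create_collection(name: str, base: str = "http://localhost:6333") -> int:
    # PUT /collections/{name} is the collection-creation endpoint;
    # the URL assumes the kubectl port-forward shown above.
    req = urllib.request.Request(
        f"{base}/collections/{name}",
        data=json.dumps(collection_body()).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# create_collection("docs") would issue the same request as the curl above.
print(collection_body()["shard_number"])  # 6
```

Keeping the body in a function makes it easy to review shard_number and replication_factor in code review instead of improvising them at the terminal.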
Step 3: Load data and run queries
Before you benchmark anything, prove that writes, reads, and fan-out search work across the cluster. Use a small synthetic dataset first. If you are testing with real payloads that include customer or document metadata, scrub them before sharing or debugging with TechBytes’ Data Masking Tool.
Upsert sample points
curl -X PUT 'http://localhost:6333/collections/docs/points?wait=true' \
-H 'Content-Type: application/json' \
-d '{
"points": [
{
"id": 1,
"vector": [0.91, 0.05, 0.02, 0.02],
"payload": {"topic": "kubernetes", "tier": "infra"}
},
{
"id": 2,
"vector": [0.10, 0.80, 0.05, 0.05],
"payload": {"topic": "qdrant", "tier": "infra"}
},
{
"id": 3,
"vector": [0.12, 0.10, 0.70, 0.08],
"payload": {"topic": "security", "tier": "platform"}
}
]
}'
The official API returns an acknowledged operation when the write succeeds. Then issue a query through the newer universal query endpoint instead of the older, deprecated search route.
Run a nearest-neighbor query
curl -X POST http://localhost:6333/collections/docs/points/query \
-H 'Content-Type: application/json' \
-d '{
"query": [0.88, 0.06, 0.03, 0.03],
"limit": 2,
"with_payload": true
}'Expected result:
- A top match with id close to 1.
- A response object containing status: ok.
- Points returned under result.points, which confirms you are using the current query API.
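You can verify that expected ranking offline: with Cosine distance, the query vector above is closest to point 1. A small pure-Python check, no cluster required:

```python
import math

# Sample points and query vector from the steps above.
points = {
    1: [0.91, 0.05, 0.02, 0.02],
    2: [0.10, 0.80, 0.05, 0.05],
    3: [0.12, 0.10, 0.70, 0.08],
}
query = [0.88, 0.06, 0.03, 0.03]

def cosine(a, b):
    # Cosine similarity: dot product over the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

ranked = sorted(points, key=lambda pid: cosine(query, points[pid]), reverse=True)
print(ranked[0])  # 1, matching the expected top match from the query endpoint
```

If the live query returns a different top id, suspect the data you loaded rather than the cluster topology.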
Verify and inspect shard placement
This is the step teams skip, and it is where most scale problems start. You need to verify both cluster health and actual collection distribution.
Check cluster status
curl http://localhost:6333/cluster
Expected output:
- status: ok at the top level.
- A non-disabled cluster result once distributed mode is active.
- Peer information showing the cluster is composed of multiple peers rather than running as a single standalone node.
Inspect collection shard distribution
curl http://localhost:6333/collections/docs/cluster
Expected output:
- shard_count equals 6.
- local_shards and remote_shards entries show where shards live.
- Shard states appear as Active after the collection settles.
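To spot uneven placement quickly, count shards per peer from that response. A sketch assuming the response shape returned by GET /collections/docs/cluster, with local_shards and remote_shards arrays; the sample payload and peer IDs here are illustrative, not real output:

```python
from collections import Counter

# Illustrative fragment of a GET /collections/docs/cluster response,
# as seen from the peer with id 111.
cluster_info = {
    "peer_id": 111,
    "shard_count": 6,
    "local_shards": [
        {"shard_id": 0, "state": "Active"},
        {"shard_id": 3, "state": "Active"},
    ],
    "remote_shards": [
        {"shard_id": 1, "peer_id": 222, "state": "Active"},
        {"shard_id": 4, "peer_id": 222, "state": "Active"},
        {"shard_id": 2, "peer_id": 333, "state": "Active"},
        {"shard_id": 5, "peer_id": 333, "state": "Active"},
    ],
}

def shards_per_peer(info: dict) -> Counter:
    # Local shards belong to the responding peer; remote ones carry a peer_id.
    counts = Counter({info["peer_id"]: len(info["local_shards"])})
    for shard in info["remote_shards"]:
        counts[shard["peer_id"]] += 1
    return counts

print(dict(shards_per_peer(cluster_info)))  # two shards on each of three peers
```

Run this against real output after every scale event; a peer with zero shards is the signal to move data manually.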
If you later scale from three pods to four, do not assume the new pod receives useful work automatically. Self-hosted Qdrant exposes the cluster API for manual movement:
curl -X POST http://localhost:6333/collections/docs/cluster \
-H 'Content-Type: application/json' \
-d '{
"move_shard": {
"shard_id": 0,
"from_peer_id": 111111111,
"to_peer_id": 222222222,
"method": "stream_records"
}
}'
Use the real peer IDs from your cluster output. This is the key operational distinction: Kubernetes adds compute, but Qdrant needs explicit shard placement changes to use that compute well.
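The choice of source and destination peers can be scripted as well. A sketch with hypothetical peer IDs that emits the same move_shard body as the curl call above; pick_move and move_shard_body are illustrative helper names:

```python
def move_shard_body(shard_id: int, from_peer: int, to_peer: int) -> dict:
    # Body shape matches POST /collections/{collection}/cluster as used above.
    return {
        "move_shard": {
            "shard_id": shard_id,
            "from_peer_id": from_peer,
            "to_peer_id": to_peer,
            "method": "stream_records",
        }
    }

def pick_move(counts: dict) -> tuple:
    # Move one shard from the most-loaded peer to the least-loaded one.
    busiest = max(counts, key=counts.get)
    idlest = min(counts, key=counts.get)
    return busiest, idlest

# Hypothetical state after scale-out: peer 444 just joined and owns nothing.
counts = {111: 3, 222: 2, 444: 0}
src, dst = pick_move(counts)
body = move_shard_body(shard_id=0, from_peer=src, to_peer=dst)
print(body["move_shard"]["to_peer_id"])  # 444, the new empty peer
```

Repeat the pick-and-move loop until counts are level; each move is one POST against the collection's cluster endpoint.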
Troubleshooting and what’s next
Troubleshooting top 3
- New nodes stay mostly idle after scale-out. This is expected in self-hosted Qdrant. Scaling the StatefulSet adds peers, but existing shards are not auto-rebalanced; inspect /collections/docs/cluster and move shards explicitly.
- The collection cannot use all peers evenly. Your shard_number is probably too low or not a clean multiple of the current node count. In self-hosted deployments, the fix is usually creating a new collection and migrating data.
- Helm upgrade fails on immutable StatefulSet fields. The Helm chart documentation calls out StatefulSet immutability. If you change forbidden spec fields, Kubernetes rejects the patch; plan storage and service shape early, and use snapshots before disruptive changes.
What’s next
- Add workload-aware benchmarking so you measure query latency by payload filter and shard fan-out, not just by raw QPS.
- Introduce snapshots and restore drills before you call the cluster production-ready.
- Wrap collection creation in infrastructure code so shard_number and replication_factor are reviewed, not improvised.
- If you need automatic rebalancing or live resharding, evaluate whether Qdrant Cloud fits better than a fully self-managed Kubernetes path.
Frequently Asked Questions
How many shards should I use for Qdrant on Kubernetes?
Pick a shard_number that is a multiple of your node count, with at least 2 shards per node for headroom. For a 3-node cluster, 6 or 12 shards are practical starting points.
Does Qdrant automatically rebalance shards after I scale my Kubernetes StatefulSet?
No. In self-hosted deployments, scaling the StatefulSet adds peers but leaves existing shards where they are. Use POST /collections/{collection}/cluster operations to move shards when distribution is uneven.
Can I change shard_number after creating a Qdrant collection?
In self-hosted open source deployments, treat shard_number as effectively fixed for the lifetime of the collection. If you got the number wrong, the practical fix is usually to create a new collection with the right layout and migrate data.
Why use a Kubernetes StatefulSet instead of a Deployment for Qdrant?
A StatefulSet gives each pod a stable network identity and its own persistent volume, which Qdrant peers need for discovery and durable shard storage. A stateless Deployment offers neither guarantee.