HNSW Index Optimization for Billion-Scale Search [2026]
Bottom Line
At billion scale, HNSW is rarely limited by raw search math alone. The real optimization work is balancing graph density, candidate breadth, filter behavior, and memory compression so recall stays predictable under production load.
Key Takeaways
- M, efConstruction, and efSearch control most of the recall-latency-memory tradeoff.
- HNSW scales like a graph system, not just an ANN algorithm: build cost, deletes, merges, and filters matter.
- Scalar quantization can cut vector memory by 4x with usually under 1% error in Qdrant’s tests.
- For strict filters, graph connectivity can degrade; filter-aware edges and rescoring become first-class tuning tools.
- Billion-scale success depends on benchmark discipline: track Recall@K, p95/p99 latency, bytes per vector, and freshness lag together.
A well-tuned HNSW index is still the default answer for high-recall vector retrieval because it replaces linear scans with layered graph traversal that typically scales logarithmically with collection size. But at billion-scale, the easy story ends. The engineering challenge shifts from “which ANN algorithm?” to “how dense can the graph be, how much recall can we buy per byte, and what breaks when filters, deletes, and fresh writes hit the same system?”
The Lead
For billion-scale vector search, HNSW wins when you treat it as an operational graph system, not a one-time ANN library choice. The biggest gains usually come from memory-aware graph sizing, compression, and filter-safe query planning, not from blindly raising efSearch.
Why HNSW still dominates
The original HNSW paper describes a multilayer proximity graph where upper layers are sparse and lower layers are dense, letting search start with long jumps and finish with local refinement. That hierarchy is what makes the structure practical at large scale: high layers skip most of the search space, while the bottom layer recovers quality. Weaviate’s documentation summarizes the operational consequence cleanly: a flat index grows with linear search cost, while HNSW scales with logarithmic query complexity on large collections.
- Flat search is simple and memory-light, but latency grows with every added vector.
- HNSW spends memory on graph links so query latency grows much more slowly.
- The trade is front-loaded: more memory, more build work, and more operational care in exchange for stable search performance.
Where teams misread the problem
The common failure mode is to frame HNSW as a pure query-time optimization. At tens of millions of vectors, that is still survivable. At hundreds of millions or billions, the dominant questions are different.
- How many edges per node can you afford in RAM?
- Can your write path tolerate higher efConstruction during ingestion?
- Do filtered queries preserve graph connectivity, or do they strand the search in disconnected neighborhoods?
- Can you compress vectors enough to keep the graph hot without destroying final recall?
Architecture & Implementation
The three knobs that dominate outcomes
Most systems expose some variation of the same control surface. In Faiss, OpenSearch 2.19, Qdrant, and hnswlib, the tuning story repeatedly collapses to three parameters.
- M: graph degree. Higher values create more links, which improves navigability and recall, but increases memory use significantly.
- efConstruction: build-time candidate breadth. Higher values improve graph quality and usually recall, but slow indexing.
- efSearch: query-time candidate breadth. Higher values improve recall, but raise latency and CPU cost.
This is why naive scaling fails. Raising efSearch can rescue mediocre recall for a while, but it cannot fully compensate for an under-connected graph. If M is too small for your intrinsic data geometry, search spends more work exploring a bad neighborhood rather than a larger useful one.
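To make the candidate-breadth idea concrete, here is a minimal pure-Python sketch of greedy beam search over a proximity graph, in the spirit of the HNSW layer-search routine. It is a toy single-layer k-NN graph, not a real multilayer HNSW implementation; the point is only how `ef` bounds the candidate set and therefore trades CPU for recall.

```python
import heapq
import math
import random

def beam_search(graph, vectors, entry, query, ef):
    """Greedy beam search over a proximity graph. `ef` bounds the result
    set: larger values explore more of the graph before stopping."""
    visited = {entry}
    d0 = math.dist(vectors[entry], query)
    candidates = [(d0, entry)]   # min-heap: closest unexpanded node first
    best = [(-d0, entry)]        # max-heap (negated): the ef best found so far
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(best) >= ef and d > -best[0][0]:
            break                # nothing left that can improve the result set
        for neighbor in graph[node]:
            if neighbor in visited:
                continue
            visited.add(neighbor)
            nd = math.dist(vectors[neighbor], query)
            if len(best) < ef or nd < -best[0][0]:
                heapq.heappush(candidates, (nd, neighbor))
                heapq.heappush(best, (-nd, neighbor))
                if len(best) > ef:
                    heapq.heappop(best)   # evict the current worst hit
    return sorted((-negd, node) for negd, node in best)

# Toy setup: 200 random 8-dim points, each linked to its 8 nearest neighbors.
random.seed(7)
pts = [[random.random() for _ in range(8)] for _ in range(200)]
graph = {i: sorted(range(200), key=lambda j: math.dist(pts[i], pts[j]))[1:9]
         for i in range(200)}
hits = beam_search(graph, pts, entry=0, query=[0.5] * 8, ef=32)
```

Note how an under-connected graph (too few links per node) limits what any `ef` value can reach: the beam can only widen over edges that exist, which is the intuition behind "efSearch cannot fully compensate for a too-small M."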
The real cost model
Faiss documents IndexHNSWFlat as vector storage plus graph-link overhead that grows with M. That sounds obvious until you price it at billion scale. Even a modest increase in graph degree can translate into hundreds of gigabytes of extra resident memory once you include vectors, adjacency structures, shard replication, and query buffers.
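A back-of-envelope version of that pricing exercise, using Faiss's documented rough estimate of (d * 4 + M * 2 * 4) bytes per vector for IndexHNSWFlat; the replica multiplier is an added assumption, and real deployments also pay for query buffers and metadata on top of this.

```python
def hnsw_bytes_per_vector(dim, M):
    """Faiss's rough IndexHNSWFlat estimate: float32 storage plus
    about 2*M four-byte neighbor ids at the base layer."""
    return dim * 4 + M * 2 * 4

def cluster_ram_gb(n_vectors, dim, M, replicas=1):
    # Replication is an assumption layered on the Faiss estimate.
    return n_vectors * hnsw_bytes_per_vector(dim, M) * replicas / 1e9

# 1B 768-dim vectors: raising M from 16 to 32 adds ~128 GB per replica.
delta = cluster_ram_gb(10**9, 768, 32) - cluster_ram_gb(10**9, 768, 16)
```

With three replicas, that single parameter change is already roughly 384 GB of extra resident memory, which is why graph degree is priced before anything else.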
- Memory is the first constraint, not an afterthought.
- Index build throughput matters because rebuilding dense graphs is expensive.
- Delete behavior matters because not every implementation supports cheap removal.
That last point is easy to miss. Faiss notes that HNSW does not support removing vectors because doing so destroys the graph structure. By contrast, Weaviate highlights a custom HNSW implementation with full CRUD support. That is a product-level choice with real architectural consequences: mutable graphs are more operationally flexible, but the system has to pay for that flexibility in maintenance logic, compaction, and repair paths.
Compression is not optional anymore
At billion scale, pure HNSW Flat is often too expensive unless the application can justify premium memory footprints. The practical move is to combine graph navigation with compressed vector storage and selective rescoring.
- Faiss supports IndexHNSWSQ, IndexHNSWPQ, and IndexHNSW2Level.
- OpenSearch supports HNSW with Faiss encoders and query-time rescoring via oversample_factor.
- Qdrant documents scalar quantization, product quantization, and binary quantization as memory/performance levers.
Qdrant gives one of the clearest published numbers here: float32 -> uint8 scalar quantization cuts vector memory by 4x, and its docs say the resulting error is usually below 1% in their experiments. For compatible models, its binary quantization path can reduce memory by up to 32x and deliver up to a 40x speedup. Those are not universal guarantees, but they show why compression has become a first-class architectural tool rather than a last-mile tweak.
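A toy reproduction of the scalar-quantization idea shows where both numbers come from: one byte per dimension instead of four gives the 4x memory cut, and on well-behaved data the reconstruction error is tiny. This is a self-contained sketch, not Qdrant's actual encoder.

```python
import math
import random

def sq_encode(vec):
    """Per-vector min/max scalar quantization: float32 -> uint8 codes."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255 or 1e-12   # guard against constant vectors
    codes = bytes(round((x - lo) / scale) for x in vec)
    return codes, lo, scale

def sq_decode(codes, lo, scale):
    return [lo + c * scale for c in codes]

random.seed(0)
dim = 128
vecs = [[random.random() for _ in range(dim)] for _ in range(100)]
errs = []
for v in vecs:
    codes, lo, scale = sq_encode(v)
    approx = sq_decode(codes, lo, scale)
    errs.append(math.dist(v, approx) / math.dist(v, [0.0] * dim))
mean_rel_err = sum(errs) / len(errs)   # relative L2 reconstruction error
```

On this synthetic data the mean relative error lands well under 1%, consistent with the Qdrant figure; skewed or outlier-heavy embeddings will do worse, which is exactly why rescoring with full-precision vectors stays in the loop.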
OpenSearch expresses this division of labor directly: graph parameters are fixed at index time, while candidate breadth and rescoring are query-time decisions. A representative configuration sketch (YAML-style summary, not a literal request body):

```yaml
method:
  name: hnsw
  engine: faiss
  parameters:
    m: 16
    ef_construction: 256
query:
  k: 50
  method_parameters:
    ef_search: 128
  rescore:
    oversample_factor: 2.0
```

The pattern is simple: use the graph to find promising candidates fast, then spend precise distance work only where it changes rank quality.
Filters change everything
Unfiltered nearest-neighbor search is the clean demo case. Production retrieval rarely looks like that. Tenant isolation, access control, catalog constraints, and recency windows all introduce filters that can fracture graph traversal.
Qdrant is unusually explicit about this. Its docs warn that strict filters can make the HNSW graph “fall apart,” and describe filterable graph extensions that add extra edges based on indexed payload fields. The same docs recommend creating payload indexes before data ingestion so those extra edges can be built up front. That is a subtle but important systems lesson: filtered vector search is not just “vector search plus a WHERE clause.” It often requires a different graph.
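When the index does not support filter-aware edges, the common fallback is post-filtering with oversampling: over-fetch from the graph, drop hits the predicate rejects, and widen the fetch if too few survive. A sketch of that loop, where `search_fn(query, n)` stands in for an ANN call returning (id, distance) pairs sorted by distance (an assumed interface, not a real library API):

```python
def filtered_search(search_fn, query, k, predicate, oversample=4, max_rounds=3):
    """Post-filtering sketch: over-fetch, filter, and retry wider
    until k survivors remain or the round budget runs out."""
    fetch = k * oversample
    hits = []
    for _ in range(max_rounds):
        hits = [(i, d) for i, d in search_fn(query, fetch) if predicate(i)]
        if len(hits) >= k:
            return hits[:k]
        fetch *= 2        # strict filter: widen the candidate pool and retry
    return hits           # best effort when the filter is very selective

# Toy ANN: ids 0..n-1 with distance equal to the id; keep multiples of 10.
fake_search = lambda q, n: [(i, float(i)) for i in range(n)]
top = filtered_search(fake_search, query=None, k=5,
                      predicate=lambda i: i % 10 == 0)
```

The failure mode is visible in the retry loop: the stricter the filter, the more graph work is wasted on rejected candidates, which is the cost filter-aware extra edges are designed to avoid.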
Benchmarks & Metrics
What a serious benchmark suite measures
A billion-scale HNSW benchmark should never collapse to one latency number. You need a multidimensional scorecard because different knobs move different parts of the cost surface.
| Metric | Why it matters | What usually moves it |
|---|---|---|
| Recall@K | Whether the graph is finding the right neighborhood | M, efConstruction, efSearch, compression choice |
| p95/p99 latency | Whether the tail is stable enough for production | efSearch, shard skew, filters, cache warmth |
| Bytes per vector | Whether the design fits real memory budgets | Graph degree, replicas, quantization, payload edges |
| Build throughput | Whether you can ingest or rebuild on schedule | efConstruction, parallelism, vector encoding |
| Filtered recall | Whether predicates break graph traversal | Payload indexing, extra edges, fallback rescoring |
| Freshness lag | Whether new vectors become searchable fast enough | Segment merges, async builds, mutation strategy |
Published reference points worth anchoring on
- The original HNSW paper reports logarithmic complexity scaling and states that the method strongly outperformed prior open-source vector-only approaches in its evaluation.
- Faiss’s billion-scale GPU paper reports k-selection at up to 55% of theoretical peak performance and a nearest-neighbor implementation 8.5x faster than prior GPU state of the art.
- The same Faiss paper reports building a high-accuracy k-NN graph on 95 million images in 35 minutes, and a graph connecting 1 billion vectors in less than 12 hours on 4 Maxwell Titan X GPUs.
- OpenSearch documents that higher ef_search improves recall at the cost of latency, and that higher oversample_factor improves accuracy by considering more candidates before ranking.
- Qdrant documents a default `full_scan_threshold` of 10,000 KB, below which the planner prefers full scan over graph traversal for better performance on small candidate sets.
The important point is not to copy these numbers into your architecture review. It is to copy the methodology. Published ANN wins almost always come from measuring recall, latency, and memory together rather than treating them as independent concerns.
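The full-scan threshold is a good example of methodology you can carry over: the decision reduces to comparing a segment's raw vector payload against a size cutoff. A sketch that mirrors the behavior Qdrant documents, not its actual code:

```python
def choose_strategy(n_vectors, dim, full_scan_threshold_kb=10_000):
    """Planner sketch: if a segment's raw float32 vector payload is
    below the threshold (in KB), brute-force scan wins; otherwise
    traverse the HNSW graph."""
    vector_kb = n_vectors * dim * 4 / 1024
    return "full_scan" if vector_kb < full_scan_threshold_kb else "hnsw"
```

This matters most for multi-tenant systems: small or cold collections never pay graph-build or traversal overhead, while hot shards automatically get the logarithmic path.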
How to run the sweep
- Generate exact ground truth on a representative subset with Flat search.
- Sweep M first, because graph quality sets the ceiling for everything downstream.
- Sweep efConstruction second until build cost stops buying meaningful recall.
- Sweep efSearch last for service-level tuning.
- Repeat the full matrix under filtered and unfiltered workloads.
- Measure tails under concurrent write load if the system is mutable.
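The scoring half of that sweep fits in a few lines. Below is a minimal recall harness: Flat search provides ground truth, and a 4-bit-rounded copy of the vectors stands in for a hypothetical compressed index so recall can realistically dip below 1.0. Only the metric functions are the reusable part.

```python
import math
import random

def exact_topk(vectors, query, k):
    """Flat-search ground truth (step 1 of the sweep)."""
    return sorted(range(len(vectors)),
                  key=lambda i: math.dist(vectors[i], query))[:k]

def recall_at_k(approx_ids, exact_ids, k):
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

random.seed(1)
dim, n, k = 16, 500, 10
vectors = [[random.random() for _ in range(dim)] for _ in range(n)]
queries = [[random.random() for _ in range(dim)] for _ in range(20)]

# Stand-in approximate index: exact search over 4-bit-rounded vectors.
coarse = [[round(x * 15) for x in v] for v in vectors]

scores = [
    recall_at_k(exact_topk(coarse, [x * 15 for x in q], k),
                exact_topk(vectors, q, k), k)
    for q in queries
]
mean_recall = sum(scores) / len(scores)
```

In a real sweep, the `coarse` stand-in is replaced by the actual index under each (M, efConstruction, efSearch, quantization) configuration, and recall is recorded alongside latency percentiles and bytes per vector rather than alone.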
If you publish the result internally, clean up benchmark configs and snippets with TechBytes’ Code Formatter so reviewers can diff parameter changes without noise.
Strategic Impact
Why this matters beyond retrieval latency
Optimizing HNSW changes the economics of the whole vector platform.
- Lower bytes per vector increases tenant density per node.
- Higher filtered recall reduces the need for expensive application-side retries.
- Better build throughput shortens backfill and disaster-recovery windows.
- Smarter rescoring lets teams buy accuracy only where it creates user value.
That is why vector infrastructure teams increasingly talk about HNSW in the language of cost allocation and SLOs, not just ANN theory. The graph is part of your serving path, part of your ingestion path, and part of your cloud bill.
Data governance shows up in benchmark design
The dirty secret of vector benchmarking is that representative payload filters often reveal sensitive business structure: customer tiers, policy flags, catalog partitions, or moderation states. If you build reusable benchmark corpora from production-adjacent metadata, scrub those payload fields before they circulate. TechBytes’ Data Masking Tool is a practical way to anonymize the structured side of those datasets without breaking test workflows.
Road Ahead
The next round of HNSW optimization is less about discovering a new magic parameter and more about composing the index with the rest of the stack.
- More systems will combine graph search with aggressive compression and selective full-precision reranking.
- Filter-aware graph augmentation will become standard because strict predicates are now normal, not edge cases.
- Dynamic switching between Flat and HNSW will matter more for multi-tenant and cold-collection workloads.
- GPU-assisted preprocessing and graph construction will keep pushing build times down for very large refresh cycles.
- Write-heavy deployments will keep pressuring vendors to close the gap between immutable high-performance graphs and fully mutable production indexes.
The long-term lesson is straightforward: HNSW remains the workhorse of vector retrieval, but the winning implementation is the one that treats graph topology, filters, compression, and operational lifecycle as one system.
Primary sources: HNSW paper, Faiss index docs, OpenSearch method docs, OpenSearch k-NN query docs, Qdrant indexing docs, Qdrant quantization docs, Weaviate vector index docs, and Billion-scale similarity search with GPUs.