HNSW Index Optimization for Billion-Scale Search [2026]

Dillip Chowdary
Tech Entrepreneur & Innovator · May 05, 2026 · 11 min read

Bottom Line

At billion scale, HNSW is rarely limited by raw search math alone. The real optimization work is balancing graph density, candidate breadth, filter behavior, and memory compression so recall stays predictable under production load.

Key Takeaways

  • M, efConstruction, and efSearch control most of the recall-latency-memory tradeoff.
  • HNSW scales like a graph system, not just an ANN algorithm: build cost, deletes, merges, and filters matter.
  • Scalar quantization can cut vector memory by 4x, with error usually under 1% in Qdrant’s tests.
  • For strict filters, graph connectivity can degrade; filter-aware edges and rescoring become first-class tuning tools.
  • Billion-scale success depends on benchmark discipline: track Recall@K, p95/p99 latency, bytes per vector, and freshness lag together.

A well-tuned HNSW index is still the default answer for high-recall vector retrieval because it replaces linear scans with layered graph traversal that typically scales logarithmically with collection size. But at billion scale, the easy story ends. The engineering challenge shifts from “which ANN algorithm?” to “how dense can the graph be, how much recall can we buy per byte, and what breaks when filters, deletes, and fresh writes hit the same system?”

The Lead

For billion-scale vector search, HNSW wins when you treat it as an operational graph system, not a one-time ANN library choice. The biggest gains usually come from memory-aware graph sizing, compression, and filter-safe query planning, not from blindly raising efSearch.

Why HNSW still dominates

The original HNSW paper describes a multilayer proximity graph where upper layers are sparse and lower layers are dense, letting search start with long jumps and finish with local refinement. That hierarchy is what makes the structure practical at large scale: high layers skip most of the search space, while the bottom layer recovers quality. Weaviate’s documentation summarizes the operational consequence cleanly: a flat index grows with linear search cost, while HNSW scales with logarithmic query complexity on large collections.

  • Flat search is simple and memory-light, but latency grows with every added vector.
  • HNSW spends memory on graph links so query latency grows much more slowly.
  • The trade is front-loaded: more memory, more build work, and more operational care in exchange for stable search performance.

Where teams misread the problem

The common failure mode is to frame HNSW as a pure query-time optimization. At tens of millions of vectors, that is still survivable. At hundreds of millions or billions, the dominant questions are different.

  • How many edges per node can you afford in RAM?
  • Can your write path tolerate higher efConstruction during ingestion?
  • Do filtered queries preserve graph connectivity, or do they strand the search in disconnected neighborhoods?
  • Can you compress vectors enough to keep the graph hot without destroying final recall?

Watch out: If you only benchmark unfiltered top-k latency on a warmed cache, you will systematically overestimate production quality. Real failures show up under mixed filter selectivity, shard skew, and freshness pressure.

Architecture & Implementation

The three knobs that dominate outcomes

Most systems expose some variation of the same control surface. In Faiss, OpenSearch 2.19, Qdrant, and hnswlib, the tuning story repeatedly collapses to three parameters (a minimal sketch follows the list).

  • M: graph degree. Higher values create more links, which improves navigability and recall, but increases memory use significantly.
  • efConstruction: build-time candidate breadth. Higher values improve graph quality and usually recall, but slow indexing.
  • efSearch: query-time candidate breadth. Higher values improve recall, but raise latency and CPU cost.
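
Concretely, here is a minimal hnswlib sketch that wires all three knobs together; the dataset sizes and parameter values are illustrative placeholders, not recommendations:

import hnswlib
import numpy as np

dim, n = 128, 100_000  # illustrative sizes
data = np.random.rand(n, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
# M and ef_construction are fixed at build time and set graph quality.
index.init_index(max_elements=n, M=16, ef_construction=256)
index.add_items(data)

# ef (efSearch) is the only runtime knob: higher = more recall, more latency.
index.set_ef(128)
labels, distances = index.knn_query(data[:10], k=50)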

This is why naive scaling fails. Raising efSearch can rescue mediocre recall for a while, but it cannot fully compensate for an under-connected graph. If M is too small for the intrinsic geometry of your data, search wastes effort exploring a badly connected neighborhood rather than a larger useful one.

The real cost model

Faiss documents IndexHNSWFlat as vector storage plus graph-link overhead that grows with M. That sounds obvious until you price it at billion scale. Even a modest increase in graph degree can translate into hundreds of gigabytes of extra resident memory once you include vectors, adjacency structures, shard replication, and query buffers.
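
A back-of-envelope sketch makes that pricing concrete. The link-overhead formula below is an assumption for illustration (roughly 2*M links at level 0 plus about M across upper layers); implementations differ in their exact accounting:

# Rough HNSW memory model; the link count is an approximation,
# not any library's exact accounting.
def hnsw_bytes_per_vector(dim: int, m: int,
                          bytes_per_component: int = 4,   # float32
                          link_bytes: int = 4) -> int:    # int32 neighbor ids
    vector_bytes = dim * bytes_per_component
    graph_bytes = (2 * m + m) * link_bytes  # ~2*M links at level 0, ~M above
    return vector_bytes + graph_bytes

for m in (16, 32, 64):
    per_vec = hnsw_bytes_per_vector(dim=768, m=m)
    tb = per_vec * 1_000_000_000 / 1e12
    print(f"M={m:2d}: {per_vec} bytes/vector -> ~{tb:.2f} TB per replica at 1B vectors")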

  • Memory is the first constraint, not an afterthought.
  • Index build throughput matters because rebuilding dense graphs is expensive.
  • Delete behavior matters because not every implementation supports cheap removal.

That last point is easy to miss. Faiss notes that HNSW does not support removing vectors because doing so destroys the graph structure. By contrast, Weaviate highlights a custom HNSW implementation with full CRUD support. That is a product-level choice with real architectural consequences: mutable graphs are more operationally flexible, but the system has to pay for that flexibility in maintenance logic, compaction, and repair paths.

Compression is not optional anymore

At billion scale, pure HNSW Flat is often too expensive unless the application can justify premium memory footprints. The practical move is to combine graph navigation with compressed vector storage and selective rescoring.

  • Faiss supports IndexHNSWSQ, IndexHNSWPQ, and IndexHNSW2Level.
  • OpenSearch supports HNSW with Faiss encoders and query-time rescoring via oversample_factor.
  • Qdrant documents scalar quantization, product quantization, and binary quantization as memory/performance levers.

Qdrant gives one of the clearest published numbers here: float32 -> uint8 scalar quantization cuts vector memory by 4x, and its docs say the resulting error is usually below 1% in their experiments. For compatible models, its binary quantization path can reduce memory by up to 32x and deliver up to a 40x speedup. Those are not universal guarantees, but they show why compression has become a first-class architectural tool rather than a last-mile tweak.
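As one concrete shape of that combination, here is an OpenSearch-style index method and query configuration, sketched in schematic YAML; the parameter values are illustrative, not recommendations: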

method:
  name: hnsw
  engine: faiss
  parameters:
    m: 16
    ef_construction: 256

query:
  k: 50
  method_parameters:
    ef_search: 128
  rescore:
    oversample_factor: 2.0

The pattern is simple: use the graph to find promising candidates fast, then spend precise distance work only where it changes rank quality.
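
A minimal Faiss sketch of that pattern follows, assuming the full-precision vectors still fit in memory for the rescoring step (in production they may live on disk or in a separate store); all sizes and values are illustrative:

import faiss
import numpy as np

d, n, k, oversample = 768, 100_000, 50, 2
xb = np.random.rand(n, d).astype(np.float32)
xq = np.random.rand(10, d).astype(np.float32)

# HNSW graph over 8-bit scalar-quantized vector storage (M=32).
index = faiss.IndexHNSWSQ(d, faiss.ScalarQuantizer.QT_8bit, 32)
index.hnsw.efConstruction = 256
index.train(xb)  # the scalar quantizer needs a training pass
index.add(xb)
index.hnsw.efSearch = 128

# Oversample candidates from the compressed index, then rescore
# only those candidates with exact float32 distances.
_, cand = index.search(xq, k * oversample)
for qi, ids in enumerate(cand):
    exact = np.linalg.norm(xb[ids] - xq[qi], axis=1)
    top_k = ids[np.argsort(exact)[:k]]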

Filters change everything

Unfiltered nearest-neighbor search is the clean demo case. Production retrieval rarely looks like that. Tenant isolation, access control, catalog constraints, and recency windows all introduce filters that can fracture graph traversal.

Qdrant is unusually explicit about this. Its docs warn that strict filters can make the HNSW graph “fall apart,” and describe filterable graph extensions that add extra edges based on indexed payload fields. The same docs recommend creating payload indexes before data ingestion so those extra edges can be built up front. That is a subtle but important systems lesson: filtered vector search is not just “vector search plus a WHERE clause.” It often requires a different graph.
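
With recent versions of the qdrant-client Python package, that ordering looks roughly like the sketch below; the collection name, payload field, and vector size are hypothetical:

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Index the payload field *before* bulk ingestion so the
# filter-aware extra edges can be built as vectors arrive.
client.create_payload_index(
    collection_name="docs",
    field_name="tenant_id",
    field_schema=models.PayloadSchemaType.KEYWORD,
)

# A filtered query can then traverse the augmented graph instead
# of post-filtering a fractured candidate set.
hits = client.query_points(
    collection_name="docs",
    query=[0.1] * 768,  # placeholder query vector
    query_filter=models.Filter(
        must=[models.FieldCondition(key="tenant_id",
                                    match=models.MatchValue(value="acme"))]
    ),
    limit=50,
)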

Pro tip: Benchmark at least three filter regimes separately: no filter, weak filter, and strict filter. If only the filtered cases degrade, your graph topology is fine but your filter strategy is not.

Benchmarks & Metrics

What a serious benchmark suite measures

A billion-scale HNSW benchmark should never collapse to one latency number. You need a multidimensional scorecard because different knobs move different parts of the cost surface.

Metric           | Why it matters                                       | What usually moves it
Recall@K         | Whether the graph is finding the right neighborhood  | M, efConstruction, efSearch, compression choice
p95/p99 latency  | Whether the tail is stable enough for production     | efSearch, shard skew, filters, cache warmth
Bytes per vector | Whether the design fits real memory budgets          | Graph degree, replicas, quantization, payload edges
Build throughput | Whether you can ingest or rebuild on schedule        | efConstruction, parallelism, vector encoding
Filtered recall  | Whether predicates break graph traversal             | Payload indexing, extra edges, fallback rescoring
Freshness lag    | Whether new vectors become searchable fast enough    | Segment merges, async builds, mutation strategy

Published reference points worth anchoring on

  • The original HNSW paper reports logarithmic complexity scaling and states that the method strongly outperformed prior open-source vector-only approaches in its evaluation.
  • Faiss’s billion-scale GPU paper reports k-selection at up to 55% of theoretical peak performance and a nearest-neighbor implementation 8.5x faster than prior GPU state of the art.
  • The same Faiss paper reports building a high-accuracy k-NN graph on 95 million images in 35 minutes, and a graph connecting 1 billion vectors in less than 12 hours on 4 Maxwell Titan X GPUs.
  • OpenSearch documents that higher ef_search improves recall at the cost of latency, and that higher oversample_factor improves accuracy by considering more candidates before ranking.
  • Qdrant documents a default full_scan_threshold of 10,000 KB, below which the planner prefers full scan over graph traversal for better performance on small candidate sets.

The important point is not to copy these numbers into your architecture review. It is to copy the methodology. Published ANN wins almost always come from measuring recall, latency, and memory together rather than treating them as independent concerns.

How to run the sweep

  • Generate exact ground truth on a representative subset with Flat search (see the sketch after this list).
  • Sweep M first, because graph quality sets the ceiling for everything downstream.
  • Sweep efConstruction second until build cost stops buying meaningful recall.
  • Sweep efSearch last for service-level tuning.
  • Repeat the full matrix under filtered and unfiltered workloads.
  • Measure tails under concurrent write load if the system is mutable.
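
A minimal Faiss harness for the ground-truth and efSearch steps might look like this; sizes are small illustrative stand-ins, and batch timing only approximates per-query latency (track real p95/p99 from per-query measurements):

import time

import faiss
import numpy as np

d, n, nq, k = 128, 200_000, 1_000, 50  # illustrative sizes
xb = np.random.rand(n, d).astype(np.float32)
xq = np.random.rand(nq, d).astype(np.float32)

# Step 1: exact ground truth from a Flat index on the evaluation subset.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
_, gt = flat.search(xq, k)

# Steps 2-3 assumed done: M and efConstruction already swept and chosen.
hnsw = faiss.IndexHNSWFlat(d, 32)  # M=32
hnsw.hnsw.efConstruction = 256
hnsw.add(xb)

# Step 4: sweep efSearch for the service-level recall/latency curve.
for ef in (32, 64, 128, 256):
    hnsw.hnsw.efSearch = ef
    t0 = time.perf_counter()
    _, pred = hnsw.search(xq, k)
    ms = (time.perf_counter() - t0) / nq * 1e3  # batch mean, not p95/p99
    recall = np.mean([len(set(p) & set(g)) / k for p, g in zip(pred, gt)])
    print(f"efSearch={ef:4d}  Recall@{k}={recall:.3f}  ~{ms:.2f} ms/query")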

If you publish the result internally, clean up benchmark configs and snippets with TechBytes’ Code Formatter so reviewers can diff parameter changes without noise.

Strategic Impact

Why this matters beyond retrieval latency

Optimizing HNSW changes the economics of the whole vector platform.

  • Lower bytes per vector increases tenant density per node.
  • Higher filtered recall reduces the need for expensive application-side retries.
  • Better build throughput shortens backfill and disaster-recovery windows.
  • Smarter rescoring lets teams buy accuracy only where it creates user value.

That is why vector infrastructure teams increasingly talk about HNSW in the language of cost allocation and SLOs, not just ANN theory. The graph is part of your serving path, part of your ingestion path, and part of your cloud bill.

Data governance shows up in benchmark design

The dirty secret of vector benchmarking is that representative payload filters often reveal sensitive business structure: customer tiers, policy flags, catalog partitions, or moderation states. If you build reusable benchmark corpora from production-adjacent metadata, scrub those payload fields before they circulate. TechBytes’ Data Masking Tool is a practical way to anonymize the structured side of those datasets without breaking test workflows.

Road Ahead

The next round of HNSW optimization is less about discovering a new magic parameter and more about composing the index with the rest of the stack.

  • More systems will combine graph search with aggressive compression and selective full-precision reranking.
  • Filter-aware graph augmentation will become standard because strict predicates are now normal, not edge cases.
  • Dynamic switching between Flat and HNSW will matter more for multi-tenant and cold-collection workloads.
  • GPU-assisted preprocessing and graph construction will keep pushing build times down for very large refresh cycles.
  • Write-heavy deployments will keep pressuring vendors to close the gap between immutable high-performance graphs and fully mutable production indexes.

The long-term lesson is straightforward: HNSW remains the workhorse of vector retrieval, but the winning implementation is the one that treats graph topology, filters, compression, and operational lifecycle as one system.

Primary sources: HNSW paper, Faiss index docs, OpenSearch method docs, OpenSearch k-NN query docs, Qdrant indexing docs, Qdrant quantization docs, Weaviate vector index docs, and Billion-scale similarity search with GPUs.

Frequently Asked Questions

What does efSearch do in an HNSW index?
efSearch controls how many candidate nodes the search explores before returning the top results. Raising it usually improves recall, but it also increases query latency and CPU work. It is the safest runtime knob, but it cannot fully rescue a poorly connected graph.
How do I choose M and efConstruction for HNSW?
Start by sweeping M because it determines graph connectivity and memory overhead. Then increase efConstruction until recall gains flatten relative to indexing cost. In practice, teams usually find that a stronger graph lets them run a lower efSearch and save latency later.
Does HNSW work well with filtered vector search?
It can, but strict filters are a common failure mode because they can disconnect the effective search graph. Systems like Qdrant add filter-aware edges to reduce this risk, and some engines rely on oversampling and rescoring. You should benchmark filtered and unfiltered workloads separately.
Is HNSW good for billion-scale vector databases?
Yes, but only if you design for the full memory and lifecycle cost. At that scale, graph degree, compression, shard layout, and rebuild strategy matter as much as raw query speed. A billion-vector deployment without disciplined memory accounting will hit infrastructure limits before ANN math becomes the bottleneck.
