Vector DB Benchmarking: HNSW vs DiskANN Deep Dive 2026
Bottom Line
If your graph fits comfortably in memory, HNSW is usually the simpler and faster operational choice. If your dataset is pushing DRAM limits, DiskANN changes the economics by trading some implementation complexity for far higher node density.
Key Takeaways
- DiskANN reported >5,000 QPS, <3 ms mean latency, and 95%+ 1-recall@1 on SIFT1B with 64 GB RAM plus SSD
- HNSW's strength is RAM-first retrieval with simple tuning around M, ef_construction, and ef
- The real benchmark is recall-targeted p95 and p99 latency, not a single QPS number
- Use identical embeddings, filters, cache state, and exact ground truth, or the comparison is noise
At billion scale, vector search performance stops being a library bake-off and becomes a memory-topology problem. HNSW remains the default choice because it is fast, mature, and easy to reason about when the full graph lives in DRAM. DiskANN matters because it attacks the cost boundary directly: the original NeurIPS 2019 result showed a billion-point index on a single workstation with 64 GB RAM plus SSD, while still delivering high recall and low latency. That is the real comparison to benchmark in 2026.
| Dimension | HNSW | DiskANN | Edge |
|---|---|---|---|
| Primary design goal | Low-latency in-memory ANN | High-density ANN with SSD-resident index | Depends on memory budget |
| Topology | Multi-layer navigable small-world graph | Vamana-style graph with disk-aware access pattern | HNSW for simplicity |
| Scaling behavior | Excellent until DRAM becomes the bottleneck | Designed for billion-scale density on a single node | DiskANN |
| Tuning surface | Usually smaller and easier to explain | More coupled to storage layout and cache behavior | HNSW |
| Tail-latency sensitivity | Mostly CPU and NUMA effects | CPU, NUMA, NVMe queue depth, and cache state | HNSW |
| Cost efficiency at scale | Can become DRAM-heavy | Shifts economics toward SSD plus selective RAM | DiskANN |
Architecture & Implementation
Bottom Line
Choose HNSW when you want the most predictable RAM-first latency profile. Choose DiskANN when your retrieval tier is starting to look like a DRAM procurement problem.
Why HNSW still wins so many production rollouts
The 2016 HNSW paper introduced a hierarchical graph that starts search from sparse upper layers and descends into denser lower layers, achieving logarithmic complexity scaling in practice. That architecture is the reason it is still everywhere: the mental model is simple, the search path is intuitive, and the tuning knobs map cleanly to tradeoffs engineers actually care about.
- M governs graph connectivity and therefore index size, search quality, and construction cost.
- ef_construction buys better graph quality during build at the cost of more CPU time.
- ef is the direct query-time dial for recall versus latency.
- Most operational failures are understandable: memory blowups, long builds, or degraded recall under undersized parameters.
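As a concrete illustration of the knobs above, here is a minimal sketch using the hnswlib library; the corpus shape and parameter values are hypothetical placeholders you would tune against a real recall target:

```python
import numpy as np
import hnswlib

dim, n = 128, 100_000                              # hypothetical corpus shape
data = np.random.rand(n, dim).astype(np.float32)   # stand-in for real embeddings

index = hnswlib.Index(space="l2", dim=dim)
# M controls graph connectivity (and therefore index size, quality, build cost);
# ef_construction trades extra build-time CPU for a better graph.
index.init_index(max_elements=n, M=16, ef_construction=200)
index.add_items(data, np.arange(n))

# ef is the query-time recall-vs-latency dial; raise it until the recall target holds.
index.set_ef(100)
labels, distances = index.knn_query(data[:10], k=10)
```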
In practice, HNSW is strongest when your embeddings, graph links, and auxiliary metadata all fit comfortably in memory. In that regime, the engine is mostly fighting CPU cache locality, thread scheduling, and NUMA placement, not storage latency. That makes it easier to benchmark and easier to explain to the rest of the organization.
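A back-of-envelope estimate shows where that regime ends; the corpus size, dimensionality, and link counts below are illustrative assumptions, not measurements:

```python
# Rough DRAM estimate for a fully in-memory HNSW index.
# Assumed (hypothetical): 1e9 vectors, 128-dim float32, M=16,
# roughly 2*M neighbor ids per node in the base layer at 4 bytes each.
n, dim, M = 1_000_000_000, 128, 16
raw_vectors_gb = n * dim * 4 / 1e9      # ~512 GB of raw float32 vectors
graph_links_gb = n * 2 * M * 4 / 1e9    # ~128 GB of base-layer neighbor lists
print(f"vectors ~{raw_vectors_gb:.0f} GB, links ~{graph_links_gb:.0f} GB")
```

Even before metadata, filters, and replication, that is well past commodity single-node DRAM, which is exactly the boundary the DiskANN result targets.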
What DiskANN changes at billion scale
DiskANN starts from a harsher premise: main memory is too expensive to hold the whole structure at the scale teams actually want. Its original paper showed that a billion-point database could be indexed, stored, and searched on a single workstation with just 64 GB RAM and an SSD, while serving more than 5,000 queries per second at under 3 ms mean latency and 95%+ 1-recall@1 on SIFT1B. That is not just a speed result. It is a density result.
The architectural shift is straightforward but operationally significant.
- The graph is built to support efficient greedy traversal with disk awareness.
- Only the hottest structures stay resident in RAM; the wider index can live on NVMe.
- Performance depends on how well the system minimizes random reads and exploits beam-style search behavior.
- Storage characteristics become part of the retrieval algorithm, not just deployment plumbing.
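The traversal pattern behind those points can be sketched conceptually. This is not the DiskANN implementation; `read_node` is a hypothetical helper standing in for one random SSD read per node, and the parameters are illustrative:

```python
def beam_search(query, entry_id, read_node, dist, beam_width=4, list_size=64, k=10):
    """Conceptual beam-style greedy search over a disk-resident graph.

    read_node(node_id) -> (vector, neighbor_ids) models one random SSD read;
    beam_width bounds how many reads are issued per round, and list_size bounds
    the candidate pool and therefore the total read count.
    """
    vec, nbrs = read_node(entry_id)
    candidates = [(dist(query, vec), entry_id, nbrs)]   # (distance, id, neighbors)
    expanded, seen = set(), {entry_id}

    while True:
        # Expand the closest unexpanded candidates, up to the beam width.
        frontier = [c for c in sorted(candidates) if c[1] not in expanded][:beam_width]
        if not frontier:
            break
        for _, node_id, node_nbrs in frontier:
            expanded.add(node_id)
            for nbr in node_nbrs:
                if nbr not in seen:
                    seen.add(nbr)
                    nbr_vec, nbr_nbrs = read_node(nbr)   # one random SSD read
                    candidates.append((dist(query, nbr_vec), nbr, nbr_nbrs))
        # Prune the pool; keeping it small is what keeps the read count bounded.
        candidates = sorted(candidates)[:list_size]

    return [node_id for _, node_id, _ in sorted(candidates)[:k]]
```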
This is why benchmarking DiskANN badly is easy. If your test harness does not control SSD class, queue depth, warm versus cold cache state, and concurrency, you are not benchmarking the algorithm. You are benchmarking storage noise.
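One concrete example of that control: forcing a genuinely cold page cache between runs. A minimal Linux-only sketch (requires root; it uses the standard /proc/sys/vm/drop_caches interface):

```python
import subprocess

def drop_page_cache():
    """Flush dirty pages, then drop the Linux page cache, dentries, and inodes.

    Without this step (or an equivalent), a "cold cache" DiskANN run can be
    silently served from DRAM, and the measured tail latency is meaningless.
    """
    subprocess.run(["sync"], check=True)
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")
```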
Benchmarks & Metrics
The benchmark design that actually answers the decision
Most teams start with a benchmark question that is too vague: “Which index is faster?” The useful question is narrower: “Which index hits our recall target, under our concurrency, on our hardware, at our freshness requirements, within our budget?” That forces the methodology to match production reality.
- Fix the dataset and embedding model across all runs. Do not compare different vector distributions.
- Generate exact ground truth with brute-force or a trusted exact baseline on the same query set.
- Benchmark at recall targets, not at arbitrary default settings.
- Separate warm-cache and cold-cache runs.
- Measure single-query, batch, and concurrent-query behavior independently.
- Include filtered retrieval if production search is not pure ANN.
- Record ingest cost, rebuild time, and update behavior, not just search latency.
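For the ground-truth step above, a minimal NumPy sketch of exact brute-force neighbors and recall@k, assuming arrays small enough to score in memory (a billion-scale corpus would need sharding or a trusted exact baseline):

```python
import numpy as np

def exact_ground_truth(corpus, queries, k=10):
    """Brute-force exact k-NN under L2; the reference every ANN run is scored against."""
    # (n_queries, n_corpus) squared-distance matrix; shard this for large corpora.
    d2 = ((queries[:, None, :] - corpus[None, :, :]) ** 2).sum(-1)
    return np.argsort(d2, axis=1)[:, :k]

def recall_at_k(ann_ids, true_ids):
    """Fraction of exact neighbors recovered by the ANN index, averaged over queries."""
    hits = [len(set(a) & set(t)) / len(t) for a, t in zip(ann_ids, true_ids)]
    return float(np.mean(hits))
```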
ANN-Benchmarks remains useful because it standardizes recall-versus-throughput curves across many ANN implementations. But its public plots are not enough for a 2026 infrastructure decision on billion-scale embeddings. The moment SSD residency, multi-tenant interference, or metadata filters enter the picture, you need a custom harness.
```yaml
benchmark:
  dataset:
    vectors: production-sample-or-sift1b
    dimensionality: fixed
    distance: cosine-or-l2
  ground_truth:
    method: exact
    k: 10
  runs:
    - index: HNSW
      target_recall: [0.90, 0.95, 0.99]
      cache_state: [warm, cold]
    - index: DiskANN
      target_recall: [0.90, 0.95, 0.99]
      cache_state: [warm, cold]
  metrics:
    - qps
    - p50_latency_ms
    - p95_latency_ms
    - p99_latency_ms
    - build_hours
    - ram_per_vector
    - ssd_bytes_per_vector
    - update_cost
    - recall_at_10
```
The metrics that separate pretty demos from production systems
At this level, one headline QPS figure is almost meaningless. The important metrics are the ones that reveal operational shape.
- Recall-targeted latency: Compare p50, p95, and p99 only at the same recall.
- RAM per vector: This usually decides whether HNSW remains viable on a given node class.
- SSD footprint and read amplification: Essential for DiskANN economics and tail behavior.
- Build throughput: A slow rebuild can turn a good search index into a bad platform choice.
- Freshness cost: Incremental updates, background compaction, and rebuild cadence matter in RAG and recommender systems.
- Concurrency stability: Some systems look great in single-thread tests and collapse at realistic QPS.
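A small sketch of how the recall-targeted reporting above might be wired up, assuming per-query wall-clock latencies collected from a run whose recall has already been measured against exact ground truth:

```python
import numpy as np

def latency_report(latencies_ms, recall, target_recall):
    """Summarize tail latency, but only for runs that met the recall target."""
    if recall < target_recall:
        return {"valid": False, "reason": f"recall {recall:.3f} < target {target_recall}"}
    lat = np.asarray(latencies_ms)
    return {
        "valid": True,
        "qps": len(lat) / (lat.sum() / 1000.0),   # single-threaded approximation
        "p50_ms": float(np.percentile(lat, 50)),
        "p95_ms": float(np.percentile(lat, 95)),
        "p99_ms": float(np.percentile(lat, 99)),
    }
```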
The most common outcome is not that one index dominates every chart. It is that HNSW wins the low-complexity DRAM-friendly lane, while DiskANN wins the density-per-node lane. If your benchmark does not expose that boundary, the methodology is too blunt.
When to Choose HNSW or DiskANN
Choose HNSW when:
- Your full serving index fits in DRAM with acceptable headroom.
- You need predictable low-latency behavior without depending on NVMe characteristics.
- Your team wants a smaller tuning surface and simpler failure analysis.
- You expect rapid iteration on search quality and do not want storage layout to become part of every experiment.
- Your platform standardizes on engines and libraries where HNSW support is the most mature path.
Choose DiskANN when:
- Your vector corpus is large enough that HNSW turns into a memory-capacity or memory-cost problem.
- You want higher points-per-node density and can invest in disciplined benchmark engineering.
- Your infrastructure already treats NVMe performance as a first-class production variable.
- You can tolerate a more coupled system where graph quality, cache residency, and storage behavior interact.
- You care more about single-node scale efficiency than about having the simplest possible ANN architecture.
A useful rule of thumb is this: if your planned growth curve says “buy more RAM” every quarter, you should benchmark DiskANN immediately. If your main pain is shaving a few milliseconds while the dataset still fits in memory, stay with HNSW until the economics force a harder move.
Strategic Impact
This is a systems-budget decision, not just an algorithm decision
Vector retrieval has moved from experimental feature to always-on platform layer. That changes how engineering leaders should read benchmarks. The key variable is no longer just whether a method reaches a high recall target; it is whether the method does so inside the cost envelope of your serving fleet.
- HNSW often minimizes engineering friction but can maximize DRAM spend at large scale.
- DiskANN can reduce the memory burden per indexed vector, but it raises the bar for hardware-aware operations.
- Multi-tenant environments exaggerate the difference because storage interference and noisy-neighbor effects hit disk-backed paths harder.
- Disaster recovery, replication, and rebuild procedures change when the index is no longer a purely in-memory artifact.
That strategic tradeoff matters in several 2026 workloads.
- RAG platforms care about freshness, tenant isolation, and tail latency under bursty traffic.
- Recommendation systems care about dense catalogs, batch query patterns, and rebuild cadence.
- Multimodal search raises vector dimensionality and footprint, making the memory boundary arrive sooner.
The deeper point is that benchmark winners can invert when the organization scales. A team that sees HNSW win cleanly at tens of millions of vectors may still choose DiskANN because the next order of magnitude is already visible in the product roadmap.
Road Ahead
The next wave of vector benchmarking is moving beyond pure ANN speed charts. The harder questions are about filtered retrieval, hybrid ranking pipelines, and continuously mutating corpora.
- Filtered vector search will matter more, because real systems increasingly combine ANN with tenant, policy, and metadata constraints.
- Auto-tuning will matter more, because the optimal settings depend on recall target, vector distribution, and hardware class.
- Hybrid memory-disk designs will keep improving, because they align better with billion-scale economics than all-DRAM assumptions.
- Distributed graph search will matter more, because single-node density is only step one for global-scale applications.
- GPU re-ranking and compressed candidate stages will increasingly blur the line between retrieval and ranking infrastructure.
That is why the durable takeaway is not “HNSW beats DiskANN” or the reverse. The durable takeaway is that vector database benchmarking has to be architecture-aware. HNSW is still the most practical answer for many RAM-first deployments. DiskANN is the stronger answer when billion-scale density is the binding constraint. In 2026, the teams that benchmark those realities honestly will make better platform decisions than the teams chasing a single benchmark headline.
Frequently Asked Questions
Is HNSW faster than DiskANN for vector search?
HNSW often delivers simpler and more predictable low-latency behavior. DiskANN becomes compelling when scale pushes the index beyond practical DRAM limits and SSD-backed density starts to dominate the cost model.
What should I measure in an HNSW vs DiskANN benchmark?
p95 and p99 latency, QPS, build time, RAM per vector, SSD footprint, and update cost. A single throughput number is not enough because the two methods optimize different parts of the serving stack.
Why do vendor vector benchmarks often disagree?
Because the runs rarely hold hardware, cache state, datasets, recall targets, and filters constant. Unless embeddings, ground truth, and concurrency are identical, the numbers measure the test setup as much as the index.
When does DiskANN become worth the added complexity?
When the vector corpus outgrows practical DRAM and keeping an HNSW graph fully in memory starts to dominate the serving budget. At that point SSD-resident density, NVMe behavior, and cache residency are worth engineering for.