Kernel-Bypass Storage Profiling for NVMe-oF [Deep Dive]
Bottom Line
The 2026 question is no longer whether kernel-bypass NVMe-oF can clear 10M IOPS on real hardware; validated SPDK RDMA results already do. The hard part is profiling queue ownership, NUMA placement, and software overhead precisely enough to keep those gains outside the lab.
Key Takeaways
- SPDK 24.05's RDMA target hit 10.94M 4KiB random-read IOPS with 8 cores.
- The SPDK RDMA initiator reached 5.46M read IOPS on 4 cores and saturated a 200GbE link.
- SPDK showed up to 5.09x higher IOPS per core than the Linux kernel NVMe-oF target.
- Target software overhead dropped by up to 2.26 us; initiator overhead dropped by up to 4.94 us.
- SPDK v26.01 LTS adds NVMe 2.0 target support and RDMA interrupt support.
By May 13, 2026, the interesting part of NVMe-over-Fabrics is not raw protocol maturity. It is how far a carefully profiled kernel-bypass data path can push ordinary x86 servers before CPU overhead, queue ownership, or observability mistakes flatten the curve. Recent SPDK data makes the case clearly: 10M+ IOPS is achievable on validated RDMA setups, and the remaining gap between good and great deployments is mostly architectural discipline, not another flash media breakthrough.
- SPDK 24.05 validated 10.94M 4KiB random-read IOPS on an NVMe-oF RDMA target.
- Poll-mode, run-to-completion, and per-thread queue ownership remain the core scaling pattern.
- Kernel-bypass wins biggest on IOPS/core, not just absolute throughput.
- NUMA, queue depth, and connection count matter as much as NIC speed once you move past 5M IOPS.
- SPDK v26.01 LTS signals a broader 2026 shift with NVMe 2.0 target support and RDMA interrupt support.
| Dimension | Kernel-Bypass NVMe-oF | Linux Kernel NVMe-oF | Edge |
|---|---|---|---|
| Fast-path overhead | User-space polling avoids syscalls and interrupt-heavy completion paths | Broader kernel path and scheduler interaction add overhead | Kernel-bypass |
| CPU efficiency | Up to 5.09x higher IOPS/core in SPDK RDMA target testing | Lower efficiency at similar connection counts | Kernel-bypass |
| Latency trimming | Up to 2.26 us lower target overhead and 4.94 us lower initiator overhead | Higher software contribution to round-trip latency | Kernel-bypass |
| Operational simplicity | Requires explicit core, memory, and tracing discipline | Easier fit for conventional fleet operations | Linux kernel |
| Idle efficiency | Pure polling can burn cores unless tuned carefully | Interrupt-driven behavior is simpler for bursty fleets | Linux kernel |
Architecture & Implementation
Bottom Line
The headline number is real, but it comes from a very specific design: dedicated queues, dedicated cores, minimal locks, and aggressive locality. Profile the software path first; buy more hardware second.
SPDK still provides the clearest reference implementation for the kernel-bypass storage model. Its own architecture documents describe a user-space, polled-mode, asynchronous, lockless NVMe driver that maps device BARs directly, submits work through queue pairs, and relies on application polling rather than kernel interrupts. For remote devices, the same model extends into NVMe-oF hosts and targets.
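Getting to that model on a real machine starts with taking devices away from the kernel driver. A minimal sketch using SPDK's own setup script, run from the SPDK source tree; the hugepage amount is an arbitrary example:

```bash
# Reserve hugepages for SPDK's DMA-safe memory pools and rebind local NVMe
# devices from the kernel nvme driver to vfio-pci (or uio_pci_generic) so a
# user-space driver can map their BARs and queue pairs directly.
sudo HUGEMEM=8192 scripts/setup.sh

# Confirm which devices are now available to user-space drivers and how much
# hugepage memory is reserved per NUMA node.
sudo scripts/setup.sh status
```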
The fast path that matters
On the target side, the important implementation detail is not just that the driver runs in user space. It is that the data path is deliberately structured so ordinary read and write commands stay on the same polling thread from submission through completion.
- Queue pairs are single-thread owned, which removes locks from the I/O path.
- Poll groups do polling and I/O processing on the thread where they were created.
- RDMA zero-copy support avoids intermediate host-memory copies for the transport path.
- Run-to-completion keeps cache locality intact and limits cross-thread coordination.
SPDK's NVMe-oF target programming guide is explicit here: regular READ and WRITE requests do not require cross-thread coordination, while rarer ADMIN operations pause the subsystem briefly rather than forcing locks into every fast-path I/O. That trade is fundamental. It is why the architecture scales so well on read-heavy microbenchmarks and mixed 70/30 workloads.
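Operationally, that design shows up as a target started on an explicit core mask, then configured over JSON-RPC. The sketch below is a minimal bring-up under assumed values: the PCIe address, RDMA-capable IP, core mask, and NQN are placeholders to substitute for your own topology.

```bash
# Run the target on eight dedicated cores (mask 0xFF); these reactors own the
# poll groups that service the RDMA queue pairs.
build/bin/nvmf_tgt -m 0xFF &

# Create the RDMA transport and a subsystem backed by a local NVMe namespace.
scripts/rpc.py nvmf_create_transport -t RDMA
scripts/rpc.py bdev_nvme_attach_controller -b Nvme0 -t PCIe -a 0000:86:00.0
scripts/rpc.py nvmf_create_subsystem nqn.2016-06.io.spdk:cnode1 -a
scripts/rpc.py nvmf_subsystem_add_ns nqn.2016-06.io.spdk:cnode1 Nvme0n1
scripts/rpc.py nvmf_subsystem_add_listener nqn.2016-06.io.spdk:cnode1 \
    -t rdma -a 192.168.100.2 -s 4420 -f ipv4
```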
What the driver actually bypasses
The phrase kernel-bypass can sound larger than it is. In practice, these stacks are not bypassing the operating system entirely. They are bypassing the generic block layer and much of the syscall-and-interrupt churn that conventional storage paths inherit.
- Submission happens directly from the application into device or transport queue structures.
- Completion is discovered by polling with calls such as spdk_nvme_qpair_process_completions().
- Data placement and thread ownership are chosen by the application or target, not by a general scheduler.
- Message passing replaces many lock-based synchronization patterns.
That gives you less abstraction, but also fewer accidental context switches, fewer wakeups, and much less cache thrash at high queue depth.
What changed by 2026
The 2026 story is less about invention than refinement. SPDK v24.09 added NVMe-oF TCP interrupt mode. SPDK v26.01 LTS adds NVMe 2.0 support for the NVMe-oF target and RDMA transport interrupt support. Those features matter because they widen the design space.
- Pure polling still wins the absolute-throughput race.
- Interrupt-capable modes give operators a better option for burstier or power-sensitive deployments.
- NVMe 2.0 target support reduces the pressure to choose between protocol coverage and performance tuning.
In other words, the architecture remains poll-first, but the implementation envelope is becoming easier to productionize.
Profiling Methodology
If you want to understand why one setup clears 10M IOPS and another stalls at half that, profile at three layers at once: transport, target thread model, and backing-device behavior. SPDK's own performance reports are useful precisely because they publish enough setup detail to make the numbers interpretable rather than magical.
- Start with a known hardware topology and document CPU sockets, NUMA nodes, NIC placement, and SSD placement.
- Pin queue ownership to cores before you tune queue depth.
- Separate target-core scaling from initiator-core scaling; they fail in different ways.
- Track aggregate IOPS, average latency, and CPU cores consumed together.
- Validate the write state of drives; burst numbers on non-preconditioned media can mislead.
The SPDK 24.05 NVMe-oF RDMA Performance Report used fio with both the Linux kernel libaio engine and the SPDK bdev engine, plus point-to-point 100GbE links and explicit BIOS performance settings. That is the right pattern: keep the stack simple enough that a regression can be explained.
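Before the first measured run, capture the topology and drive state the checklist above calls for. A minimal sketch; the PCI addresses, device names, and preconditioning parameters are placeholders to adapt to your hardware:

```bash
# Record socket and NUMA layout so queue-ownership decisions are explainable later.
numactl --hardware
lscpu | grep -i numa

# Confirm which NUMA node owns the NIC and each NVMe controller (placeholder BDFs).
cat /sys/bus/pci/devices/0000:3b:00.0/numa_node   # NIC
cat /sys/bus/pci/devices/0000:86:00.0/numa_node   # NVMe SSD

# Precondition the drive so steady-state behavior, not fresh-out-of-box bursts,
# is what the benchmark measures.
sudo fio --name=precondition --filename=/dev/nvme0n1 --direct=1 \
    --rw=write --bs=128k --iodepth=32 --loops=2
```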
The command pattern below comes directly from SPDK documentation: enable nvmf_rdma tracepoints on the target, drive a controlled 4KiB random-read workload, then inspect the trace buffer.

```bash
build/bin/nvmf_tgt -e nvmf_rdma
spdk_nvme_perf -q 128 -o 4096 -w randread -t 600 \
    -r 'trtype:RDMA adrfam:IPv4 traddr:192.168.100.2 trsvcid:4420'
build/bin/spdk_trace -s nvmf -p 24147
```

That lets you attribute stalls to queue starvation, transport completion pacing, or backend-device saturation instead of arguing from averages.
Benchmarks & Metrics
The most useful official data point for this topic is the SPDK 24.05 RDMA performance report using Intel E810-CQDA2 adapters with RoCEv2. It is not a theoretical claim. It is a documented test matrix with published hardware, BIOS, kernel, and build details.
Target-side core scaling
- 4KiB random read scaled from 1.59M IOPS on 1 core to 7.98M IOPS on 4 cores, then to 10.94M IOPS on 8 cores.
- Average latency on that read test fell from 1125.9 us at 1 core to 225.0 us at 4 cores, then to 163.4 us at 8 cores.
- 70/30 random read/write reached almost 8M IOPS by 4 cores and peaked around network saturation with 10 cores.
- 128KiB sequential read delivered 48,077.86 MiB/s, showing that the same design is not only a small-I/O trick.
That profile tells you where the real scaling wall lives. For reads, the fast path stays nearly linear until transport bandwidth and per-core polling efficiency begin to flatten the curve. For mixed workloads, the limiter shifts: backend writes, completion pacing, and queue coordination around shared resources start to dominate earlier.
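Reproducing that curve is mostly a matter of restarting the target on progressively larger core masks and rerunning an identical workload. A minimal sketch using the placeholder address from earlier; the masks and runtime are examples, and the per-restart RPC reconfiguration is elided:

```bash
# On the target host: bring the target up with 1, 4, and 8 dedicated cores in
# turn, re-applying the transport/subsystem RPC configuration after each restart.
build/bin/nvmf_tgt -m 0x1    # 1 core
build/bin/nvmf_tgt -m 0xF    # 4 cores
build/bin/nvmf_tgt -m 0xFF   # 8 cores

# On the initiator host: run the identical 4KiB random-read job against each
# configuration and record IOPS, average latency, and cores consumed together.
# For the 70/30 mix, use the same pattern with: -w randrw -M 70
spdk_nvme_perf -q 128 -o 4096 -w randread -t 300 \
    -r 'trtype:RDMA adrfam:IPv4 traddr:192.168.100.2 trsvcid:4420'
```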
Initiator-side saturation
- The RDMA initiator hit approximately 5.46M 4KiB random-read IOPS with 4 cores, which the report says saturated the 200GbE link.
- 4KiB random write rose from 1.3M IOPS on one core to roughly 3.0M on two cores, then peaked near 5.4M with 8 cores.
- 70/30 random read/write reached 6.2M IOPS with 4 initiator cores before gains became non-linear.
That distinction matters operationally. You can have a brilliantly tuned target and still leave millions of IOPS on the floor if the initiator side is underprovisioned, poorly pinned, or bandwidth-bound long before the SSDs are busy.
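The initiator-side experiment is the mirror image: hold the target configuration fixed and grow the cores the benchmark tool may use. A sketch under the assumption that your spdk_nvme_perf build exposes a -c core-mask option (verify against its --help output); masks and runtime are examples:

```bash
# Grow the initiator from 1 to 4 cores while the target stays fixed.
# (-c as the core mask is an assumption to confirm against your build.)
spdk_nvme_perf -q 128 -o 4096 -w randread -t 300 -c 0x1 \
    -r 'trtype:RDMA adrfam:IPv4 traddr:192.168.100.2 trsvcid:4420'
spdk_nvme_perf -q 128 -o 4096 -w randread -t 300 -c 0xF \
    -r 'trtype:RDMA adrfam:IPv4 traddr:192.168.100.2 trsvcid:4420'

# Watch link utilization while scaling: at roughly 5.46M 4KiB IOPS a 200GbE link
# becomes the ceiling, so extra initiator cores stop helping before SSDs are busy.
```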
Kernel versus kernel-bypass efficiency
- In the same report, the SPDK NVMe-oF target showed up to 5.09x higher IOPS/core than the Linux kernel NVMe-oF target.
- The SPDK target cut round-trip average I/O latency by up to 2.26 us versus the Linux kernel target, or about 11.4% of the kernel target's software overhead.
- The SPDK initiator reduced software overhead by up to 4.94 us versus the Linux kernel initiator, about 28.25% of the kernel initiator's overhead.
Those are the numbers that executives and platform architects should care about. Absolute IOPS is flashy. IOPS per core decides consolidation ratios, rack density, and whether a storage service spends its money on flash, on CPUs, or on the power budget required to keep polling all day.
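To produce the kernel-side numbers in a comparable way, connect the same subsystem with the in-kernel initiator and drive it with fio over libaio, pinning jobs to known cores so IOPS per core is measured rather than guessed. A minimal sketch; the device path fio ends up using depends on what the connect enumerates on your host:

```bash
# Connect the Linux kernel NVMe-oF initiator to the same RDMA subsystem.
sudo nvme connect -t rdma -a 192.168.100.2 -s 4420 -n nqn.2016-06.io.spdk:cnode1

# Drive 4KiB random reads through libaio with jobs pinned to known cores, so
# IOPS divided by cores-used is directly comparable to the SPDK runs.
sudo fio --name=krnl-randread --filename=/dev/nvme1n1 --direct=1 \
    --ioengine=libaio --rw=randread --bs=4k --iodepth=128 \
    --numjobs=4 --cpus_allowed=0-3 --time_based --runtime=300 --group_reporting
```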
Strategic Impact
The strategic value of kernel-bypass storage is not that every workload suddenly needs 10M IOPS. It is that once storage software stops wasting cores on the fast path, the economics of a fabric-connected flash tier change.
Why this matters beyond synthetic numbers
- Higher IOPS/core means fewer CPUs reserved for the same storage SLA.
- Lower software overhead gives more room for replication, compression, encryption, or tenant isolation before latency budgets break.
- More predictable per-thread ownership makes performance debugging easier than in heavily shared kernel paths.
- Target and initiator tuning can be reasoned about explicitly instead of inferred from scheduler behavior.
This is why hyperscalers, appliance vendors, and performance-sensitive private clouds keep converging on similar patterns: user-space drivers, explicit polling domains, and fabrics that preserve locality as far as possible.
When to choose kernel-bypass NVMe-oF
- Choose it when your bottleneck is CPU efficiency, not feature breadth.
- Choose it when you can dedicate cores to storage polling and pin them cleanly by NUMA node.
- Choose it when tail latency and aggregate throughput matter more than minimizing idle power.
- Choose it when your engineering team can own benchmarking, tracing, and operational guardrails.
When to choose Linux kernel NVMe-oF
- Choose it when operational simplicity is worth more than maximum IOPS/core.
- Choose it when workloads are bursty enough that interrupt-driven behavior is a better fit than always-on polling.
- Choose it when fleet tooling, packaging, and support assumptions already center on the kernel stack.
- Choose it when absolute performance is sufficient and the cost of specialized tuning is not justified.
The practical takeaway is not ideological. If your target service is still below saturation and dominated by management complexity, the kernel can be the right answer. If your platform is burning sockets just to keep up with queue traffic, kernel-bypass stops being an optimization and becomes the correct architecture.
Road Ahead
The road ahead for NVMe-oF in 2026 is about making the high-performance path less brittle. The big throughput wins are already visible. The next wave is about selective flexibility without surrendering the fast path.
- SPDK v26.01 LTS bringing NVMe 2.0 target support means performance stacks no longer need to lag protocol capability as severely.
- RDMA interrupt support suggests a more nuanced balance between raw polling performance and better behavior under uneven traffic.
- TCP interrupt mode from SPDK v24.09 points in the same direction for deployments that cannot standardize on RDMA everywhere.
- Observability is becoming first-class: tracepoints, JSON-RPC control, and repeatable published reports matter almost as much as another half-million IOPS (see the sketch after this list).
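A taste of what JSON-RPC-first observability looks like in practice; these are standard SPDK RPCs whose JSON output can feed dashboards or regression checks:

```bash
# Which reactors (polling cores) exist, and busy-versus-idle cycle accounting
# for the threads running on them: the first place IOPS-per-core regressions show up.
scripts/rpc.py framework_get_reactors
scripts/rpc.py thread_get_stats

# NVMe-oF target statistics over the same JSON-RPC channel.
scripts/rpc.py nvmf_get_stats
```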
The premium lesson is straightforward. Hitting 10M+ IOPS on NVMe-over-Fabrics is no longer a stunt number reserved for conference slides. It is a documented outcome of disciplined queue ownership, poll-mode design, and tight NUMA-aware profiling. Teams that treat those as software architecture problems, not just tuning knobs, are the ones that will carry lab-grade storage performance into production through the rest of 2026.
Frequently Asked Questions
What is a kernel-bypass storage driver in NVMe-oF?
A user-space, polled-mode driver such as SPDK's that maps device or transport queues directly into the application, discovers completions by polling, and skips the kernel block layer, syscalls, and interrupt handling on the fast path.
How many CPU cores does it take to reach 10M NVMe-oF IOPS?
In the SPDK 24.05 RDMA performance report, the target reached 10.94M 4KiB random-read IOPS with 8 dedicated cores, after scaling from 1.59M IOPS on one core and 7.98M on four.
Is SPDK always faster than Linux kernel NVMe-oF?
No. It wins decisively on IOPS per core and software overhead, but the kernel stack is often the better fit for bursty workloads, idle-power sensitivity, and fleets whose tooling and support assumptions center on the kernel path.
How should I profile NVMe-oF latency without distorting the result?
Keep the measurement stack simple, and separate target scaling from initiator scaling. Use SPDK tracepoints, controlled fio or spdk_nvme_perf jobs, fixed NUMA placement, and known SSD write state so the benchmark reflects software behavior rather than hidden media or scheduler noise.