DPDK Micro-Burst Latency Tuning for HFT Systems [2026]
Bottom Line
In DPDK-based trading paths, micro-burst latency usually falls when you stop optimizing only for peak throughput. Smaller software bursts, NUMA-local memory, isolated polling cores, and VFIO-backed NIC binding remove the most common queueing spikes first.
Key Takeaways
- DPDK 25.11 docs still show --burst=32 as the default in testpmd.
- Use vfio-pci by default; DPDK recommends it over UIO for bound ports.
- Reserve 1 GB hugepages for 64-bit apps when the platform supports them.
- Keep NICs, hugepage memory, and polling cores on the same NUMA socket.
- Start tuning with smaller bursts like 4 or 8, then re-measure tail latency.
Micro-burst latency is where many high-frequency trading stacks lose determinism: not during steady-state throughput, but when a few dozen packets arrive faster than a busy core can drain them. DPDK gives you the tools to remove that jitter, but the defaults are not optimized for HFT tail latency. This tutorial shows a practical, source-backed path to reduce burst amplification using current DPDK 25.11 guidance, vfio-pci, hugepages, CPU isolation, NUMA locality, and smaller poll bursts.
- Default matters: testpmd still defaults to --burst=32, which is often too throughput-oriented for HFT tails.
- Placement matters: DPDK recommends NUMA-aware allocation and same-socket NIC/core placement.
- Isolation matters: Linux can still schedule timers, IRQs, and RCU work onto your polling cores unless you isolate them.
- Memory matters: DPDK recommends 1 GB hugepages for 64-bit apps when supported.
Prerequisites
- Linux host with kernel >= 5.4.
- A DPDK-supported NIC dedicated to the trading or market-data path.
- BIOS access to enable IOMMU or VT-d when using vfio-pci.
- Build tools required by DPDK: GCC 8+ or Clang 7+, Python 3.6+, Meson 0.57+, ninja, and NUMA development libraries.
- Root access for hugepage reservation and NIC rebinding.
- A baseline replay or market-data generator so you can compare p99 and p99.9 latency before and after each change.
Bottom Line
For HFT, reduce queueing before you chase peak packets per second. The fastest fix is usually a combination of vfio-pci, isolated polling cores, NUMA-local memory, and a smaller --burst setting than the default.
Implementation Steps
1. Reserve hugepages and isolate the right cores
DPDK's Linux guide states that hugepages are required for large packet-buffer pools, and recommends 1 GB hugepages for 64-bit applications when the platform supports them. Its core-isolation section also calls out isolcpus, nohz_full, and irqaffinity to keep scheduler noise off polling cores.
- Edit your kernel command line so hugepages and isolated cores are configured at boot.
- Keep core 0 out of the polling set; DPDK explicitly notes it cannot be fully isolated.
- Pin only your low-latency threads to those isolated CPUs.
# /etc/default/grub (intel_iommu=on is for Intel hosts; AMD enables its IOMMU by default)
GRUB_CMDLINE_LINUX="default_hugepagesz=1G hugepagesz=1G hugepages=4 isolcpus=2,4,6 nohz_full=2,4,6 irqaffinity=0,1,3,5,7 intel_iommu=on iommu=pt"
sudo update-grub
sudo reboot
# After reboot
grep -E 'HugePages_Total|Hugepagesize' /proc/meminfo

If your platform does not support 1 GB pages, reserve 2 MB pages instead. On multi-socket hosts, reserve pages on the same NUMA node as the NIC you will use.
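Isolating cores only helps if your low-latency threads actually land on them. EAL pins lcores for you via --lcores, but any plain pthread you spawn must be placed explicitly, because isolcpus removes those CPUs from the default scheduling domain. Here is a minimal sketch using Linux's pthread_setaffinity_np, assuming the isolcpus=2,4,6 layout above:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to one isolated CPU. With the grub example
 * above, valid targets are 2, 4, and 6. Returns 0 on success. */
static int pin_to_isolated_cpu(int cpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0) {
                fprintf(stderr, "failed to pin thread to CPU %d\n", cpu);
                return -1;
        }
        return 0;
}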
2. Build DPDK cleanly and bind the NIC to vfio-pci
DPDK's Linux driver guide recommends vfio-pci for DPDK-bound ports in all cases. Build a known-good version first, then move the NIC away from the kernel network driver.
tar xJf dpdk-25.11.1.tar.xz
cd dpdk-25.11.1
meson setup -Dplatform=native build
cd build
ninja
sudo meson install
sudo ldconfig
sudo modprobe vfio-pci
sudo ../usertools/dpdk-devbind.py --status
sudo ../usertools/dpdk-devbind.py --bind=vfio-pci 0000:82:00.0

Use -Dplatform=native only on hosts where you control the deployment target. If you are packaging a portable artifact, switch to -Dplatform=generic.
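Before tuning anything else, confirm DPDK can actually see the rebound port. The fragment below is a minimal check of our own, not a DPDK example; pass the usual EAL arguments (including -a 0000:82:00.0) when running it:

#include <stdio.h>
#include <rte_eal.h>
#include <rte_ethdev.h>

int main(int argc, char **argv)
{
        uint16_t port_id;

        /* EAL consumes its own flags first, e.g. -a 0000:82:00.0. */
        if (rte_eal_init(argc, argv) < 0) {
                fprintf(stderr, "EAL init failed\n");
                return 1;
        }
        printf("usable ports: %u\n", rte_eth_dev_count_avail());
        RTE_ETH_FOREACH_DEV(port_id) {
                struct rte_eth_dev_info info;

                if (rte_eth_dev_info_get(port_id, &info) == 0)
                        printf("port %u: driver=%s\n", port_id,
                               info.driver_name);
        }
        rte_eal_cleanup();
        return 0;
}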
3. Keep NIC, memory, and lcores on one NUMA socket
DPDK's Intel platform performance guide is blunt here: to get the best performance, ensure the cores and NIC are on the same socket. For latency, this is not optional. Remote memory and cross-socket traffic add variance even when average throughput still looks fine.
# Find the NIC and its NUMA node
lspci -nn | grep Eth
cat /sys/bus/pci/devices/0000:82:00.0/numa_node
# Example DPDK launch pattern with explicit lcores and NUMA memory
sudo dpdk-testpmd \
--lcores='1@2,2@4,3@6' \
--main-lcore=1 \
--numa-mem 1024,0 \
-a 0000:82:00.0 \
-- \
--port-topology=loop \
--forward-mode=io \
--rxq=3 --txq=3

The exact socket map differs by server, so verify it instead of assuming PCI slot placement. NUMA mismatches are one of the easiest ways to create bursty tail spikes that disappear under light load and reappear in production.
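The same rule applies inside your own application: allocate the mbuf pool on the NIC's NUMA node, not wherever the main thread happens to run. A minimal sketch; the pool name and sizing below are illustrative choices, not DPDK recommendations:

#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>

/* Create the packet-buffer pool on the same NUMA node as the port,
 * falling back to the caller's node if the PMD does not report one. */
static struct rte_mempool *make_local_pool(uint16_t port_id)
{
        int node = rte_eth_dev_socket_id(port_id);

        if (node == SOCKET_ID_ANY)
                node = (int)rte_socket_id();

        /* 8191 mbufs with a 256-mbuf per-lcore cache: example sizes. */
        return rte_pktmbuf_pool_create("rx_pool", 8191, 256, 0,
                                       RTE_MBUF_DEFAULT_BUF_SIZE, node);
}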
4. Reduce software burst depth before touching the strategy code
This is the high-value tuning step. In current DPDK documentation, testpmd defaults to --burst=32, --rxd=128, and --txd=512. Those defaults are sensible for general forwarding, but HFT paths often prefer lower queue occupancy over maximum bulk efficiency. A smaller burst drains packets sooner, which usually improves p99 and p99.9 latency at the cost of some peak throughput.
# Baseline with documented defaults
sudo dpdk-testpmd \
--lcores='1@2,2@4,3@6' \
--main-lcore=1 \
--numa-mem 1024,0 \
-a 0000:82:00.0 \
-- \
--forward-mode=io \
--rxq=3 --txq=3 \
--rxd=128 --txd=512 \
--burst=32 --stats-period=1
# Latency-oriented run
sudo dpdk-testpmd \
--lcores='1@2,2@4,3@6' \
--main-lcore=1 \
--numa-mem 1024,0 \
-a 0000:82:00.0 \
-- \
--forward-mode=io \
--rxq=3 --txq=3 \
--rxd=64 --txd=128 \
--burst=4 --stats-period=1

For your own application, the same principle usually means capping the receive loop to a small batch and transmitting sooner instead of waiting to accumulate work:
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_prefetch.h>

enum { RX_BURST = 4 };

struct rte_mbuf *pkts[RX_BURST];
/* Cap the batch so queued packets drain sooner. */
const uint16_t n = rte_eth_rx_burst(port_id, queue_id, pkts, RX_BURST);
for (uint16_t i = 0; i < n; i++) {
        rte_prefetch0(rte_pktmbuf_mtod(pkts[i], void *));
        /* parse, decide, enqueue or send immediately */
}
if (n) {
        /* Transmit right away; free anything the TX ring refused so a
         * full ring does not leak mbufs. */
        uint16_t sent = rte_eth_tx_burst(port_id, txq, pkts, n);
        while (sent < n)
                rte_pktmbuf_free(pkts[sent++]);
}

This is an engineering inference rather than a quoted DPDK recommendation, but it follows directly from how software queueing amplifies micro-bursts: waiting to fill a larger batch improves amortization and hurts time-to-first-packet.
Verification
Do not accept a change just because CPU utilization dropped or throughput stayed flat. The point is lower tail latency under bursty load.
- Confirm hugepages are available and correctly sized.
- Confirm the NIC is bound to vfio-pci.
- Confirm the NIC's NUMA node matches the cores and memory you allocated.
- Replay the same burst profile for each run: 32, 16, 8, 4.
- Record p50, p99, p99.9, drops, and CPU headroom for each profile.
# Hugepages
grep -E 'HugePages_Total|Hugepagesize' /proc/meminfo
# NIC binding
sudo ../usertools/dpdk-devbind.py --status
# NUMA locality
cat /sys/bus/pci/devices/0000:82:00.0/numa_node

Expected output looks like this:
HugePages_Total: 4
Hugepagesize: 1048576 kB
0000:82:00.0 'Ethernet controller ...' drv=vfio-pci unused=ixgbe
0

If your tuning worked, you should see lower tail latency during burst replay with no unexpected increase in drops. Throughput may stay the same or dip slightly; for HFT, that trade is often correct.
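To make the p50/p99/p99.9 comparison concrete across the 32/16/8/4 profiles, a nearest-rank percentile helper over recorded samples is enough. This is a generic sketch, not a DPDK facility:

#include <stdint.h>
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b)
{
        const uint64_t x = *(const uint64_t *)a;
        const uint64_t y = *(const uint64_t *)b;

        return (x > y) - (x < y);
}

/* Nearest-rank percentile over latency samples (e.g. nanoseconds);
 * pct is 50.0, 99.0, 99.9, ... The array is sorted in place. */
static uint64_t percentile(uint64_t *samples, size_t n, double pct)
{
        size_t idx = (size_t)((pct / 100.0) * (double)(n - 1) + 0.5);

        qsort(samples, n, sizeof(*samples), cmp_u64);
        return samples[idx < n ? idx : n - 1];
}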
Troubleshooting Top 3
- Problem: DPDK cannot use the NIC. Cause: The interface is still attached to the kernel driver or the wrong IOMMU group is partially bound. Fix: Recheck dpdk-devbind.py --status, load vfio-pci, and verify all required devices in the IOMMU group are handled correctly.
- Problem: Latency is unstable even after reducing --burst. Cause: Linux is still scheduling work on polling cores. Fix: Revisit isolcpus, nohz_full, and irqaffinity, and keep core 0 out of the dataplane.
- Problem: Average latency improves but p99.9 gets worse under load. Cause: Descriptor counts are still too deep, or NUMA locality is broken. Fix: Try smaller --rxd and --txd values, and verify NIC, lcores, and hugepage memory all sit on the same socket.
What's Next
- Port the tuned burst size and queue depths from testpmd into your strategy gateway or market-data process.
- Add per-lcore latency histograms around rte_eth_rx_burst() and your order-path handoff so you can see where burst amplification still happens (a sketch follows this list).
- Benchmark with production-like packet sizes and replay shapes, not only synthetic line-rate tests.
- If power saving is enabled in BIOS or the OS, move the host to a performance-oriented profile before doing final latency sign-off.
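For the histogram bullet above, one cheap pattern is a per-lcore log2 histogram of TSC cycles spent per receive burst. The wrapper below is an illustrative sketch, not a DPDK facility; the bucket count and sampling point are assumptions:

#include <rte_cycles.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>

#define HIST_BUCKETS 32

/* One histogram per lcore avoids cross-core cache-line contention. */
static uint64_t rx_hist[RTE_MAX_LCORE][HIST_BUCKETS];

static inline uint16_t timed_rx_burst(uint16_t port, uint16_t queue,
                                      struct rte_mbuf **pkts, uint16_t max)
{
        const uint64_t t0 = rte_rdtsc();
        const uint16_t n = rte_eth_rx_burst(port, queue, pkts, max);
        const uint64_t cycles = rte_rdtsc() - t0;

        if (n > 0) {
                /* log2 bucketing: fixed memory, no hot-path division. */
                unsigned b = cycles ? 63u - (unsigned)__builtin_clzll(cycles) : 0u;

                if (b >= HIST_BUCKETS)
                        b = HIST_BUCKETS - 1;
                rx_hist[rte_lcore_id()][b]++;
        }
        return n;
}

Dump and reset the histograms off the hot path, and convert cycles to time with rte_get_tsc_hz() when reporting.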
Frequently Asked Questions
What DPDK setting usually affects micro-burst latency first?
--burst. testpmd still defaults to --burst=32, which often favors throughput over HFT tail latency. Test 4, 8, and 16 under the same replay profile and compare p99 and p99.9.
Should I use vfio-pci or UIO for low-latency DPDK?
Use vfio-pci. DPDK's Linux driver guide recommends it over UIO for bound ports, and it works through the IOMMU for safer device access.
Do hugepages reduce latency or just improve throughput?
Both. DPDK requires hugepages for its large packet-buffer pools, and fewer TLB misses mean fewer unpredictable stalls on the hot path. Prefer 1 GB pages for 64-bit apps when the platform supports them.
Why does NUMA alignment matter so much in trading systems?
Remote memory and cross-socket traffic add variance even when average throughput looks fine. Keeping the NIC, hugepage memory, and polling cores on one socket removes a whole class of bursty tail spikes.