DPDK Micro-Burst Latency Tuning for HFT Systems [2026]
Bottom Line
In DPDK-based trading paths, micro-burst latency usually falls when you stop optimizing only for peak throughput. Smaller software bursts, NUMA-local memory, isolated polling cores, and VFIO-backed NIC binding remove the most common queueing spikes first.
Key Takeaways
- DPDK 25.11 docs still show --burst=32 as the default in testpmd.
- Use vfio-pci by default; DPDK recommends it over UIO for bound ports.
- Reserve 1 GB hugepages for 64-bit apps when the platform supports them.
- Keep NICs, hugepage memory, and polling cores on the same NUMA socket.
- Start tuning with smaller bursts like 4 or 8, then re-measure tail latency.
Micro-burst latency is where many high-frequency trading stacks lose determinism: not during steady-state throughput, but when a few dozen packets arrive faster than a busy core can drain them. DPDK gives you the tools to remove that jitter, but the defaults are not optimized for HFT tail latency. This tutorial shows a practical, source-backed path to reduce burst amplification using current DPDK 25.11 guidance, vfio-pci, hugepages, CPU isolation, NUMA locality, and smaller poll bursts.
- Default matters: testpmd still defaults to --burst=32, which is often too throughput-oriented for HFT tails.
- Placement matters: DPDK recommends NUMA-aware allocation and same-socket NIC/core placement.
- Isolation matters: Linux can still schedule timers, IRQs, and RCU work onto your polling cores unless you isolate them.
- Memory matters: DPDK recommends 1 GB hugepages for 64-bit apps when supported.
Prerequisites
- Linux host with kernel >= 5.4.
- A DPDK-supported NIC dedicated to the trading or market-data path.
- BIOS access to enable IOMMU or VT-d when using vfio-pci.
- Build tools required by DPDK: GCC 8+ or Clang 7+, Python 3.6+, Meson 0.57+, ninja, and NUMA development libraries.
- Root access for hugepage reservation and NIC rebinding.
- A baseline replay or market-data generator so you can compare p99 and p99.9 latency before and after each change.
Bottom Line
For HFT, reduce queueing before you chase peak packets per second. The fastest fix is usually a combination of vfio-pci, isolated polling cores, NUMA-local memory, and a smaller --burst setting than the default.
Implementation Steps
1. Reserve hugepages and isolate the right cores
DPDK's Linux guide states that hugepages are required for large packet-buffer pools, and recommends 1 GB hugepages for 64-bit applications when the platform supports them. Its core-isolation section also calls out isolcpus, nohz_full, and irqaffinity to keep scheduler noise off polling cores.
- Edit your kernel command line so hugepages and isolated cores are configured at boot.
- Keep core 0 out of the polling set; DPDK explicitly notes it cannot be fully isolated.
- Pin only your low-latency threads to those isolated CPUs.
# /etc/default/grub (intel_iommu=on is for Intel hosts; AMD enables its IOMMU by default)
GRUB_CMDLINE_LINUX="default_hugepagesz=1G hugepagesz=1G hugepages=4 isolcpus=2,4,6 nohz_full=2,4,6 irqaffinity=0,1,3,5,7 intel_iommu=on iommu=pt"
sudo update-grub
sudo reboot
# After reboot
grep -E 'HugePages_Total|Hugepagesize' /proc/meminfo

If your platform does not support 1 GB pages, reserve 2 MB pages instead. On multi-socket hosts, reserve pages on the same NUMA node as the NIC you will use.
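Isolating cores only helps if your low-latency threads actually land on them. EAL pins lcores for you via --lcores, but any plain pthread you spawn must be placed explicitly, because isolcpus removes those CPUs from the default scheduling domain. Here is a minimal sketch using Linux's pthread_setaffinity_np, assuming the isolcpus=2,4,6 layout above:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to one isolated CPU. With the grub example
 * above, valid targets are 2, 4, and 6. Returns 0 on success. */
static int pin_to_isolated_cpu(int cpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0) {
                fprintf(stderr, "failed to pin thread to CPU %d\n", cpu);
                return -1;
        }
        return 0;
}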
2. Build DPDK cleanly and bind the NIC to vfio-pci
DPDK's Linux driver guide recommends vfio-pci for DPDK-bound ports in all cases. Build a known-good version first, then move the NIC away from the kernel network driver.
tar xJf dpdk-25.11.1.tar.xz
cd dpdk-25.11.1
meson setup -Dplatform=native build
cd build
ninja
sudo meson install
sudo ldconfig
sudo modprobe vfio-pci
sudo ../usertools/dpdk-devbind.py --status
sudo ../usertools/dpdk-devbind.py --bind=vfio-pci 0000:82:00.0

Use -Dplatform=native only on hosts where you control the deployment target. If you are packaging a portable artifact, switch to -Dplatform=generic.
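Before tuning anything else, confirm DPDK can actually see the rebound port. The fragment below is a minimal check of our own, not a DPDK example; pass the usual EAL arguments (including -a 0000:82:00.0) when running it:

#include <stdio.h>
#include <rte_eal.h>
#include <rte_ethdev.h>

int main(int argc, char **argv)
{
        uint16_t port_id;

        /* EAL consumes its own flags first, e.g. -a 0000:82:00.0. */
        if (rte_eal_init(argc, argv) < 0) {
                fprintf(stderr, "EAL init failed\n");
                return 1;
        }
        printf("usable ports: %u\n", rte_eth_dev_count_avail());
        RTE_ETH_FOREACH_DEV(port_id) {
                struct rte_eth_dev_info info;

                if (rte_eth_dev_info_get(port_id, &info) == 0)
                        printf("port %u: driver=%s\n", port_id,
                               info.driver_name);
        }
        rte_eal_cleanup();
        return 0;
}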
3. Keep NIC, memory, and lcores on one NUMA socket
DPDK's Intel platform performance guide is blunt here: to get the best performance, ensure the cores and NIC are on the same socket. For latency, this is not optional. Remote memory and cross-socket traffic add variance even when average throughput still looks fine.
# Find the NIC and its NUMA node
lspci -nn | grep Eth
cat /sys/bus/pci/devices/0000:82:00.0/numa_node
# Example DPDK launch pattern with explicit lcores and NUMA memory
sudo dpdk-testpmd \
--lcores='1@2,2@4,3@6' \
--main-lcore=1 \
--numa-mem 1024,0 \
-a 0000:82:00.0 \
-- \
--port-topology=loop \
--forward-mode=io \
--rxq=3 --txq=3

The exact socket map differs by server, so verify it instead of assuming PCI slot placement. NUMA mismatches are one of the easiest ways to create bursty tail spikes that disappear under light load and reappear in production.
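The same rule applies inside your own application: allocate the mbuf pool on the NIC's NUMA node, not wherever the main thread happens to run. A minimal sketch; the pool name and sizing below are illustrative choices, not DPDK recommendations:

#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>

/* Create the packet-buffer pool on the same NUMA node as the port,
 * falling back to the caller's node if the PMD does not report one. */
static struct rte_mempool *make_local_pool(uint16_t port_id)
{
        int node = rte_eth_dev_socket_id(port_id);

        if (node == SOCKET_ID_ANY)
                node = (int)rte_socket_id();

        /* 8191 mbufs with a 256-mbuf per-lcore cache: example sizes. */
        return rte_pktmbuf_pool_create("rx_pool", 8191, 256, 0,
                                       RTE_MBUF_DEFAULT_BUF_SIZE, node);
}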
4. Reduce software burst depth before touching the strategy code
This is the high-value tuning step. In current DPDK documentation, testpmd defaults to --burst=32, --rxd=128, and --txd=512. Those defaults are sensible for general forwarding, but HFT paths often prefer lower queue occupancy over maximum bulk efficiency. A smaller burst drains packets sooner, which usually improves p99 and p99.9 latency at the cost of some peak throughput.
# Baseline with documented defaults
sudo dpdk-testpmd \
--lcores='1@2,2@4,3@6' \
--main-lcore=1 \
--numa-mem 1024,0 \
-a 0000:82:00.0 \
-- \
--forward-mode=io \
--rxq=3 --txq=3 \
--rxd=128 --txd=512 \
--burst=32 --stats-period=1
# Latency-oriented run
sudo dpdk-testpmd \
--lcores='1@2,2@4,3@6' \
--main-lcore=1 \
--numa-mem 1024,0 \
-a 0000:82:00.0 \
-- \
--forward-mode=io \
--rxq=3 --txq=3 \
--rxd=64 --txd=128 \
--burst=4 --stats-period=1

For your own application, the same principle usually means capping the receive loop to a small batch and transmitting sooner instead of waiting to accumulate work:
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_prefetch.h>

enum { RX_BURST = 4 };

struct rte_mbuf *pkts[RX_BURST];
/* Cap the batch so queued packets drain sooner. */
const uint16_t n = rte_eth_rx_burst(port_id, queue_id, pkts, RX_BURST);
for (uint16_t i = 0; i < n; i++) {
        rte_prefetch0(rte_pktmbuf_mtod(pkts[i], void *));
        /* parse, decide, enqueue or send immediately */
}
if (n) {
        /* Transmit right away; free anything the TX ring refused so a
         * full ring does not leak mbufs. */
        uint16_t sent = rte_eth_tx_burst(port_id, txq, pkts, n);
        while (sent < n)
                rte_pktmbuf_free(pkts[sent++]);
}

This is an engineering inference rather than a quoted DPDK recommendation, but it follows directly from how software queueing amplifies micro-bursts: waiting to fill a larger batch improves amortization and hurts time-to-first-packet.
Verification
Do not accept a change just because CPU utilization dropped or throughput stayed flat. The point is lower tail latency under bursty load.
- Confirm hugepages are available and correctly sized.
- Confirm the NIC is bound to vfio-pci.
- Confirm the NIC's NUMA node matches the cores and memory you allocated.
- Replay the same burst profile for each run: 32, 16, 8, 4.
- Record p50, p99, p99.9, drops, and CPU headroom for each profile.
# Hugepages
grep -E 'HugePages_Total|Hugepagesize' /proc/meminfo
# NIC binding
sudo ../usertools/dpdk-devbind.py --status
# NUMA locality
cat /sys/bus/pci/devices/0000:82:00.0/numa_node

Expected output looks like this:
HugePages_Total: 4
Hugepagesize: 1048576 kB
0000:82:00.0 'Ethernet controller ...' drv=vfio-pci unused=ixgbe
0

If your tuning worked, you should see lower tail latency during burst replay with no unexpected increase in drops. Throughput may stay the same or dip slightly; for HFT, that trade is often correct.
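To make the p50/p99/p99.9 comparison concrete across the 32/16/8/4 profiles, a nearest-rank percentile helper over recorded samples is enough. This is a generic sketch, not a DPDK facility:

#include <stdint.h>
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b)
{
        const uint64_t x = *(const uint64_t *)a;
        const uint64_t y = *(const uint64_t *)b;

        return (x > y) - (x < y);
}

/* Nearest-rank percentile over latency samples (e.g. nanoseconds);
 * pct is 50.0, 99.0, 99.9, ... The array is sorted in place. */
static uint64_t percentile(uint64_t *samples, size_t n, double pct)
{
        size_t idx = (size_t)((pct / 100.0) * (double)(n - 1) + 0.5);

        qsort(samples, n, sizeof(*samples), cmp_u64);
        return samples[idx < n ? idx : n - 1];
}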
Troubleshooting Top 3
- Problem: DPDK cannot use the NIC. Cause: The interface is still attached to the kernel driver or the wrong IOMMU group is partially bound. Fix: Recheck dpdk-devbind.py --status, load vfio-pci, and verify all required devices in the IOMMU group are handled correctly.
- Problem: Latency is unstable even after reducing --burst. Cause: Linux is still scheduling work on polling cores. Fix: Revisit isolcpus, nohz_full, and irqaffinity, and keep core 0 out of the dataplane.
- Problem: Average latency improves but p99.9 gets worse under load. Cause: Descriptor counts are still too deep, or NUMA locality is broken. Fix: Try smaller --rxd and --txd values, and verify NIC, lcores, and hugepage memory all sit on the same socket.
What's Next
- Port the tuned burst size and queue depths from testpmd into your strategy gateway or market-data process.
- Add per-lcore latency histograms around rte_eth_rx_burst() and your order-path handoff so you can see where burst amplification still happens (a sketch follows this list).
- Benchmark with production-like packet sizes and replay shapes, not only synthetic line-rate tests.
- If power saving is enabled in BIOS or the OS, move the host to a performance-oriented profile before doing final latency sign-off.
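For the histogram bullet above, one cheap pattern is a per-lcore log2 histogram of TSC cycles spent per receive burst. The wrapper below is an illustrative sketch, not a DPDK facility; the bucket count and sampling point are assumptions:

#include <rte_cycles.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>

#define HIST_BUCKETS 32

/* One histogram per lcore avoids cross-core cache-line contention. */
static uint64_t rx_hist[RTE_MAX_LCORE][HIST_BUCKETS];

static inline uint16_t timed_rx_burst(uint16_t port, uint16_t queue,
                                      struct rte_mbuf **pkts, uint16_t max)
{
        const uint64_t t0 = rte_rdtsc();
        const uint16_t n = rte_eth_rx_burst(port, queue, pkts, max);
        const uint64_t cycles = rte_rdtsc() - t0;

        if (n > 0) {
                /* log2 bucketing: fixed memory, no hot-path division. */
                unsigned b = cycles ? 63u - (unsigned)__builtin_clzll(cycles) : 0u;

                if (b >= HIST_BUCKETS)
                        b = HIST_BUCKETS - 1;
                rx_hist[rte_lcore_id()][b]++;
        }
        return n;
}

Dump and reset the histograms off the hot path, and convert cycles to time with rte_get_tsc_hz() when reporting.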
Frequently Asked Questions
What DPDK setting usually affects micro-burst latency first?
--burst. testpmd still defaults to --burst=32, which often favors throughput over HFT tail latency. Test 4, 8, and 16 under the same replay profile and compare p99 and p99.9.
Should I use vfio-pci or UIO for low-latency DPDK?
Use vfio-pci. DPDK's Linux driver guide recommends it over UIO for bound ports, and it works through the IOMMU for safer device access.
Do hugepages reduce latency or just improve throughput?
Both. DPDK requires hugepages for its large packet-buffer pools, and fewer TLB misses mean fewer unpredictable stalls on the hot path. Prefer 1 GB pages for 64-bit apps when the platform supports them.
Why does NUMA alignment matter so much in trading systems?
Remote memory and cross-socket traffic add variance even when average throughput looks fine. Keeping the NIC, hugepage memory, and polling cores on one socket removes a whole class of bursty tail spikes.