System Architecture

WASM SIMD for HFT: Optimizing Hot Paths [Deep Dive]

Dillip Chowdary
Tech Entrepreneur & Innovator · May 13, 2026 · 9 min read

Bottom Line

WASM SIMD helps only when you isolate the arithmetic loop your trading stack executes millions of times and vectorize that loop deliberately. Start with a scalar baseline, ship a dedicated SIMD build, and require exact output parity before you trust any latency gain.

Key Takeaways

  • Use -Ctarget-feature=+simd128 or #[target_feature(enable = "simd128")]; Wasm SIMD is not runtime-detected.
  • Vectorize pure math first: pricing, markout, fee, and risk loops benefit more than branch-heavy order logic.
  • Keep a scalar tail path for non-multiple-of-4 batches and compare checksums before benchmarking.
  • For repeated runs, precompile with wasmtime compile to cut startup overhead.

WebAssembly 2.0 standardized 128-bit SIMD, which makes Wasm a serious option for tightly scoped high-frequency trading hot paths such as markout, fee, and risk-vector updates. The win is not “rewrite the matching engine in Wasm.” The win is compiling one arithmetic-heavy kernel into a portable module, proving it matches scalar output exactly, and then measuring whether your runtime actually benefits from vector lanes under production-like batch sizes.

Prerequisites

Before You Start

  • Rust installed with rustup.
  • The wasm32-wasip1 target added locally.
  • The wasmtime CLI installed for execution and AOT compilation.
  • A hot loop that is mostly arithmetic on dense arrays, not I/O or branch-heavy control flow.
  • A repeatable input set so scalar and SIMD outputs can be compared exactly.

If you want cleaner snippets for team docs or internal runbooks, TechBytes’ Code Formatter is a useful last pass before publishing shell and Rust examples.

Step 1: Pick the Right Loop

What to vectorize first

  1. Choose a loop that touches large contiguous slices such as quantities, basis-point prices, fees, or per-symbol risk deltas.
  2. Prefer fixed-width math with simple adds, subtracts, multiplies, mins, or maxes.
  3. Avoid loops dominated by branching, pointer chasing, string handling, or host calls.
  4. Measure the loop in isolation before you move it into Wasm.

For HFT systems, a common first candidate is a post-trade markout or risk update pass that multiplies quantities by basis-point prices and adds fixed per-fill adjustments. That is exactly the kind of dense, lane-friendly work WebAssembly SIMD handles well. Rust’s Wasm intrinsics map directly to the v128 type and SIMD instructions, and the Rust docs note an important deployment detail: unlike native x86_64, Wasm currently does not offer runtime SIMD detection for this path. In practice, you ship either a SIMD-enabled artifact or a non-SIMD artifact.

cargo new hft-wasm-simd
cd hft-wasm-simd
rustup target add wasm32-wasip1
cargo install wasmtime-cli

Step 2: Build a Scalar Baseline

Start with a scalar implementation and timing harness. The baseline gives you two things you cannot skip in finance: a correctness oracle and a realistic measurement floor.

use std::time::Instant;

const N: usize = 262_144;
const ITERS: usize = 200;

fn scalar_markout(qty: &[i32], px_bp: &[i32], fee_bp: &[i32], out: &mut [i32]) {
    for i in 0..qty.len() {
        out[i] = qty[i] * px_bp[i] + fee_bp[i];
    }
}

fn checksum(data: &[i32]) -> i64 {
    data.iter().map(|&x| x as i64).sum()
}

fn main() {
    let qty: Vec<i32> = (0..N as i32).map(|i| (i % 97) - 48).collect();
    let px_bp: Vec<i32> = (0..N as i32).map(|i| 10 + (i % 23)).collect();
    let fee_bp: Vec<i32> = (0..N as i32).map(|i| (i % 7) - 3).collect();
    let mut out = vec![0_i32; N];

    let start = Instant::now();
    for _ in 0..ITERS {
        scalar_markout(&qty, &px_bp, &fee_bp, &mut out);
    }
    let scalar_us = start.elapsed().as_micros();

    println!("scalar: {scalar_us} us");
    println!("checksum: {}", checksum(&out));
}

Two practical rules matter here:

  • Use deterministic synthetic data first so mismatches are obvious.
  • Keep output allocation out of the timed loop so you are measuring arithmetic, not memory churn.
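Both rules can be folded into a small harness sketch. This is illustrative, not the article's exact harness: `median_us` is a hypothetical helper name, the buffer is allocated once before timing, and the median over repeated runs is reported instead of a single total so one noisy run cannot skew the baseline.

```rust
use std::time::Instant;

// Illustrative helper: median of repeated timing samples.
fn median_us(mut samples: Vec<u128>) -> u128 {
    samples.sort_unstable();
    samples[samples.len() / 2]
}

fn main() {
    let n: usize = 4096;
    // Deterministic synthetic data, same shapes as the baseline above.
    let qty: Vec<i32> = (0..n as i32).map(|i| (i % 97) - 48).collect();
    let px_bp: Vec<i32> = (0..n as i32).map(|i| 10 + (i % 23)).collect();
    let fee_bp: Vec<i32> = (0..n as i32).map(|i| (i % 7) - 3).collect();
    // Allocated once, outside every timed run.
    let mut out = vec![0_i32; n];

    let mut samples = Vec::with_capacity(50);
    for _ in 0..50 {
        let start = Instant::now();
        for i in 0..out.len() {
            out[i] = qty[i] * px_bp[i] + fee_bp[i];
        }
        samples.push(start.elapsed().as_micros());
    }
    println!("median: {} us", median_us(samples));
}
```

Reporting a median (or p50/p99) instead of one aggregate time also makes later scalar-versus-SIMD comparisons less sensitive to scheduler noise.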

Step 3: Enable Wasm SIMD

WebAssembly SIMD is 128-bit wide, so the simplest mental model is four-lane processing for i32 and f32 workloads. Rust exposes this with core::arch::wasm32 and the v128 vector type.

Add a SIMD path

#[cfg(target_arch = "wasm32")]
use core::arch::wasm32::*;

#[cfg(target_arch = "wasm32")]
#[target_feature(enable = "simd128")]
unsafe fn simd_markout(qty: &[i32], px_bp: &[i32], fee_bp: &[i32], out: &mut [i32]) {
    let lanes = 4;
    let limit = qty.len() / lanes * lanes;
    let mut i = 0;

    while i < limit {
        let q = i32x4(qty[i], qty[i + 1], qty[i + 2], qty[i + 3]);
        let p = i32x4(px_bp[i], px_bp[i + 1], px_bp[i + 2], px_bp[i + 3]);
        let f = i32x4(fee_bp[i], fee_bp[i + 1], fee_bp[i + 2], fee_bp[i + 3]);

        let r = i32x4_add(i32x4_mul(q, p), f);

        out[i] = i32x4_extract_lane::<0>(r);
        out[i + 1] = i32x4_extract_lane::<1>(r);
        out[i + 2] = i32x4_extract_lane::<2>(r);
        out[i + 3] = i32x4_extract_lane::<3>(r);
        i += lanes;
    }

    while i < qty.len() {
        out[i] = qty[i] * px_bp[i] + fee_bp[i];
        i += 1;
    }
}

Compile the SIMD artifact

RUSTFLAGS="-Ctarget-feature=+simd128" \
cargo build --release --target wasm32-wasip1

wasmtime target/wasm32-wasip1/release/hft-wasm-simd.wasm

This is the key build detail. Function-level #[target_feature(enable = "simd128")] gives the Rust compiler permission to emit SIMD instructions for that function, while -Ctarget-feature=+simd128 enables the feature for the whole compilation unit. For a focused benchmark, using both keeps the intent obvious.

Pro tip: Keep your scalar path in the binary until parity is proven. In trading systems, a fast wrong answer is a production incident.
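One way to keep the scalar path alive is a compile-time dispatch wrapper. This is a sketch: `markout` is an illustrative wrapper name, and it assumes the `simd_markout` function from the listing above exists in wasm32 builds; on any other target the cfg strips the SIMD call, so the crate still compiles and runs scalar.

```rust
// Scalar path stays in the binary as the correctness reference.
fn scalar_markout(qty: &[i32], px_bp: &[i32], fee_bp: &[i32], out: &mut [i32]) {
    for i in 0..qty.len() {
        out[i] = qty[i] * px_bp[i] + fee_bp[i];
    }
}

// Sketch: one public entry point, dispatch decided at compile time.
pub fn markout(qty: &[i32], px_bp: &[i32], fee_bp: &[i32], out: &mut [i32]) {
    #[cfg(target_arch = "wasm32")]
    // SAFETY: simd128 is enabled for the whole wasm32 build via
    // -Ctarget-feature=+simd128, so calling the target_feature fn is sound.
    unsafe {
        simd_markout(qty, px_bp, fee_bp, out)
    };

    #[cfg(not(target_arch = "wasm32"))]
    scalar_markout(qty, px_bp, fee_bp, out);
}
```

Callers only ever see `markout`, so swapping kernels, or reverting after a parity failure, never touches strategy code.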

Wire scalar and SIMD together

let mut scalar_out = vec![0_i32; N];
let mut simd_out = vec![0_i32; N];

let start = Instant::now();
for _ in 0..ITERS {
    scalar_markout(&qty, &px_bp, &fee_bp, &mut scalar_out);
}
let scalar_us = start.elapsed().as_micros();

let start = Instant::now();
for _ in 0..ITERS {
    unsafe { simd_markout(&qty, &px_bp, &fee_bp, &mut simd_out) };
}
let simd_us = start.elapsed().as_micros();

assert_eq!(scalar_out, simd_out);
println!("scalar: {scalar_us} us");
println!("simd: {simd_us} us");
println!("checksum: {}", checksum(&simd_out));
println!("speedup: {:.2}x", scalar_us as f64 / simd_us as f64);

If you run the same kernel frequently, Wasmtime can also ahead-of-time compile the module into a .cwasm artifact:

wasmtime compile target/wasm32-wasip1/release/hft-wasm-simd.wasm
wasmtime run --allow-precompiled hft-wasm-simd.cwasm

That does not change your kernel logic, but it can reduce repeated startup and compilation overhead in operational workflows where the same module is launched often.

Verification and Expected Output

Your first success condition is correctness, not speed. The scalar and SIMD arrays must match exactly for the same input. Only after that should you inspect timing.

scalar: <time in us> us
simd: <time in us> us
checksum: <stable integer for the same dataset>
speedup: >1.00x
  • Pass when the checksum is stable and assert_eq! does not fire.
  • Pass when the SIMD path is measurably faster under your production-like batch size.
  • Fail when you benchmark tiny slices and conclude SIMD is useless; lane setup costs dominate small inputs.
  • Fail when you benchmark with logging, allocations, or host calls inside the hot loop.
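When the parity check does fire, a bare assert_eq! panic on two large vectors is hard to act on. A small helper that reports the first diverging index, an illustrative `first_mismatch` sketch assuming equal-length slices, localizes whether the bug is in a vector lane or in the scalar tail:

```rust
// Sketch: find the first index where the two outputs diverge.
// Assumes equal-length slices, which holds for the harness above.
fn first_mismatch(a: &[i32], b: &[i32]) -> Option<usize> {
    a.iter().zip(b.iter()).position(|(x, y)| x != y)
}
```

An index inside a 4-aligned block points at lane arithmetic; an index past the last multiple of 4 points at the scalar tail loop.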

For deeper inspection, Wasmtime’s tooling can help you validate how a module compiles. In repeated tuning passes, that matters more than raw microbenchmark numbers because it lets you see whether your supposedly vector-friendly code actually lowered into the instruction mix you expected.

Troubleshooting and What’s Next

Top 3 issues

  1. No speedup at all. Your loop may be memory-bound, too small, or too branchy. Move to larger batches, preallocate buffers, and make sure the timed region is only arithmetic.
  2. Compile succeeds, but SIMD is not really the bottleneck. In many trading stacks, serialization, exchange gateways, or risk checks around the loop dominate total latency. Keep the benchmark isolated and then re-test end-to-end.
  3. Results differ across environments. Standard SIMD instructions are the safer starting point for financial code. Wasmtime documents that relaxed SIMD introduces operations that trade deterministic cross-architecture behavior for performance, so avoid that path until you have explicit validation criteria.

Watch out: Do not port native SIMD assumptions blindly. Emscripten’s SIMD guidance notes that Wasm SIMD is portable by design, which means some hardware-specific tricks like prefetch-style tuning or architecture-specific semantics do not translate cleanly.
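Issue 1 often has a second cause worth ruling out: the optimizer deleting a loop whose result is never observed. A sketch of a defensively written timed region, using std::hint::black_box (stable since Rust 1.66) and a preallocated output buffer, looks like this; `markout_into` is an illustrative name for the same kernel as the baseline:

```rust
use std::hint::black_box;
use std::time::Instant;

// Same arithmetic as the scalar baseline, factored out for reuse.
fn markout_into(qty: &[i32], px_bp: &[i32], fee_bp: &[i32], out: &mut [i32]) {
    for i in 0..qty.len() {
        out[i] = qty[i] * px_bp[i] + fee_bp[i];
    }
}

fn main() {
    let n = 262_144;
    let qty: Vec<i32> = (0..n as i32).map(|i| (i % 97) - 48).collect();
    let px_bp: Vec<i32> = (0..n as i32).map(|i| 10 + (i % 23)).collect();
    let fee_bp: Vec<i32> = (0..n as i32).map(|i| (i % 7) - 3).collect();
    let mut out = vec![0_i32; n]; // allocated once, before timing

    let start = Instant::now();
    for _ in 0..200 {
        markout_into(&qty, &px_bp, &fee_bp, &mut out);
        // black_box keeps the result observable so the optimizer
        // cannot prove the loop dead and remove it.
        black_box(&out);
    }
    println!("timed region: {} us", start.elapsed().as_micros());
}
```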

What’s next

  • Move from synthetic vectors to captured production batches with the same symbol skew and burst patterns you see in live trading.
  • Split the kernel interface from the host strategy logic so you can reuse the same Wasm module in backtests, paper trading, and live risk workers.
  • Add p50, p99, and max latency measurements around the full call boundary, not just the inner loop.
  • If the kernel stays hot, explore packing more adjacent calculations into the same pass so each load does more work.
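The last bullet, packing more work into each pass, can be sketched in scalar form before vectorizing it. This is a hypothetical fusion, not part of the article's kernel: `fused_pass` and the `risk_delta` accumulation are illustrative, and the point is only that the second calculation reuses values already loaded for the first.

```rust
// Sketch: fuse the markout write and a hypothetical risk-delta
// accumulation into one pass, so each load of qty/px does double duty.
fn fused_pass(
    qty: &[i32],
    px_bp: &[i32],
    fee_bp: &[i32],
    out: &mut [i32],
    risk_delta: &mut [i64],
) {
    for i in 0..qty.len() {
        out[i] = qty[i] * px_bp[i] + fee_bp[i];
        // Widen to i64 before multiplying to avoid accumulator overflow.
        risk_delta[i] += qty[i] as i64 * px_bp[i] as i64;
    }
}
```

Once the fused scalar pass matches the two separate passes exactly, the same lane-wise translation used in Step 3 applies to it.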

The core idea is simple: use Wasm SIMD where the math is dense, regular, and repeatedly executed. That keeps portability high, reviewability reasonable, and the performance story honest enough for a production trading system.

Frequently Asked Questions

How do I enable SIMD in Rust WebAssembly builds?
Use RUSTFLAGS='-Ctarget-feature=+simd128' at build time, or annotate specific functions with #[target_feature(enable = "simd128")]. For Rust intrinsics, import from core::arch::wasm32 and compile to a Wasm target such as wasm32-wasip1.
Does WebAssembly SIMD support runtime feature detection like x86?
Not in the same way native x86_64 code often does. For current Rust Wasm SIMD workflows, you typically ship either a SIMD-enabled artifact or a non-SIMD artifact, so deployment strategy matters.
Is relaxed SIMD safe for financial systems?
Use caution. Relaxed SIMD can improve performance, but Wasmtime’s determinism guidance explicitly calls out that it gives up deterministic cross-architecture behavior unless you configure for deterministic execution, which is usually a poor default for trading systems.
Why is my Wasm SIMD benchmark slower than scalar code?
The usual causes are tiny batch sizes, memory-bound access patterns, buffer allocation inside the timed section, or benchmarking too much host overhead. SIMD helps arithmetic density; it does not automatically fix poor data layout or noisy measurements.
