Security Deep-Dive

CVE-2026-1104: NPU Side-Channel Attack Deep Dive [2026]

Dillip Chowdary
Tech Entrepreneur & Innovator · April 16, 2026 · 8 min read

On March 12, 2026, Ananya Krishnamurthy at Horizon Security Labs disclosed CVE-2026-1104, a side-channel vulnerability in shared Neural Processing Unit (NPU) inference infrastructure. The flaw allows a co-tenant in a multi-tenant NPU cluster to reconstruct proprietary model weights — or intercept inference outputs from neighbouring workloads — by precisely timing memory bus contention events on the shared NPU fabric. As hyperscale cloud providers commoditise NPU-backed inference, this class of vulnerability deserves the same structural attention that CPU-level attacks like Spectre and Meltdown received in 2018.

CVE-2026-1104 — Summary Card

CVSS Score: 8.6 (High)
Vector: AV:N/AC:H/PR:L/UI:N/S:C/C:H/I:N/A:N
CWE: CWE-1264 — Hardware Logic with Insecure De-Synchronization between Control and Data Channels
Affected: HorizonAI CloudAccel v3.x, NeuralEdge Fabric 2.4–2.7
Disclosed: March 12, 2026 (90-day coordinated disclosure)
Patch Status: Partial — firmware update available; full isolation requires hardware revision

Key Takeaway

NPU multi-tenancy introduces the same class of timing side-channel risk that plagued CPUs after Spectre/Meltdown — but NPU isolation primitives are far less mature. Until hardware vendors ship fully partition-fenced NPU designs, dedicated tenancy is the only provably safe deployment model for proprietary AI workloads. Firmware mitigations buy time; they do not close the channel.

Background: NPUs in the Multi-Tenant Cloud

Modern cloud NPUs — purpose-built silicon for matrix multiplication and tensor operations — are engineered for throughput density, not tenant isolation. A single NPU die may host 16 or more concurrent inference sessions across different customers, with on-chip SRAM, memory controllers, and DMA engines shared among all tenants. The NPU scheduler multiplexes compute across these workloads in microsecond-scale time slices, maximising hardware utilisation at the expense of strict microarchitectural partitioning.

This design creates shared observable state: memory bus arbitration queues, prefetch buffer eviction patterns, and DMA latency are all measurable by any tenant with access to a high-resolution timer. When a neighbouring tenant loads a large model shard, the resulting bus contention produces a measurable latency spike — a signal that, with sufficient sampling, correlates with specific memory access patterns and ultimately with model structure.

CPU architects learned this lesson the hard way with Spectre (2018) and its descendants. NPU architects are now learning it again, but without the decade of hardening work that x86 and ARM benefited from.

Vulnerable Code Anatomy

CVE-2026-1104 originates in the NeuralEdge Fabric Driver (nef_driver.ko), specifically in the routine that maps hardware performance counters into tenant user-space. The driver exposes a high-precision bus utilisation counter via a shared memory page mapped read-only into each tenant's virtual address space:

/* nef_driver.c — vulnerable context-mapping routine */
static int nef_map_perf_page(struct nef_tenant_ctx *ctx,
                             struct vm_area_struct *vma)
{
    /* Map the NPU bus perf counter page read-only into tenant VAS */
    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
    return remap_pfn_range(
        vma,
        vma->vm_start,
        ctx->npu->shared_perf_pfn,   /* BUG: single PFN for entire NPU die */
        PAGE_SIZE,
        vma->vm_page_prot
    );
}

The defect: shared_perf_pfn resolves to a single physical page that accumulates NPU memory bus utilisation counters for the entire die — not per-tenant counters. Every mapped tenant reads nanosecond-resolution bus occupancy data reflecting the aggregate traffic of all co-located workloads. The counter is not virtualised, partitioned, or access-controlled beyond the read-only mapping.

A secondary vulnerability compounds the issue: the driver's context-switch path does not flush the NPU's on-chip SRAM prefetch buffer between tenant time slices. Residual activation patterns from a previous tenant's forward pass can bleed into the new tenant's observable state via cache-timing differences — an oracle orthogonal to the bus counter leak.

The root cause in both cases is the same: the driver was designed when NPUs were used exclusively in single-tenant environments. The multi-tenant deployment model arrived later, without a corresponding security review of the performance monitoring interface.

Attack Timeline

The following sequence describes how CVE-2026-1104 was exercised in the researcher's proof-of-concept environment. No working exploit code is presented here; the walkthrough is conceptual and sanitised for responsible disclosure.

  1. Co-location acquisition (T−30 min): The attacker provisions a low-cost spot inference workload in the same cloud region as the target, submitting repeated reservation requests until latency fingerprinting confirms placement on the same physical NPU die as the victim tenant.
  2. Baseline collection (T+0 to T+30 min): The attacker workload submits idle inference requests at maximum rate, recording the shared_perf_pfn bus occupancy counter at 1 µs resolution during its own compute windows. This establishes a baseline distribution of idle bus noise.
  3. Reference library construction (T+30 to T+60 min): Deliberately crafted input tensors are submitted to the attacker's own model while bus latency is sampled. This builds a reference library mapping bus contention signatures to specific layer dimensions and weight matrix shapes across known architectures.
  4. Victim inference triggering (T+60 to T+90 min): The attacker repeatedly triggers inference calls against the target's public API endpoint, correlating the resulting bus contention spikes against the reference library.
  5. Weight reconstruction (T+90 min onward): After approximately 8,000 correlation samples, the attacker reconstructs the target model's architecture with ~91% accuracy on layer width and depth, and recovers statistical distributions of weight matrices sufficient to mount a downstream model-extraction attack.

Exploitation Walkthrough (Conceptual)

The core primitive is a timing oracle. Because the shared performance counter reflects cross-tenant bus traffic, the attacker can distinguish a neighbour loading a small embedding layer (low bus pressure: a <20% utilisation spike lasting ~5 µs) from a large multi-head attention block (a sustained >75% spike across 40–120 µs). The granularity is sufficient to identify individual transformer layers by their characteristic memory bandwidth profile.

Critically, this attack requires no kernel exploit, no privilege escalation, and no network interception. It operates entirely through the legitimate, documented performance counter API. Standard security tooling — EDR agents, network monitors, anomaly detectors — has no visibility into the attack channel. From the platform's perspective, the attacker is running an ordinary inference workload.

Reconstruction fidelity depends on three factors:

  • Sample count: More inference triggers produce sharper weight distribution estimates. Public API endpoints with no rate limiting are ideal targets — the attacker can accumulate thousands of samples per hour at negligible cost.
  • Architecture complexity: Transformer models with large KV-cache layers produce the most distinctive bus signatures. Convolutional networks are significantly harder to fingerprint due to their more uniform memory access patterns.
  • Co-location persistence: The longer attacker and victim share the same NPU die, the more samples accumulate. Cloud spot-instance bidding strategies can sustain co-location for hours across multiple reservation windows.

If your organisation exposes inference APIs over proprietary models, the metadata embedded in inference response timings, output distributions, and error codes constitutes a meaningful leakage surface. Our Data Masking Tool provides a useful framework for auditing which categories of inference metadata carry leakage risk before they reach an external consumer.

Hardening Guide

Effective mitigation requires a layered approach. No single control fully closes CVE-2026-1104 without a hardware revision, but the following stack substantially degrades the attacker's oracle and increases time-to-reconstruct beyond practical thresholds for most threat actors.

Driver and Firmware Level

  • Apply NeuralEdge Fabric Driver 2.7.4+: The patch replaces the shared shared_perf_pfn page with virtualised per-tenant performance counter registers, eliminating the primary cross-tenant timing oracle. Requires a driver module reload and NPU scheduling restart — no full node reboot required on patched kernels.
  • Enable SRAM flush on context switch: Set firmware flag NEF_CTX_FLUSH_SRAM=1 in the NPU device configuration. This zeroes on-chip SRAM banks between tenant time slices, closing the secondary activation bleed channel. Measured performance cost: 3–7% throughput reduction at P99 latency under mixed workloads.
  • Disable the legacy perf-counter mmap interface: If your workload does not require user-space hardware performance monitoring, set nef_perf_mmap_enabled=0 in module parameters to remove the attack surface entirely. Verify no internal tooling depends on this interface before disabling.
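
A hypothetical sketch of how the three driver-level mitigations above might be applied on a node. The module, parameter, and flag names are taken from the bullets above, which describe vendor-specific interfaces; verify the exact names against the vendor advisory before use:

```shell
# Hypothetical hardening sketch -- parameter and flag names assumed
# from the advisory summary; confirm against vendor documentation.

# 1. Persist module parameters: disable the legacy perf-counter mmap
#    interface (only if no internal tooling depends on it).
cat > /etc/modprobe.d/nef_driver.conf <<'EOF'
options nef_driver nef_perf_mmap_enabled=0
EOF

# 2. Enable SRAM flush between tenant time slices (3-7% throughput
#    cost); NEF_CTX_FLUSH_SRAM=1 goes in the NPU device configuration.

# 3. Reload the patched driver (2.7.4+); no full node reboot required
#    on patched kernels.
modprobe -r nef_driver && modprobe nef_driver
```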

Platform and Cloud Deployment Level

  • Dedicated NPU tenancy: Reserve single-tenant NPU capacity (available as a premium SKU on major hyperscalers as of Q1 2026). This is the only complete mitigation until hardware-level partition fencing ships. For workloads where model IP is a competitive moat, the cost premium is justified.
  • Inference rate limiting: Implement per-client rate limits on public inference API endpoints. Krishnamurthy's analysis shows that a 10 RPS cap increases attacker time-to-reconstruct by approximately 40× — not a full mitigation, but a significant friction increase against opportunistic attacks.
  • Output perturbation: Inject calibrated noise into inference outputs — for example, randomised rounding on logit vectors or top-k truncation with added temperature. This disrupts timing correlation between bus contention patterns and model structure without meaningfully degrading accuracy for most downstream tasks.

Monitoring and Detection

  • Alert on anomalously high sustained inference throughput from a single tenant on shared NPU nodes — the volume of requests needed to build a reference library is a detectable signal if baseline rates are established.
  • Track co-location persistence: flag tenants that repeatedly co-locate with the same high-value inference endpoints across multiple reservation windows. Legitimate workloads rarely exhibit this pattern.
  • Review access logs for timing-correlation patterns — unusually uniform inter-request intervals at maximum supported rate are characteristic of automated sampling workloads rather than production traffic.

Architectural Lessons

CVE-2026-1104 follows a pattern the security community has traced through hardware generations. A microarchitectural resource designed for performance observability becomes an information channel the moment it is shared across trust boundaries without virtualisation. Spectre exploited CPU branch predictor state; RowHammer exploited DRAM refresh timing; now NPU memory bus counters join the same lineage.

For architects designing or procuring AI inference infrastructure, the lessons are structural:

  • Treat every performance telemetry interface as a security surface. Any counter that reflects aggregate hardware state across tenants is a potential side-channel. Per-tenant virtualisation of performance monitoring must be a first-class requirement in NPU design specifications, not a retrofit.
  • Assume isolation boundaries are porous until formally verified. Hardware security primitives for NPUs — the equivalent of CPU SMEP, SMAP, or VT-x — are still maturing. Compensate at the software and deployment layer until silicon catches up.
  • Model confidentiality is an infrastructure concern, not just an API concern. Even with a fully locked-down inference API, physical co-location leaks structural information about proprietary models. Architecture reviews for AI systems must include NPU deployment topology alongside application-layer controls.
  • Responsible disclosure pipelines for hardware firmware are too slow. The 90-day disclosure window that Horizon Security Labs followed is standard, but NPU vendor patch pipelines remain operationally complex compared to software deployments. Hardware OEMs must invest in faster, operationally simpler firmware update mechanisms for deployed AI accelerators.

The industry is responding: the NPU Security Working Group, formed under the Open Compute Project in January 2026, has published a draft specification requiring mandatory per-tenant hardware partitioning in cloud NPU designs. Compliant silicon is not expected before late 2027 at the earliest.

Until then, defence-in-depth — dedicated tenancy for sensitive workloads, firmware patching, inference rate limiting, and output perturbation — remains the only realistic posture for operators with genuine model IP at stake. The attack surface will not shrink until the hardware does the work. Patch now, architect for isolation later.
