CVE-2026-9912 [Deep Dive]: GPU Shared Memory JIT Leak
Bottom Line
As of May 10, 2026, CVE-2026-9912 does not have a public record on CVE.org or NVD, so treat the identifier as unverified. The security lesson still stands: uninitialized GPU shared or local memory combined with JIT behavior can become a real cross-boundary disclosure risk.
Key Takeaways
- No public CVE.org or NVD entry for CVE-2026-9912 was visible on May 10, 2026
- The risk class is plausible because GPU shared or local memory is performance-critical and easy to misuse
- PTX can be JIT-compiled by the driver, which makes compiler behavior part of the attack surface
- OpenCL already exposes memory-initialization controls; CUDA offers sanitizer checks for shared-memory reads
- For high-assurance workloads, explicit zeroing and tenant isolation beat assumptions about compiler intent
A claimed issue tracked as CVE-2026-9912 describes information leakage from GPU shared memory after JIT-compiler optimization. The catch is important: as of May 10, 2026, there is no public record for that identifier on CVE.org or NVD. That does not make the underlying class imaginary. It makes this a useful case study in how accelerator runtimes, compiler passes, and scratchpad memory semantics can quietly turn a correctness bug into a tenant-isolation problem.
CVE Summary Card
Bottom Line
The identifier CVE-2026-9912 is not publicly documented as of May 10, 2026, but the claimed failure mode is credible enough that GPU teams should audit it now: if stale shared memory survives and an optimizer removes or weakens an intended clear, secrets can cross trust boundaries.
- Identifier: CVE-2026-9912
- Public status on May 10, 2026: no visible public record on CVE.org or NVD
- Claimed issue type: information disclosure via GPU shared or local memory
- Suspected trigger: JIT optimization interacting with incomplete initialization or divergent control flow
- Likely impact if real: disclosure of remnants from prior work-items, blocks, kernels, or tenants depending on scheduler and isolation boundaries
- What is verified from official docs: PTX can be JIT-compiled by the driver, CUDA shared memory persists for kernel execution, OpenCL exposes optional local-memory initialization controls, and NVIDIA tooling can inspect shared-memory initialization issues
The verified background matters. NVIDIA's CUDA documentation states that PTX may be JIT-compiled at application load time by the device driver, and the CUDA programming guide describes shared memory as a per-block scratchpad physically located on each SM. Khronos also documents cl_khr_initialize_memory, an OpenCL extension specifically meant to initialize local and private memory before kernel execution. Those are not rumors; they are official design surfaces that make this class of bug plausible.
Vulnerable Code Anatomy
Why this class is dangerous
GPU programmers often treat shared memory as a fast temporary cache. Security breaks appear when code assumes one of these conditions without enforcing it:
- Every element of the shared array is written before any thread reads it.
- The compiler will preserve a defensive clear because the programmer wrote one.
- Divergent control flow cannot expose stale bytes from inactive lanes.
- Per-kernel scratch state cannot become visible outside the intended trust boundary.
The official CUDA guide emphasizes that shared memory is accessible by all threads in a block and that developers must avoid data races with __syncthreads(). That is a correctness statement, but in multi-tenant systems it becomes a security statement too.
A representative buggy pattern
__global__ void score_kernel(const uint32_t* secret, uint32_t* out, int live) {
    __shared__ uint32_t tile[256];
    int tid = threadIdx.x;
    // Only some lanes write.
    if (tid < live) {
        tile[tid] = secret[tid];
    }
    __syncthreads();
    // Later logic assumes the whole tile is initialized.
    if (tid < 256) {
        out[tid] = tile[tid];
    }
}
This is not an exploit and not a working PoC. It is the minimal anatomy of the problem. If live is smaller than the tile width, part of tile is never initialized. If a cleanup or zero-fill path was intended but later optimized away, the kernel can export stale bytes that were never meant to be observable.
Where the compiler enters the picture
A JIT compiler becomes relevant when developers rely on a pattern like this:
- A pre-clear loop is emitted but only under a predicate the optimizer later proves redundant.
- A memset-like initialization is hoisted, merged, or removed because later stores appear to dominate all reads.
- One architecture path is ahead-of-time compiled while another falls back to PTX and is optimized by a newer driver.
- Control-flow simplification changes which lanes are considered active before a barrier.
That last point is what makes JIT behavior operationally uncomfortable: the same source kernel may pass review, then behave differently after a driver update because the compiler is part of the deployed product.
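To make the dead-store concern concrete, here is an illustrative sketch (a hypothetical reduction kernel, not taken from any advisory). The pre-clear exists only for hygiene: it is the sole thing protecting the tail slots of the tile when n is smaller than the block. A correct optimizer must keep it, but that safety rests entirely on the compiler's cross-lane analysis of the reads below the barrier, which is exactly the surface that can shift under a new JIT pass:

```cuda
#include <cstdint>

__global__ void sum_kernel(const float* in, float* out, int n) {
    __shared__ float buf[256];
    int tid = threadIdx.x;

    // Defensive clear, written purely for safety rather than for any
    // visible output on the common path. Whether it survives depends on
    // the optimizer proving (correctly) that the tree reduction below
    // reads slots that no later store dominates when n < 256.
    buf[tid] = 0.0f;
    __syncthreads();

    // Only lanes with valid input overwrite their slot.
    if (tid < n) {
        buf[tid] = in[tid];
    }
    __syncthreads();

    // Tree reduction reads the full tile, including tail slots that are
    // only safe because of the clear above. If a compiler pass wrongly
    // concludes the clear is redundant, stale scratchpad bytes feed
    // directly into the result.
    for (int stride = 128; stride > 0; stride >>= 1) {
        if (tid < stride) {
            buf[tid] += buf[tid + stride];
        }
        __syncthreads();
    }
    if (tid == 0) {
        out[blockIdx.x] = buf[0];
    }
}
```

The kernel assumes a block size of 256 threads. The uncomfortable property is that nothing in the source distinguishes "clear required for correctness" from "clear written out of caution," so a review cannot tell which one the compiler will see.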
Attack Timeline
What can actually be verified
- May 10, 2026: no public CVE.org record was visible for CVE-2026-9912.
- May 10, 2026: no public NVD detail entry was visible for CVE-2026-9912.
- May 10, 2026: no matching vendor advisory was found in official CUDA or Khronos references using that identifier.
- May 10, 2026: the underlying technical backdrop is public and current: driver-side PTX JIT is documented, shared memory behavior is documented, and OpenCL memory-initialization controls are documented.
That means the public timeline is not an exploit timeline yet. It is a verification timeline. For defenders, that distinction matters. A missing public record should not trigger complacency, but it should stop teams from repeating unsourced severity claims, invented version ranges, or fake vendor guidance.
Exploitation Walkthrough
Prerequisites an attacker would likely need
- Ability to submit or influence GPU kernels in a shared runtime, service, browser, ML platform, or plugin model.
- Reuse of the same physical GPU resources across trust boundaries without strong scrubbing guarantees.
- A code path where shared or local memory is only partially initialized before readback.
- A compilation path where JIT optimization changes the intended initialization behavior.
- A way to move leaked bytes into global memory, host-visible buffers, logs, or model outputs.
Conceptual attack chain
- An attacker identifies a kernel that uses shared memory as a tile, staging buffer, reduction cache, or temporary key schedule.
- The attacker forces an execution shape where only part of the shared allocation is written, often by manipulating dimensions, predicates, or sparse work.
- The runtime compiles the PTX on load, and an optimization pass removes, narrows, or reorders a defensive initialization that the developer thought was guaranteed.
- A later read copies the partially stale scratchpad into a host-visible result buffer.
- The attacker repeats the query to sample remnants and correlate bytes with previous workloads.
In a real incident, the practical target would be whatever high-value data briefly touches the GPU scratchpad: prompt fragments, embedding chunks, decrypted tokens, intermediate tensor slices, or application-specific secrets. That is why the blast radius depends less on the elegance of the bug and more on deployment architecture.
If your platform allows untrusted tenant code to run on the same accelerator fleet as sensitive inference or data-processing jobs, a shared-memory disclosure bug is not just a kernel bug. It is an isolation failure. Teams handling reproduction artifacts should also sanitize traces, dumps, and copied payload fragments before sharing them; TechBytes' Data Masking Tool is useful for redacting customer fields during incident handling.
Hardening Guide
Code-level fixes
- Initialize the full shared allocation on every path, not just the lanes expected to be live.
- Place initialization before the first read and follow it with a synchronization primitive such as __syncthreads() where required.
- Avoid reading inactive or tail elements from tiles, reductions, and staging buffers.
- Prefer explicit bounds-aware loops over clever control flow that depends on optimizer reasoning.
- Review dead-store elimination risk whenever a clear is written only for safety, not for visible program output.
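Applied to the earlier score_kernel anatomy, a minimal hardened sketch (illustrative, not vendor guidance; it assumes blockDim.x == 256 as the original does) initializes the whole tile on every path before the partial write:

```cuda
#include <cstdint>

__global__ void score_kernel_fixed(const uint32_t* secret, uint32_t* out, int live) {
    __shared__ uint32_t tile[256];
    int tid = threadIdx.x;

    // 1. Every slot is written unconditionally, so no lane can later
    //    observe stale scratchpad contents from a previous block or kernel.
    tile[tid] = 0u;
    __syncthreads();

    // 2. Live lanes overwrite their slot with real data.
    if (tid < live) {
        tile[tid] = secret[tid];
    }
    __syncthreads();

    // 3. Readback now exports either real data or an explicit zero,
    //    never uninitialized shared memory.
    out[tid] = tile[tid];
}
```

Note that the zero-fill here is observable program behavior (tail slots of out receive it), which makes it much harder for any optimizer to classify as a dead store than a clear that only "might" matter.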
Compiler and runtime controls
- For high-assurance paths, favor reproducible ahead-of-time binaries over relying purely on forward-compatible PTX.
- Pin driver and toolkit versions in production where security-sensitive kernels are involved.
- Re-test kernels after every driver upgrade that can change JIT behavior.
- On OpenCL stacks, evaluate cl_khr_initialize_memory and CL_CONTEXT_MEMORY_INITIALIZE_LOCAL_KHR where supported.
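Where the device reports the cl_khr_initialize_memory extension, the opt-in happens at context creation. A host-side sketch under that assumption (property and bitfield names follow the Khronos extension specification; extension discovery and error handling are elided):

```c
#include <CL/cl.h>
#include <stddef.h>

/* Sketch: request runtime-initialized local and private memory for every
 * kernel launched in this context. Requires the cl_khr_initialize_memory
 * device extension; real code should check the device's extension string
 * via clGetDeviceInfo(CL_DEVICE_EXTENSIONS, ...) before relying on it. */
static cl_context create_hardened_context(cl_platform_id platform,
                                          cl_device_id device,
                                          cl_int* err) {
    cl_context_properties props[] = {
        CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
        /* Ask the runtime to scrub local (and private) memory before
         * each kernel begins executing. */
        CL_CONTEXT_MEMORY_INITIALIZE_KHR,
        (cl_context_properties)(CL_CONTEXT_MEMORY_INITIALIZE_LOCAL_KHR |
                                CL_CONTEXT_MEMORY_INITIALIZE_PRIVATE_KHR),
        0
    };
    return clCreateContext(props, 1, &device, NULL, NULL, err);
}
```

The design trade-off is explicit: this moves memory hygiene from "every kernel author remembers" to "the runtime enforces," at some initialization cost per launch.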
Detection and validation
- Run NVIDIA Compute Sanitizer with --tool initcheck --initcheck-address-space shared during CI for kernels that use shared memory heavily.
- Use racecheck and synccheck alongside init checks, because race conditions often mask as intermittent disclosure bugs.
- Maintain regression tests that vary block sizes, sparsity, tail lengths, and divergent branches rather than just happy-path tensor shapes.
- Audit logs and outputs for low-entropy fragments or repeated cross-request residue that may indicate stale scratchpad export.
Deployment controls
- Separate untrusted GPU workloads from sensitive ones at the device, node, or pool level whenever possible.
- Reduce code-injection surfaces that let users upload arbitrary kernels, custom ops, or unsafe plugins.
- Treat accelerator fleets as shared compute with memory-hygiene obligations, not as opaque coprocessors.
Architectural Lessons
What this case teaches even without a public advisory
- Scratchpad memory is a security boundary. Shared and local memory are often described as performance primitives, but in multi-tenant systems they also hold sensitive intermediate state.
- Undefined initialization is not a minor bug. On CPUs it may surface as flaky correctness; on GPUs it can become structured cross-request disclosure.
- JIT compilers expand the trusted surface. NVIDIA documents that PTX can be compiled by the driver at runtime, so compiler behavior is part of operational risk management.
- Portability features can complicate assurance. Shipping PTX improves forward compatibility, but it also means security-relevant code generation can change after deployment.
- Memory initialization should be explicit policy. Khronos exposing an extension for local-memory initialization is a signal that this is important enough to standardize.
The practical takeaway is straightforward. Even though CVE-2026-9912 is not publicly published as of May 10, 2026, engineering teams should use the claim as a forcing function to review shared-memory assumptions, inspect their PTX JIT paths, and decide which workloads truly deserve stronger accelerator isolation. Waiting for a perfect advisory is the wrong threshold when the memory model already tells you where the edge is.
For teams that want to review the official architecture backdrop, the relevant starting points are NVIDIA's CUDA programming guide on shared memory, NVIDIA's documentation on driver-side PTX JIT compilation, and Khronos documentation for cl_khr_initialize_memory.
Frequently Asked Questions
Is CVE-2026-9912 publicly listed anywhere?
As of May 10, 2026, no public record was visible on CVE.org or NVD for CVE-2026-9912. That means the identifier should be treated as unverified until a CNA or vendor publishes a record.
Can GPU shared memory really leak data across workloads?
Yes, if hygiene fails. Shared memory is a per-SM scratchpad that persists for a kernel's execution, and code that reads slots it never wrote can export remnants of earlier work. Whether that crosses a trust boundary depends on scheduling, scrubbing guarantees, and tenant isolation.
Why does JIT compilation matter for a memory leak bug?
Because PTX can be JIT-compiled by the device driver, optimization passes run after deployment. A defensive clear that passed code review can still be removed, narrowed, or reordered by a newer driver's compiler, silently changing the kernel's effective initialization behavior.
How do I test for uninitialized shared-memory reads in CUDA?
Run compute-sanitizer --tool initcheck --initcheck-address-space shared against kernels that rely on shared memory. Pair that with racecheck and synccheck, because synchronization bugs often hide the same root cause.