PyTorch in the Browser with WASM and WebGPU [2026]
Bottom Line
For browser inference in 2026, treat PyTorch as the authoring framework and ONNX Runtime Web as the runtime. Export once, prefer WebGPU, and always ship a WASM fallback.
Key Takeaways
- You do not ship a raw .pt file to the browser; export from PyTorch first.
- Use onnxruntime-web/webgpu and fall back to wasm for broad coverage.
- navigator.gpu requires a secure context, so local file tests are misleading.
- Static shapes plus enableGraphCapture can improve repeat-run WebGPU performance.
- Keep browser inputs small and fixed-size before chasing model-level optimizations.
Running a PyTorch model in the browser is absolutely practical in 2026, but the architecture matters. There is still no mainstream path where a browser directly loads a raw .pt file and executes it like Python does. The durable setup is simpler: author and test the model in PyTorch, export it, and run it client-side with a browser runtime that can target WebGPU and fall back to WASM. This walkthrough shows the full path end to end.
- You will export from PyTorch 2.x, not deploy Python itself.
- The browser runtime layer is ONNX Runtime Web, with WebGPU plus WASM.
- Use static or bounded shapes whenever possible to keep browser inference predictable.
- Ship from a secure HTTPS origin, or navigator.gpu may not exist at all.
Prerequisites
What you need before you start
- A local Python environment with PyTorch and an ONNX verification runtime.
- A small inference model first, ideally image classification or a compact embedding model.
- A modern Chromium-based browser for the WebGPU path and any current browser for the WASM fallback.
- A local dev server. Do not test this from file://.
- A JavaScript or TypeScript frontend build. If you want to clean up pasted snippets before publishing, TechBytes' Code Formatter is useful for keeping browser and Python examples consistent.
Bottom Line
The practical browser stack is PyTorch for authoring, ONNX for interchange, and ONNX Runtime Web for execution. Prefer WebGPU for throughput, but keep WASM in the provider list so the app still works when GPU support is missing.
Step 1: Export the PyTorch model
Start by exporting a model that already runs correctly in Python. Keep the first pass boring: fixed input size, eval() mode, and a single forward method. PyTorch's current ONNX exporter supports the TorchDynamo-based path via dynamo=True, and verify=True can validate the exported graph against ONNX Runtime during export.
1. Install the export-side dependencies
python -m venv .venv
source .venv/bin/activate
pip install torch torchvision onnx onnxruntime

2. Export a minimal image model
import torch
from torchvision.models import mobilenet_v3_small
model = mobilenet_v3_small(weights=None).eval()
example = torch.randn(1, 3, 224, 224)
onnx_program = torch.onnx.export(
model,
(example,),
input_names=['input'],
output_names=['logits'],
dynamo=True,
verify=True,
)
onnx_program.save('public/models/mobilenet.onnx')

This is the key design boundary. PyTorch handles graph capture and export; the browser never needs the Python runtime. For browser targets, that separation is a feature, not a compromise.
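Beyond verify=True, it is worth spot-checking numerical parity yourself once you have logits from both the PyTorch model and the exported ONNX graph. The comparison itself is framework-agnostic; here is a minimal pure-Python sketch (the helper name and the 1e-2 tolerance are my own choices, not PyTorch defaults):

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two flat logit lists."""
    if len(a) != len(b):
        raise ValueError(f"length mismatch: {len(a)} vs {len(b)}")
    return max(abs(x - y) for x, y in zip(a, b))

# Example: logits from PyTorch vs ONNX Runtime for the same input.
torch_logits = [0.12, -1.30, 2.45]
ort_logits = [0.12, -1.30, 2.451]
diff = max_abs_diff(torch_logits, ort_logits)
assert diff < 1e-2, f"export drift too large: {diff}"
```

Small differences are normal across runtimes; a large one usually means a preprocessing or shape mismatch rather than a broken export.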
Step 2: Wire up the browser runtime
The execution side is where WASM and WebGPU come together. ONNX Runtime Web exposes both as execution providers. The pattern you want is explicit provider selection, plus a runtime check for navigator.gpu.
1. Install the browser runtime
npm install onnxruntime-web

2. Create a loader that prefers WebGPU
import * as ort from 'onnxruntime-web/webgpu';
ort.env.wasm.numThreads = 1;
ort.env.logLevel = 'warning';
export async function createSession() {
const hasWebGPU = typeof navigator !== 'undefined' && !!navigator.gpu;
const session = await ort.InferenceSession.create('/models/mobilenet.onnx', {
executionProviders: hasWebGPU ? ['webgpu', 'wasm'] : ['wasm'],
enableGraphCapture: hasWebGPU,
freeDimensionOverrides: {
batch: 1,
height: 224,
width: 224
}
});
return { session, backend: hasWebGPU ? 'webgpu' : 'wasm' };
}

There are two details worth calling out:
- Import path: use onnxruntime-web/webgpu when you want the WebGPU-enabled bundle.
- Fallback policy: keep 'wasm' in the provider list so unsupported browsers do not hard-fail.
3. Feed an input tensor from the page
import * as ort from 'onnxruntime-web/webgpu';
import { createSession } from './session.js';
const imageSize = 224;
function toCHWFloat32(imageData) {
const { data, width, height } = imageData;
const out = new Float32Array(1 * 3 * width * height);
const area = width * height;
for (let i = 0; i < area; i++) {
out[i] = data[i * 4] / 255;
out[area + i] = data[i * 4 + 1] / 255;
out[area * 2 + i] = data[i * 4 + 2] / 255;
}
return out;
}
export async function runModel(imageData) {
const { session, backend } = await createSession();
const input = new ort.Tensor('float32', toCHWFloat32(imageData), [1, 3, imageSize, imageSize]);
const result = await session.run({ input });
return { backend, logits: result.logits.data };
}

For a first implementation, keep preprocessing on the CPU. Once the model works, you can optimize data movement. ONNX Runtime Web also supports GPU-backed tensors and preferred GPU output locations, but those are second-pass improvements, not day-one requirements.
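To keep Python-side validation and browser preprocessing in lockstep, it helps to mirror the same RGBA-to-planar conversion offline. A dependency-free Python sketch of the same logic as the browser-side toCHWFloat32 (the function name is mine):

```python
def to_chw_float32(data, width, height):
    """Interleaved RGBA bytes (0-255) -> flat planar [R..., G..., B...]
    floats in [0, 1], matching the browser-side toCHWFloat32 helper.
    The alpha channel is ignored."""
    area = width * height
    out = [0.0] * (3 * area)
    for i in range(area):
        out[i] = data[i * 4] / 255                 # red plane
        out[area + i] = data[i * 4 + 1] / 255      # green plane
        out[2 * area + i] = data[i * 4 + 2] / 255  # blue plane
    return out

# One red pixel and one green pixel (RGBA, 2x1 image):
print(to_chw_float32([255, 0, 0, 255, 0, 255, 0, 255], 2, 1))
# [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
```

Running both implementations against the same test image is a cheap way to catch normalization drift before it shows up as mysteriously wrong logits.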
Verification and expected output
Your verification loop should prove four things: the model loads, the selected backend is correct, tensor shapes match the export contract, and repeat runs are stable.
1. Add a simple UI-level smoke test
const status = document.querySelector('#status');
try {
const output = await runModel(imageData);
console.log('backend:', output.backend);
console.log('logits length:', output.logits.length);
status.textContent = `OK: ${output.backend}, ${output.logits.length} logits`;
} catch (err) {
console.error(err);
status.textContent = `Failed: ${err.message}`;
}

Expected output
- On a supported Chromium browser over HTTPS, you should see backend: webgpu.
- On unsupported browsers, you should still get a valid response through wasm.
- The logits length should match the exported model's output shape. For a 1000-class classifier, that is typically 1000.
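Raw logits are rarely what the UI displays; typically you apply a softmax and take the top classes. A minimal pure-Python sketch of that post-processing step, shown offline so it can be validated against the browser output (the helper name is illustrative):

```python
import math

def top_k(logits, k=5):
    """Softmax the logits, then return (class_index, probability)
    pairs for the k most likely classes."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]

# Example: three-class logits; class 1 has the largest logit.
print(top_k([0.0, 2.0, 1.0], k=2))  # class index 1 ranks first, then 2
```

The same function works on the Float32Array data the browser returns, which makes it easy to diff Python and browser predictions class by class.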
2. Measure cold-start versus warm runs
const t0 = performance.now();
await runModel(imageData);
const t1 = performance.now();
await runModel(imageData);
const t2 = performance.now();
console.log('first run ms:', (t1 - t0).toFixed(1));
console.log('second run ms:', (t2 - t1).toFixed(1));

The first run includes model fetch, initialization, and potentially graph capture. The second run is the one that tells you whether the browser deployment is actually viable.
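A single warm run is noisy; a median over several warm runs is a more trustworthy number. The same measurement pattern, sketched in Python for the offline validation side (the helper name and defaults are my own):

```python
import statistics
import time

def time_runs(fn, warmup=1, runs=5):
    """Median and best wall-clock time in ms over several warm runs,
    after discarding warmup iterations (fetch, init, graph capture)."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000)
    return {"median_ms": statistics.median(samples), "best_ms": min(samples)}

# Example with a stand-in workload instead of a real inference call.
stats = time_runs(lambda: sum(range(10_000)))
print(stats)
```

In the browser, the equivalent loop with performance.now() around session.run() gives you the same median-vs-best picture.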
Troubleshooting top 3
- WebGPU never activates. Check that the app is running in a secure context. navigator.gpu is gated behind HTTPS, and browser support is still uneven. If the runtime falls back to wasm, that is expected behavior on unsupported browsers.
- Model export succeeds but browser inference fails. This usually means the exported graph or shapes do not match your browser inputs. Re-check input_names, output names, and the exact tensor shape you pass into session.run(). Fixed shapes are easier to debug than fully dynamic ones.
- Performance is worse than expected. The common causes are oversized models, expensive image preprocessing on the main thread, or repeated CPU-GPU copies. Start by shrinking the model, batching less aggressively, and enabling graph capture only when the model is shape-stable.
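Shape mismatches, the second failure mode above, can be caught early with an explicit contract check before any inference call. A hedged pure-Python sketch, where the expected shape matches the fixed 1x3x224x224 export used in this walkthrough (the function name is mine):

```python
def validate_feed_shape(shape, expected=(1, 3, 224, 224)):
    """Raise early if a tensor shape violates the export contract,
    instead of letting the runtime fail with a vaguer error."""
    if tuple(shape) != tuple(expected):
        raise ValueError(
            f"input shape {tuple(shape)} != export contract {tuple(expected)}"
        )
    return True

# Example: a resized-but-not-cropped image would fail loudly here.
validate_feed_shape((1, 3, 224, 224))  # OK
```

The same guard, expressed as a one-line check before session.run() in the browser code, turns a confusing runtime error into an actionable message.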
What's next
Once the basic path works, move from correctness to deployment quality.
- Replace ad hoc preprocessing with a shared pipeline so Python validation and browser inference use the same normalization rules.
- Use smaller or distilled models before attempting heroic frontend optimizations.
- Test fixed-size exports first, then introduce dynamic dimensions only where the product actually needs them.
- Keep outputs on the GPU only if the next pipeline stage also consumes GPU buffers; otherwise the added complexity is often wasted.
- If your model files include sensitive demo data, scrub them before sharing builds internally. A utility like TechBytes' Data Masking Tool fits that workflow better than fixing the issue after distribution.
The most important mental model is this: browser ML is now a systems problem, not a novelty demo. If you keep the export boundary clean, prefer WebGPU where available, and retain a solid WASM fallback, PyTorch-authored models can ship to the browser with production-grade behavior instead of experimental fragility.
Frequently Asked Questions
Can I run a raw PyTorch .pt or .pth file directly in the browser?
No. There is no mainstream browser runtime that executes a raw .pt file; export the model from PyTorch first and run it with a browser runtime such as ONNX Runtime Web.
Why does my app use WASM even though I imported the WebGPU build?
onnxruntime-web/webgpu only enables the WebGPU-capable bundle. The actual WebGPU path still depends on browser support and a secure context, because navigator.gpu may not exist otherwise. Keeping wasm in the provider list is the correct fallback strategy.