Serverless GPU Pricing Matrix [2026 Developer Cheat Sheet]
Bottom Line
Modal is the cleanest scale-to-zero reference for serverless GPU jobs, Replicate is the fastest path to hosted model APIs, and Lambda usually wins raw per-GPU VM pricing but does not behave like true serverless.
Key Takeaways
- Modal lists H100 at $3.9492/hr; Replicate lists H100 at $5.49/hr.
- Lambda lists 1x H100 PCIe at $2.86/hr and 1x H100 SXM at $3.78/hr.
- Modal bills active compute time; Lambda bills while the instance is running; Replicate depends on model type.
- Replicate public models are often runtime-billed, but most private deployments also bill setup and idle time.
Prices were verified on May 7, 2026 against official vendor pricing pages and documentation. The tricky part is not just the posted $/GPU-hour number: Modal, Replicate, and Lambda each meter work very differently. This cheat sheet is optimized for engineering teams that need a fast reference for bursty inference, always-on deployments, and cost modeling before they commit to one platform.
- Modal H100: $3.9492/hr from $0.001097/sec.
- Replicate H100: $5.49/hr for gpu-h100.
- Lambda H100: $2.86/hr for 1x H100 PCIe, $3.78/hr for 1x H100 SXM.
- Idle billing is the real separator: Modal minimizes it, Lambda charges while running, Replicate depends on model type.
| Dimension | Modal | Replicate | Lambda | Edge |
|---|---|---|---|---|
| Service model | Serverless functions and containers | Hosted model API plus custom deployments | On-demand GPU instances and clusters | Modal for serverless purity |
| Single-GPU H100 listed price | $3.9492/hr | $5.49/hr | $2.86/hr PCIe or $3.78/hr SXM | Lambda on raw price |
| Listed A100 price | $2.4984/hr for A100 80GB | $5.04/hr for A100 80GB | $1.48/hr for A100 40GB | Modal on like-for-like 80GB |
| Listed L40S price | $1.9512/hr | $3.51/hr | Not listed on current public pricing | Modal |
| Low-end GPU options | T4 $0.5904/hr, L4 $0.7992/hr | T4 $0.81/hr | Quadro RTX 6000 $0.58/hr, A10 $0.86/hr | Mixed |
| Billing granularity | Per second actual compute time | Per second for hardware-timed runs; some official models use input or output pricing | Hourly pricing, billed in one-minute increments while running | Modal / Replicate |
| Idle billing | No idle charge for inactive serverless execution | Public models usually runtime-only; most private models bill setup, idle, and active time | Yes, while the instance is running | Modal |
| Operational access | Managed runtime | API-first, low infra control | Full VM access via SSH and JupyterLab | Lambda for control |
Note: the price winner changes with workload shape. Lambda is frequently the cheapest per posted GPU-hour, but that does not make it cheapest for spiky traffic if the VM sits idle. Replicate also mixes hardware-time pricing with model-level pricing, so compare the specific model page before locking in estimates.
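To make the workload-shape point concrete, here is a minimal back-of-envelope sketch in Python using the listed prices. The traffic shape (2,000 short requests per day) is an illustrative assumption, and the sketch ignores cold starts, Replicate setup and idle billing on private models, and the option of stopping a Lambda VM between bursts.
# Back-of-envelope daily cost for bursty H100 inference.
# Prices are the listed rates above; the workload numbers are assumptions.
MODAL_H100_PER_SEC = 0.001097        # $/sec; x 3600 = the $3.9492/hr list price
LAMBDA_H100_SXM_PER_HR = 3.78        # $/hr, billed while the instance runs
REPLICATE_H100_PER_HR = 5.49         # $/hr for hardware-timed runs

requests_per_day = 2000              # assumed traffic
seconds_per_request = 3              # assumed GPU time per request
active_seconds = requests_per_day * seconds_per_request

modal_daily = active_seconds * MODAL_H100_PER_SEC
replicate_daily = active_seconds / 3600 * REPLICATE_H100_PER_HR
lambda_daily = 24 * LAMBDA_H100_SXM_PER_HR   # VM left running all day

print(f"Modal:     ${modal_daily:.2f}/day")      # ~$6.58
print(f"Replicate: ${replicate_daily:.2f}/day")  # ~$9.15
print(f"Lambda:    ${lambda_daily:.2f}/day")     # ~$90.72 unless stopped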
Pricing Matrix
Bottom Line
If you need true scale-to-zero GPU execution, start with Modal. If you need the fastest path to hosted model APIs, start with Replicate. If you need full-machine control and the lowest headline $/GPU-hour, use Lambda and budget for idle time.
Common SKUs at a glance
- Modal publishes per-second GPU pricing, including B200, H200, H100, A100, L40S, A10, L4, and T4.
- Replicate publishes hardware pricing for A100 80GB, H100, L40S, and T4, plus multi-GPU variants.
- Lambda publishes instance pricing for B200, GH200, H100 PCIe, H100 SXM, A100, A10, A6000, and Quadro RTX 6000.
When To Choose Each
Choose Modal when:
- You want serverless GPU jobs without paying for idle machines.
- You are shipping Python-first inference, batch jobs, queues, or web endpoints.
- You need access to newer GPUs like H200 or B200 with simple code-level configuration.
- You want cleaner cost control for bursty traffic than a permanently running VM.
Choose Replicate when:
- You want an API for hosted models now, not a platform migration project.
- You value official models, stable model APIs, and built-in webhook patterns.
- You are comfortable paying a premium for packaging, distribution, and marketplace convenience.
- You want to expose inference quickly to web or product teams.
Choose Lambda when:
- You need full VM access, SSH, custom services, notebooks, or background daemons.
- Your jobs are long-running enough that VM-style billing is acceptable.
- You want strong raw $/GPU-hour economics and are willing to manage more infrastructure.
- You need a bridge from experimentation to more traditional cluster or multi-node workflows.
Commands By Purpose
Bootstrap access
pip install modal
modal token new
modal token info
Run a model on Replicate from Node.js
export REPLICATE_API_TOKEN=r8_******
npm install replicate
import Replicate from "replicate";
const replicate = new Replicate();
const [output] = await replicate.run(
"black-forest-labs/flux-schnell",
{
input: {
prompt: "An astronaut riding a rainbow unicorn, cinematic, dramatic"
}
}
);
Iterate and deploy on Modal
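The commands below assume an app.py along these lines; a minimal sketch, where the app name and function body are placeholders:
import modal

app = modal.App("example-app")

@app.function()
def train():
    # Placeholder workload; swap in the real training or batch step.
    print("training step complete")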
modal run app.py::train
modal serve app.py
modal deploy --env=prod --stream-logs app.py
Replicate sync and async prediction patterns
curl -s -X POST \
-H 'Prefer: wait' \
-H "Authorization: Bearer $REPLICATE_API_TOKEN" \
-H 'Content-Type: application/json' \
-d '{"version": "5c7d5dc6dd8bf75c1acaa8565735e7986bc5b66206b55cca93cb72c9bf15ccaa", "input": {"text": "Alice"}}' \
https://api.replicate.com/v1/predictions
curl -X POST \
-H "Authorization: Bearer $REPLICATE_API_TOKEN" \
-H "Cancel-After: 1m30s" \
-H "Content-Type: application/json" \
-d '{"input": {"prompt": "The sun rises slowly between tall buildings."}}' \
https://api.replicate.com/v1/predictions
Access a Lambda instance
ssh -i '<SSH-KEY-FILE-PATH>' ubuntu@<INSTANCE-IP>
ssh -L <LOCAL-PORT>:127.0.0.1:<REMOTE-PORT> ubuntu@<INSTANCE-IP>
Move data into Lambda
rsync -av --info=progress2 <FILES> <USERNAME>@<SERVER-IP>:<REMOTE-PATH>
s5cmd ls s3://<S3-BUCKET>
rclone ls my-remote:example-bucket
rclone -P copy <REMOTE>:<BUCKET-NAME> <LOCAL-DIR>
Configuration
Modal profiles, environments, and secrets
- Authentication can be stored with modal token set or environment variables like MODAL_TOKEN_ID and MODAL_TOKEN_SECRET.
- Environment scoping is first-class: create one with modal environment create dev and select it with modal config set-environment dev or MODAL_ENVIRONMENT.
- Secrets can be created with the CLI and injected into functions.
modal environment create dev
modal config set-environment dev
modal secret create api-creds OPENAI_API_KEY=sk-******
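A created secret can then be injected at function definition time; a minimal sketch, assuming the api-creds secret from above:
import os
import modal

app = modal.App()

@app.function(secrets=[modal.Secret.from_name("api-creds")])
def call_api():
    # Keys stored in the secret surface as environment variables in the container.
    print("key present:", "OPENAI_API_KEY" in os.environ)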
Replicate token hygiene
- Replicate API tokens are 40-character secrets that begin with r8_.
- Store them in environment variables, not source files.
- If you paste examples into docs or tickets, scrub them first with TechBytes' Data Masking Tool.
export REPLICATE_API_TOKEN=r8_******
Lambda account setup
- API keys authenticate Lambda Cloud API operations (see the sketch after the commands below).
- SSH keys are required before launching instances you want to access directly.
- By default, Lambda allows only incoming ICMP and TCP/22 unless you change firewall rules.
- Attached filesystems mount at /lambda/nfs/<FILESYSTEM_NAME>.
echo '<PUBLIC-KEY>' >> ~/.ssh/authorized_keys
cat ~/.ssh/authorized_keys
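For the API-key side, a minimal sketch of listing your instances via the Lambda Cloud API; the endpoint and basic-auth scheme reflect the public docs, but verify against the current API reference before relying on it:
import os
import requests

# The API key goes in the basic-auth username slot; the password stays empty.
resp = requests.get(
    "https://cloud.lambdalabs.com/api/v1/instances",
    auth=(os.environ["LAMBDA_API_KEY"], ""),
)
for inst in resp.json().get("data", []):
    print(inst.get("id"), inst.get("ip"), inst.get("status"))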
Advanced Usage
Modal: pick GPU type in code
Modal keeps GPU selection close to the workload definition, which is useful when you want code review to capture cost-sensitive changes.
import modal
app = modal.App()
image = modal.Image.debian_slim().pip_install("torch")
@app.function(gpu="H100:2", image=image)  # request two H100s in one container
def run():
import torch
print(torch.cuda.is_available())
- Modal supports appending :n to request multiple GPUs per container.
- The current docs list GPU values including T4, L4, A10, L40S, A100, H100, H200, and B200.
Replicate: use sync for latency, async for orchestration
- Use Prefer: wait for short-lived inference where you want the output in one request.
- Use default async mode plus webhooks for background jobs (see the sketch after this list).
- Use Cancel-After to cap runaway cost on slow generations.
- Remember that official models may use predictable per-output pricing instead of hardware-time pricing.
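A minimal sketch of the async-plus-webhook pattern against the predictions API; the model version hash and webhook receiver URL are placeholders:
import os
import requests

resp = requests.post(
    "https://api.replicate.com/v1/predictions",
    headers={
        "Authorization": f"Bearer {os.environ['REPLICATE_API_TOKEN']}",
        "Content-Type": "application/json",
    },
    json={
        "version": "<MODEL-VERSION-ID>",                  # placeholder
        "input": {"prompt": "The sun rises slowly between tall buildings."},
        "webhook": "https://example.com/replicate-hook",  # placeholder receiver
        "webhook_events_filter": ["completed"],           # ping only on terminal states
    },
)
prediction = resp.json()
# Without a webhook, poll prediction["urls"]["get"] until "status" becomes
# "succeeded", "failed", or "canceled".
print(prediction.get("id"), prediction.get("status"))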
Lambda: treat storage and connectivity as first-class concerns
- Single-GPU instances usually launch in 3-5 minutes; multi-GPU instances usually take 10-15 minutes.
- Billing starts after the instance boots and passes health checks, then continues while the instance runs.
- Use rsync, rclone, or s5cmd for data movement. The Filesystem S3 Adapter adds S3-compatible tooling support in select regions.
- If you only need an internal service exposed locally, prefer an SSH tunnel over opening public ports.
ip -4 -br addr show enp5s0
nmap -Pn <INSTANCE-IP>
Sources And Notes
- Modal Pricing
- Modal GPU Docs
- Modal CLI: run
- Modal CLI: deploy
- Modal CLI: serve
- Modal CLI: token
- Modal CLI: secret
- Modal CLI: environment
- Replicate Pricing
- Replicate Node.js Quickstart
- Replicate Predictions API
- Replicate API Tokens
- Replicate Official Models
- Lambda Pricing
- Lambda: Creating and Managing Instances
- Lambda Billing
- Lambda: Connecting to an Instance
- Lambda: Importing and Exporting Data
- Lambda Filesystems
- Lambda Filesystem S3 Adapter
If you want to turn the snippets above into team-safe examples, run them through the TechBytes Code Formatter before publishing them into internal docs or runbooks.
Frequently Asked Questions
Is Lambda Labs actually serverless in the same way as Modal?
No. Lambda provisions on-demand VM instances that bill while they run; Modal scales to zero and bills only active compute time.
Which platform is cheapest for bursty GPU inference?
Usually Modal, because idle time is not billed. An idle Lambda VM keeps accruing cost, and Replicate private deployments can bill setup and idle time on top of active time.
Why is comparing H100 prices across these platforms so messy?
Because the platforms sell different things: Lambda publishes H100 PCIe and H100 SXM instance pricing, Modal publishes per-second serverless execution pricing, and Replicate wraps GPU access inside a managed inference platform. Always compare effective cost per completed request, not just list price.
When do idle charges matter most on Replicate?
Mostly on private deployments, which typically bill setup, idle, and active time; public models are usually billed on runtime only.