Developer Reference

Serverless GPU Pricing Matrix [2026 Developer Cheat Sheet]

Dillip Chowdary
Tech Entrepreneur & Innovator · May 07, 2026 · 12 min read

Bottom Line

Modal is the cleanest scale-to-zero reference for serverless GPU jobs, Replicate is the fastest path to hosted model APIs, and Lambda usually wins raw per-GPU VM pricing but does not behave like true serverless.

Key Takeaways

  • Modal lists H100 at $3.9492/hr; Replicate lists H100 at $5.49/hr.
  • Lambda lists 1x H100 PCIe at $2.86/hr and 1x H100 SXM at $3.78/hr.
  • Modal bills active compute time; Lambda bills while the instance is running; Replicate depends on model type.
  • Replicate public models are often runtime-billed, but most private deployments also bill setup and idle time.

Prices verified on May 7, 2026 from official vendor pricing and documentation. The tricky part is not just the posted $/GPU-hour number: Modal, Replicate, and Lambda meter work very differently. This cheat sheet is optimized for engineering teams that need a fast reference for bursty inference, always-on deployments, and cost modeling before they commit to one platform.

  • Modal H100: $3.9492/hr, derived from the listed $0.001097/sec rate.
  • Replicate H100: $5.49/hr for gpu-h100.
  • Lambda H100: $2.86/hr for 1x H100 PCIe, $3.78/hr for 1x H100 SXM.
  • Idle billing is the real separator: Modal minimizes it, Lambda charges while running, Replicate depends on model type.

Pricing Matrix

| Dimension | Modal | Replicate | Lambda | Edge |
| --- | --- | --- | --- | --- |
| Service model | Serverless functions and containers | Hosted model API plus custom deployments | On-demand GPU instances and clusters | Modal for serverless purity |
| Single-GPU H100 listed price | $3.9492/hr | $5.49/hr | $2.86/hr PCIe or $3.78/hr SXM | Lambda on raw price |
| Listed A100 price | $2.4984/hr (A100 80GB) | $5.04/hr (A100 80GB) | $1.48/hr (A100 40GB) | Modal on like-for-like 80GB |
| Listed L40S price | $1.9512/hr | $3.51/hr | Not listed on current public pricing | Modal |
| Low-end GPU option | T4 $0.5904/hr, L4 $0.7992/hr | T4 $0.81/hr | Quadro RTX 6000 $0.58/hr, A10 $0.86/hr | Mixed |
| Billing granularity | Per second of actual compute time | Per second for hardware-timed runs; some official models use input or output pricing | Hourly pricing, billed in one-minute increments while running | Modal / Replicate |
| Idle billing | No idle charge for inactive serverless execution | Public models usually runtime-only; most private models bill setup, idle, and active time | Yes, while the instance is running | Modal |
| Operational access | Managed runtime | API-first, low infra control | Full VM access via SSH and JupyterLab | Lambda for control |

Note: the price winner changes with workload shape. Lambda is frequently cheapest per posted GPU-hour, but that does not make it cheapest for spiky traffic if the VM sits idle. Replicate also mixes hardware-time pricing with model-level pricing, so compare the specific model page before locking in estimates.

Bottom Line

If you need true scale-to-zero GPU execution, start with Modal. If you need the fastest path to hosted model APIs, start with Replicate. If you need full-machine control and the lowest headline $/GPU-hour, use Lambda and budget for idle time.

Common SKUs at a glance

  • Modal publishes per-second GPU pricing, including B200, H200, H100, A100, L40S, A10, L4, and T4.
  • Replicate publishes hardware pricing for A100 80GB, H100, L40S, and T4, plus multi-GPU variants.
  • Lambda publishes instance pricing for B200, GH200, H100 PCIe, H100 SXM, A100, A10, A6000, and Quadro RTX 6000.

Watch out: A lower posted hourly price does not automatically mean a lower total bill. Lambda keeps charging while an instance is up, and Replicate private models usually bill setup and idle time in addition to active inference.

When To Choose Each

Choose Modal when:

  • You want serverless GPU jobs without paying for idle machines.
  • You are shipping Python-first inference, batch jobs, queues, or web endpoints.
  • You need access to newer GPUs like H200 or B200 with simple code-level configuration.
  • You want cleaner cost control for bursty traffic than a permanently running VM.

Choose Replicate when:

  • You want an API for hosted models now, not a platform migration project.
  • You value official models, stable model APIs, and built-in webhook patterns.
  • You are comfortable paying a premium for packaging, distribution, and marketplace convenience.
  • You want to expose inference quickly to web or product teams.

Choose Lambda when:

  • You need full VM access, SSH, custom services, notebooks, or background daemons.
  • Your jobs are long-running enough that VM-style billing is acceptable.
  • You want strong raw $/GPU-hour economics and are willing to manage more infrastructure.
  • You need a bridge from experimentation to more traditional cluster or multi-node workflows.

Pro tip: For internal cost reviews, model both active GPU minutes and wall-clock uptime, as sketched below. That single change usually explains why a cheap VM can lose to a more expensive serverless runtime on real traffic.
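
A minimal sketch of that model in Python, using the listed H100 rates from the matrix above; the traffic numbers are hypothetical:

REQUESTS_PER_DAY = 5_000        # hypothetical bursty workload
SECONDS_PER_REQUEST = 2         # H100 compute per request

MODAL_H100_PER_SEC = 0.001097   # Modal's listed per-second H100 rate
LAMBDA_H100_SXM_PER_HR = 3.78   # Lambda's listed 1x H100 SXM rate

active_s = REQUESTS_PER_DAY * SECONDS_PER_REQUEST

modal_daily = active_s * MODAL_H100_PER_SEC   # billed on active compute only
lambda_daily = 24 * LAMBDA_H100_SXM_PER_HR    # billed while the VM is up all day

print(f"utilization  {active_s / 86_400:.1%}")   # ~11.6%
print(f"Modal  ~${modal_daily:.2f}/day")         # ~$10.97
print(f"Lambda ~${lambda_daily:.2f}/day")        # $90.72

At these rates the VM only wins once the GPU is busy nearly around the clock, which is exactly the wall-clock-versus-active-minutes effect the tip describes.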

Commands By Purpose


Bootstrap access

pip install modal
modal token new
modal token info

Run a model on Replicate from Node.js

# Shell: the client reads this variable at startup
export REPLICATE_API_TOKEN=r8_******
npm install replicate

// JavaScript (Node.js)
import Replicate from "replicate";
const replicate = new Replicate(); // picks up REPLICATE_API_TOKEN automatically

const [output] = await replicate.run(
  "black-forest-labs/flux-schnell",
  {
    input: {
      prompt: "An astronaut riding a rainbow unicorn, cinematic, dramatic"
    }
  }
);

Iterate and deploy on Modal

modal run app.py::train                        # one-off cloud run of the train function
modal serve app.py                             # local dev loop with hot reload
modal deploy --env=prod --stream-logs app.py   # long-lived deployment into the prod environment
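
For orientation, here is a minimal, hypothetical app.py those commands could target; the train function name matches the modal run invocation above, and everything else is illustrative:

import modal

app = modal.App("example-app")

@app.function()
def train():
    # `modal run app.py::train` executes this body in a Modal container.
    print("training step")

@app.local_entrypoint()
def main():
    # Plain `modal run app.py` starts here and fans out to the cloud.
    train.remote()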

Replicate sync and async prediction patterns

# Synchronous: Prefer: wait holds the connection open until the prediction completes
curl -s -X POST \
  -H 'Prefer: wait' \
  -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"version": "5c7d5dc6dd8bf75c1acaa8565735e7986bc5b66206b55cca93cb72c9bf15ccaa", "input": {"text": "Alice"}}' \
  https://api.replicate.com/v1/predictions

# Asynchronous: returns a prediction ID immediately; Cancel-After aborts the run at 90s
curl -X POST \
  -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
  -H "Cancel-After: 1m30s" \
  -H "Content-Type: application/json" \
  -d '{"version": "<MODEL-VERSION-ID>", "input": {"prompt": "The sun rises slowly between tall buildings."}}' \
  https://api.replicate.com/v1/predictions

Access a Lambda instance

ssh -i '<SSH-KEY-FILE-PATH>' ubuntu@<INSTANCE-IP>                  # interactive shell on the instance
ssh -L <LOCAL-PORT>:127.0.0.1:<REMOTE-PORT> ubuntu@<INSTANCE-IP>   # tunnel a remote service to localhost

Move data into Lambda

# Push local files to the instance with progress reporting
rsync -av --info=progress2 <FILES> <USERNAME>@<SERVER-IP>:<REMOTE-PATH>

# Inspect and pull object-storage data
s5cmd ls s3://<S3-BUCKET>
rclone ls my-remote:example-bucket
rclone -P copy <REMOTE>:<BUCKET-NAME> <LOCAL-DIR>

Configuration

Modal profiles, environments, and secrets

  • Authentication can be stored with modal token set or environment variables like MODAL_TOKEN_ID and MODAL_TOKEN_SECRET.
  • Environment scoping is first-class: create one with modal environment create dev and select it with modal config set-environment dev or MODAL_ENVIRONMENT.
  • Secrets can be created with the CLI and injected into functions.

modal environment create dev
modal config set-environment dev
modal secret create api-creds OPENAI_API_KEY=sk-******
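
A sketch of consuming that secret from a function; the api-creds name matches the command above, and the function itself is hypothetical:

import modal

app = modal.App()

@app.function(secrets=[modal.Secret.from_name("api-creds")])
def call_api():
    import os
    # Keys stored in the secret arrive as environment variables in the container.
    key = os.environ["OPENAI_API_KEY"]
    print("loaded key prefix:", key[:3])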

Replicate token hygiene

  • Replicate API tokens are 40-character secrets that begin with r8_.
  • Store them in environment variables, not source files.
  • If you paste examples into docs or tickets, scrub them first with TechBytes' Data Masking Tool.

export REPLICATE_API_TOKEN=r8_******
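
If you want a guardrail in CI or local scripts, a loose sanity check based on that documented shape is a sketch away; the exact character set is an assumption:

import os
import re

token = os.environ.get("REPLICATE_API_TOKEN", "")
# 40 characters total, starting with r8_ (the character class is a guess).
if not re.fullmatch(r"r8_[A-Za-z0-9]{37}", token):
    raise SystemExit("REPLICATE_API_TOKEN is missing or looks malformed")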

Lambda account setup

  • API keys authenticate Lambda Cloud API operations.
  • SSH keys are required before launching instances you want to access directly.
  • By default, Lambda allows only incoming ICMP and TCP/22 unless you change firewall rules.
  • Attached filesystems mount at /lambda/nfs/<FILESYSTEM_NAME>.

echo '<PUBLIC-KEY>' >> ~/.ssh/authorized_keys   # on the instance: authorize an additional key
cat ~/.ssh/authorized_keys                      # verify the key was appended
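
As a quick illustration of the API-key flow, the sketch below lists instances with Python; the endpoint path and response fields follow Lambda's public Cloud API docs, but verify both before relying on them:

import os

import requests

resp = requests.get(
    "https://cloud.lambdalabs.com/api/v1/instances",
    headers={"Authorization": f"Bearer {os.environ['LAMBDA_API_KEY']}"},
    timeout=30,
)
resp.raise_for_status()
for inst in resp.json().get("data", []):
    # Fields assumed from the documented response shape.
    print(inst.get("id"), inst.get("ip"), inst.get("status"))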

Advanced Usage

Modal: pick GPU type in code

Modal keeps GPU selection close to the workload definition, which is useful when you want code review to capture cost-sensitive changes.

import modal

app = modal.App()
image = modal.Image.debian_slim().pip_install("torch")

@app.function(gpu="H100:2", image=image)
def run():
    import torch
    print(torch.cuda.is_available())

  • Modal supports appending :n to request multiple GPUs per container.
  • The current docs list GPU values including T4, L4, A10, L40S, A100, H100, H200, and B200.

Replicate: use sync for latency, async for orchestration

  • Use Prefer: wait for short-lived inference where you want the output in one request.
  • Use the default async mode plus webhooks for background jobs; both patterns are sketched after this list.
  • Use Cancel-After to cap runaway cost on slow generations.
  • Remember that official models may use predictable per-output pricing instead of hardware-time pricing.
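
A Python sketch of both patterns with the official replicate client; the webhook URL and version placeholder are hypothetical:

import replicate

# Sync-style: blocks until the model finishes, like `Prefer: wait` over HTTP.
output = replicate.run(
    "black-forest-labs/flux-schnell",
    input={"prompt": "An astronaut riding a rainbow unicorn"},
)

# Async-style: create the prediction, then let your webhook receive the result.
prediction = replicate.predictions.create(
    version="<MODEL-VERSION-ID>",  # placeholder: use a real version hash
    input={"prompt": "The sun rises slowly between tall buildings."},
    webhook="https://example.com/replicate-hook",  # hypothetical receiver
    webhook_events_filter=["completed"],
)
print(prediction.id, prediction.status)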

Lambda: treat storage and connectivity as first-class concerns

  • Single-GPU instances usually launch in 3-5 minutes; multi-GPU instances usually take 10-15 minutes.
  • Billing starts after the instance boots and passes health checks, then continues while the instance runs.
  • Use rsync, rclone, or s5cmd for data movement. The Filesystem S3 Adapter adds S3-compatible tooling support in select regions.
  • If you only need an internal service exposed locally, prefer an SSH tunnel over opening public ports.

ip -4 -br addr show enp5s0      # brief IPv4 view of the enp5s0 interface
nmap -Pn <INSTANCE-IP-ADDRESS>  # check which ports are reachable from outside

Sources And Notes

If you want to turn the snippets above into team-safe examples, run them through the TechBytes Code Formatter before publishing them into internal docs or runbooks.

Frequently Asked Questions

Is Lambda Labs actually serverless in the same way as Modal?
No. Based on Lambda's current public cloud docs and billing pages, Lambda is primarily an on-demand GPU instance platform. It can be excellent value, but it bills while the instance is running, so it is better treated as a VM baseline than a scale-to-zero serverless runtime.

Which platform is cheapest for bursty GPU inference?
For bursty traffic, Modal often has the cleanest economics because you only pay for active compute time. Lambda can post a lower hourly GPU rate, but idle VM time can erase that advantage quickly. Replicate is usually the convenience-first option rather than the lowest-cost option.

Why is comparing H100 prices across these platforms so messy?
Because the billing model and hardware shape differ. Lambda exposes both H100 PCIe and H100 SXM instance pricing, Modal publishes per-second serverless execution pricing, and Replicate wraps GPU access inside a managed inference platform. Always compare effective cost per completed request, not just list price.

When do idle charges matter most on Replicate?
Idle charges matter most when you use private models that stay online on dedicated hardware. Replicate's pricing docs state that most private models bill for setup, idle, and active processing time, while public models are usually billed only for runtime.
