Serverless GPU Pricing Matrix [2026 Developer Cheat Sheet]
Bottom Line
Modal is the cleanest scale-to-zero reference for serverless GPU jobs, Replicate is the fastest path to hosted model APIs, and Lambda usually wins raw per-GPU VM pricing but does not behave like true serverless.
Key Takeaways
- Modal lists H100 at $3.9492/hr; Replicate lists H100 at $5.49/hr.
- Lambda lists 1x H100 PCIe at $2.86/hr and 1x H100 SXM at $3.78/hr.
- Modal bills active compute time; Lambda bills while the instance is running; Replicate depends on model type.
- Replicate public models are often runtime-billed, but most private deployments also bill setup and idle time.
Prices were verified on May 7, 2026 against official vendor pricing pages and documentation. The tricky part is not just the posted $/GPU-hour number: Modal, Replicate, and Lambda each meter work very differently. This cheat sheet is optimized for engineering teams that need a fast reference for bursty inference, always-on deployments, and cost modeling before they commit to one platform.
- Modal H100: $3.9492/hr from $0.001097/sec.
- Replicate H100: $5.49/hr for gpu-h100.
- Lambda H100: $2.86/hr for 1x H100 PCIe, $3.78/hr for 1x H100 SXM.
- Idle billing is the real separator: Modal minimizes it, Lambda charges while running, Replicate depends on model type.
| Dimension | Modal | Replicate | Lambda | Edge |
|---|---|---|---|---|
| Service model | Serverless functions and containers | Hosted model API plus custom deployments | On-demand GPU instances and clusters | Modal for serverless purity |
| Single-GPU H100 listed price | $3.9492/hr | $5.49/hr | $2.86/hr PCIe or $3.78/hr SXM | Lambda on raw price |
| Listed A100 price | $2.4984/hr for A100 80GB | $5.04/hr for A100 80GB | $1.48/hr for A100 40GB | Modal on like-for-like 80GB |
| Listed L40S price | $1.9512/hr | $3.51/hr | Not listed on current public pricing | Modal |
| Low-end GPU options | T4 $0.5904/hr, L4 $0.7992/hr | T4 $0.81/hr | Quadro RTX 6000 $0.58/hr, A10 $0.86/hr | Mixed |
| Billing granularity | Per second actual compute time | Per second for hardware-timed runs; some official models use input or output pricing | Hourly pricing, billed in one-minute increments while running | Modal / Replicate |
| Idle billing | No idle charge for inactive serverless execution | Public models usually runtime-only; most private models bill setup, idle, and active time | Yes, while the instance is running | Modal |
| Operational access | Managed runtime | API-first, low infra control | Full VM access via SSH and JupyterLab | Lambda for control |
Note: the price winner changes with workload shape. Lambda is frequently the cheapest per posted GPU-hour, but that does not make it cheapest for spiky traffic if the VM sits idle. Replicate also mixes hardware-time pricing with model-level pricing, so compare the specific model page before locking in estimates.
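To make the workload-shape point concrete, here is a minimal back-of-envelope sketch in Python using the listed prices. The traffic shape (2,000 short requests per day) is an illustrative assumption, and the sketch ignores cold starts, Replicate setup and idle billing on private models, and the option of stopping a Lambda VM between bursts.
# Back-of-envelope daily cost for bursty H100 inference.
# Prices are the listed rates above; the workload numbers are assumptions.
MODAL_H100_PER_SEC = 0.001097        # $/sec; x 3600 = the $3.9492/hr list price
LAMBDA_H100_SXM_PER_HR = 3.78        # $/hr, billed while the instance runs
REPLICATE_H100_PER_HR = 5.49         # $/hr for hardware-timed runs

requests_per_day = 2000              # assumed traffic
seconds_per_request = 3              # assumed GPU time per request
active_seconds = requests_per_day * seconds_per_request

modal_daily = active_seconds * MODAL_H100_PER_SEC
replicate_daily = active_seconds / 3600 * REPLICATE_H100_PER_HR
lambda_daily = 24 * LAMBDA_H100_SXM_PER_HR   # VM left running all day

print(f"Modal:     ${modal_daily:.2f}/day")      # ~$6.58
print(f"Replicate: ${replicate_daily:.2f}/day")  # ~$9.15
print(f"Lambda:    ${lambda_daily:.2f}/day")     # ~$90.72 unless stopped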
Pricing Matrix
Bottom Line
If you need true scale-to-zero GPU execution, start with Modal. If you need the fastest path to hosted model APIs, start with Replicate. If you need full-machine control and the lowest headline $/GPU-hour, use Lambda and budget for idle time.
Common SKUs at a glance
- Modal publishes per-second GPU pricing, including B200, H200, H100, A100, L40S, A10, L4, and T4.
- Replicate publishes hardware pricing for A100 80GB, H100, L40S, and T4, plus multi-GPU variants.
- Lambda publishes instance pricing for B200, GH200, H100 PCIe, H100 SXM, A100, A10, A6000, and Quadro RTX 6000.
When To Choose Each
Choose Modal when:
- You want serverless GPU jobs without paying for idle machines.
- You are shipping Python-first inference, batch jobs, queues, or web endpoints.
- You need access to newer GPUs like H200 or B200 with simple code-level configuration.
- You want cleaner cost control for bursty traffic than a permanently running VM.
Choose Replicate when:
- You want an API for hosted models now, not a platform migration project.
- You value official models, stable model APIs, and built-in webhook patterns.
- You are comfortable paying a premium for packaging, distribution, and marketplace convenience.
- You want to expose inference quickly to web or product teams.
Choose Lambda when:
- You need full VM access, SSH, custom services, notebooks, or background daemons.
- Your jobs are long-running enough that VM-style billing is acceptable.
- You want strong raw $/GPU-hour economics and are willing to manage more infrastructure.
- You need a bridge from experimentation to more traditional cluster or multi-node workflows.
Commands By Purpose
Bootstrap access
pip install modal
modal token new
modal token info
Run a model on Replicate from Node.js
export REPLICATE_API_TOKEN=r8_******
npm install replicate
import Replicate from "replicate";
const replicate = new Replicate();
const [output] = await replicate.run(
"black-forest-labs/flux-schnell",
{
input: {
prompt: "An astronaut riding a rainbow unicorn, cinematic, dramatic"
}
}
);
Iterate and deploy on Modal
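The commands below assume an app.py along these lines; a minimal sketch, where the app name and function body are placeholders:
import modal

app = modal.App("example-app")

@app.function()
def train():
    # Placeholder workload; swap in the real training or batch step.
    print("training step complete")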
modal run app.py::train
modal serve app.py
modal deploy --env=prod --stream-logs app.py
Replicate sync and async prediction patterns
curl -s -X POST \
-H 'Prefer: wait' \
-H "Authorization: Bearer $REPLICATE_API_TOKEN" \
-H 'Content-Type: application/json' \
-d '{"version": "5c7d5dc6dd8bf75c1acaa8565735e7986bc5b66206b55cca93cb72c9bf15ccaa", "input": {"text": "Alice"}}' \
https://api.replicate.com/v1/predictions
curl -X POST \
-H "Authorization: Bearer $REPLICATE_API_TOKEN" \
-H "Cancel-After: 1m30s" \
-H "Content-Type: application/json" \
-d '{"input": {"prompt": "The sun rises slowly between tall buildings."}}' \
https://api.replicate.com/v1/predictions
Access a Lambda instance
ssh -i '<SSH-KEY-FILE-PATH>' ubuntu@<INSTANCE-IP>
ssh -L <LOCAL-PORT>:127.0.0.1:<REMOTE-PORT> ubuntu@<INSTANCE-IP>
Move data into Lambda
rsync -av --info=progress2 <FILES> <USERNAME>@<SERVER-IP>:<REMOTE-PATH>
s5cmd ls s3://<S3-BUCKET>
rclone ls my-remote:example-bucket
rclone -P copy <REMOTE>:<BUCKET-NAME> <LOCAL-DIR>
Configuration
Modal profiles, environments, and secrets
- Authentication can be stored with modal token set or environment variables like MODAL_TOKEN_ID and MODAL_TOKEN_SECRET.
- Environment scoping is first-class: create one with modal environment create dev and select it with modal config set-environment dev or MODAL_ENVIRONMENT.
- Secrets can be created with the CLI and injected into functions.
modal environment create dev
modal config set-environment dev
modal secret create api-creds OPENAI_API_KEY=sk-******
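A created secret can then be injected at function definition time; a minimal sketch, assuming the api-creds secret from above:
import os
import modal

app = modal.App()

@app.function(secrets=[modal.Secret.from_name("api-creds")])
def call_api():
    # Keys stored in the secret surface as environment variables in the container.
    print("key present:", "OPENAI_API_KEY" in os.environ)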
Replicate token hygiene
- Replicate API tokens are 40-character secrets that begin with r8_.
- Store them in environment variables, not source files.
- If you paste examples into docs or tickets, scrub them first with TechBytes' Data Masking Tool.
export REPLICATE_API_TOKEN=r8_******
Lambda account setup
- API keys authenticate Lambda Cloud API operations (see the sketch after the commands below).
- SSH keys are required before launching instances you want to access directly.
- By default, Lambda allows only incoming ICMP and TCP/22 unless you change firewall rules.
- Attached filesystems mount at /lambda/nfs/<FILESYSTEM_NAME>.
echo '<PUBLIC-KEY>' >> ~/.ssh/authorized_keys
cat ~/.ssh/authorized_keys
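For the API-key side, a minimal sketch of listing your instances via the Lambda Cloud API; the endpoint and basic-auth scheme reflect the public docs, but verify against the current API reference before relying on it:
import os
import requests

# The API key goes in the basic-auth username slot; the password stays empty.
resp = requests.get(
    "https://cloud.lambdalabs.com/api/v1/instances",
    auth=(os.environ["LAMBDA_API_KEY"], ""),
)
for inst in resp.json().get("data", []):
    print(inst.get("id"), inst.get("ip"), inst.get("status"))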
Advanced Usage
Modal: pick GPU type in code
Modal keeps GPU selection close to the workload definition, which is useful when you want code review to capture cost-sensitive changes.
import modal
app = modal.App()
image = modal.Image.debian_slim().pip_install("torch")
@app.function(gpu="H100:2", image=image)  # request two H100s in one container
def run():
import torch
print(torch.cuda.is_available())
- Modal supports appending :n to request multiple GPUs per container.
- The current docs list GPU values including T4, L4, A10, L40S, A100, H100, H200, and B200.
Replicate: use sync for latency, async for orchestration
- Use Prefer: wait for short-lived inference where you want the output in one request.
- Use default async mode plus webhooks for background jobs (see the sketch after this list).
- Use Cancel-After to cap runaway cost on slow generations.
- Remember that official models may use predictable per-output pricing instead of hardware-time pricing.
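A minimal sketch of the async-plus-webhook pattern against the predictions API; the model version hash and webhook receiver URL are placeholders:
import os
import requests

resp = requests.post(
    "https://api.replicate.com/v1/predictions",
    headers={
        "Authorization": f"Bearer {os.environ['REPLICATE_API_TOKEN']}",
        "Content-Type": "application/json",
    },
    json={
        "version": "<MODEL-VERSION-ID>",                  # placeholder
        "input": {"prompt": "The sun rises slowly between tall buildings."},
        "webhook": "https://example.com/replicate-hook",  # placeholder receiver
        "webhook_events_filter": ["completed"],           # ping only on terminal states
    },
)
prediction = resp.json()
# Without a webhook, poll prediction["urls"]["get"] until "status" becomes
# "succeeded", "failed", or "canceled".
print(prediction.get("id"), prediction.get("status"))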
Lambda: treat storage and connectivity as first-class concerns
- Single-GPU instances usually launch in 3-5 minutes; multi-GPU instances usually take 10-15 minutes.
- Billing starts after the instance boots and passes health checks, then continues while the instance runs.
- Use rsync, rclone, or s5cmd for data movement. The Filesystem S3 Adapter adds S3-compatible tooling support in select regions.
- If you only need an internal service exposed locally, prefer an SSH tunnel over opening public ports.
ip -4 -br addr show enp5s0
nmap -Pn <INSTANCE-IP>
Sources And Notes
- Modal Pricing
- Modal GPU Docs
- Modal CLI: run
- Modal CLI: deploy
- Modal CLI: serve
- Modal CLI: token
- Modal CLI: secret
- Modal CLI: environment
- Replicate Pricing
- Replicate Node.js Quickstart
- Replicate Predictions API
- Replicate API Tokens
- Replicate Official Models
- Lambda Pricing
- Lambda: Creating and Managing Instances
- Lambda Billing
- Lambda: Connecting to an Instance
- Lambda: Importing and Exporting Data
- Lambda Filesystems
- Lambda Filesystem S3 Adapter
If you want to turn the snippets above into team-safe examples, run them through the TechBytes Code Formatter before publishing them into internal docs or runbooks.
Frequently Asked Questions
Is Lambda Labs actually serverless in the same way as Modal?
No. Lambda provisions on-demand VM instances that bill while they run; Modal scales to zero and bills only active compute time.
Which platform is cheapest for bursty GPU inference?
Usually Modal, because idle time is not billed. An idle Lambda VM keeps accruing cost, and Replicate private deployments can bill setup and idle time on top of active time.
Why is comparing H100 prices across these platforms so messy?
Because the platforms sell different things: Lambda publishes H100 PCIe and H100 SXM instance pricing, Modal publishes per-second serverless execution pricing, and Replicate wraps GPU access inside a managed inference platform. Always compare effective cost per completed request, not just list price.
When do idle charges matter most on Replicate?
Mostly on private deployments, which typically bill setup, idle, and active time; public models are usually billed on runtime only.