Serverless GPU Clusters for LLM Inference [Deep Dive]
Bottom Line
The clean pattern is to let KEDA wake inference replicas on queue pressure, then let Karpenter provision GPU nodes only for the pods that cannot yet schedule. You keep burst capacity for LLM traffic without paying for an always-on GPU pool.
Key Takeaways
- KEDA can scale a vLLM deployment from 0 based on vllm:num_requests_waiting.
- Karpenter provisions GPU nodes only after the inference pod is actually unschedulable.
- vLLM already exposes /health and Prometheus metrics, so probes and scaling stay simple.
- Use on-demand first, then test spot or time-slicing after you validate latency.
A serverless GPU cluster is not magic; it is two control loops joined together. Karpenter creates GPU nodes only when an inference pod cannot schedule, KEDA scales replicas from zero when demand appears, and vLLM exposes the queue and latency signals that let you drive the whole system. This walkthrough uses current docs as of May 22, 2026 and targets an EKS-style Kubernetes cluster serving high-concurrency LLM traffic.
Prerequisites
What you need before you start
- A Kubernetes cluster that meets KEDA's current Kubernetes 1.30+ requirement.
- Karpenter already installed with AWS permissions to create nodes.
- Prometheus already scraping your cluster, typically through Prometheus Operator or kube-prometheus-stack.
- A Hugging Face token for your model, and quota for at least one GPU instance family such as g5.
- The kubectl, helm, and aws CLIs configured locally.
- If you want to sanity-check YAML before applying it, run it through TechBytes Code Formatter.
Bottom Line
Scale the pods with KEDA and the nodes with Karpenter. That separation is what makes serverless GPU inference predictable under burst load.
The architecture in this tutorial is intentionally simple: one vLLM deployment, one GPU NodePool, one Prometheus-backed KEDA trigger, and one public or internal service in front of it. Start there. You can add spot capacity, shared GPUs, or multiple model pools after you prove the baseline SLO.
Build the stack
These steps assume EKS, but the pattern generalizes to any cluster with an equivalent node autoscaler. The important distinction is that KEDA decides when more replicas are needed, while Karpenter decides where those replicas can land.
1. Create a GPU-only NodeClass and NodePool
Keep the GPU pool isolated so general workloads never consume expensive nodes by accident. A taint plus explicit requirements is enough for the first pass.
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-burst
spec:
  role: KarpenterNodeRole-${CLUSTER_NAME}
  amiSelectorTerms:
    - alias: al2@latest
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: ${CLUSTER_NAME}
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: ${CLUSTER_NAME}
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-burst
spec:
  template:
    metadata:
      labels:
        workload: llm
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-burst
      # The taint keeps pods without a matching toleration off these nodes.
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: [amd64]
        - key: kubernetes.io/os
          operator: In
          values: [linux]
        - key: karpenter.sh/capacity-type
          operator: In
          values: [on-demand]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: [g5]
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: In
          values: ['1']
  # Caps the total vCPUs Karpenter may provision in this pool.
  limits:
    cpu: 128
  # Remove a node once it has been empty for 2 minutes.
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 2m
This is the moment the stack becomes “serverless.” There is still no GPU node running. Karpenter will wait until a real pod requests nvidia.com/gpu.
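The limits.cpu value in the NodePool also bounds how far a burst can grow. As a rough sketch, the ceiling on node count can be estimated with integer division. The helper name max_nodes is hypothetical, and the vCPU figures are assumptions to check against the EC2 instance tables: 8 vCPUs for g5.2xlarge (likely the smallest single-GPU g5 size that fits the 6-CPU request used later in this walkthrough) and 64 vCPUs for g5.16xlarge.

```shell
# Estimate how many nodes fit under a NodePool CPU limit.
# Arguments: <cpu_limit> <vcpus_per_node>. Both values are assumptions you supply.
max_nodes() {
  cpu_limit=$1
  vcpus_per_node=$2
  echo $(( cpu_limit / vcpus_per_node ))
}

max_nodes 128 8    # assumed g5.2xlarge size, prints 16
max_nodes 128 64   # assumed g5.16xlarge size, prints 2
```

Treat the result as an upper bound. How Karpenter behaves right at the limit depends on the version you run, so leave headroom if the maximum replica count must always be schedulable.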
2. Install the NVIDIA device plugin
vLLM can request GPUs only after Kubernetes knows how to expose them. NVIDIA's official k8s-device-plugin chart is the fastest path.
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin --version 0.17.1 --namespace nvidia-device-plugin --create-namespace --set gfd.enabled=true
kubectl get nodes -L nvidia.com/gpu.present,nvidia.com/gpu.product
If you later decide to use time-slicing, do it only after you have latency numbers for single-tenant GPUs. Shared GPU capacity can improve cost efficiency, but it can also make p95 and p99 inference behavior much harder to reason about.
3. Deploy vLLM with health checks and metrics
vLLM already exposes the pieces you need: /health for probes and /metrics for Prometheus. The deployment below starts at 0 replicas so you only pay when traffic arrives.
kubectl create namespace llm
kubectl create secret generic hf-token -n llm --from-literal=HF_TOKEN=$HF_TOKEN
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  namespace: llm
spec:
  replicas: 0
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      # Matches the taint on the gpu-burst NodePool.
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: server
          image: vllm/vllm-openai
          # Set the command explicitly so "serve" in args is not appended
          # to whatever entrypoint the image version defines.
          command: [vllm]
          args:
            - serve
            - meta-llama/Llama-3.1-8B-Instruct
            - --served-model-name
            - llama-8b
            - --host
            - 0.0.0.0
            - --port
            - '8000'
            - --gpu-memory-utilization
            - '0.92'
            - --max-model-len
            - '8192'
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: HF_TOKEN
          ports:
            - containerPort: 8000
          # Holds off the liveness probe while the model downloads and loads.
          # The 15-minute budget is an assumption; tune it for your model.
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 10
            failureThreshold: 90
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
          resources:
            requests:
              cpu: '6'
              memory: 24Gi
              nvidia.com/gpu: '1'
            limits:
              cpu: '6'
              memory: 24Gi
              nvidia.com/gpu: '1'
---
apiVersion: v1
kind: Service
metadata:
  name: vllm
  namespace: llm
  labels:
    app: vllm
spec:
  selector:
    app: vllm
  ports:
    - name: http
      port: 8000
      targetPort: 8000
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm
  namespace: llm
spec:
  # Selects the Service above by its app label.
  selector:
    matchLabels:
      app: vllm
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
The most important metric for burst scaling is vllm:num_requests_waiting. It directly measures queued work, which is much more useful than CPU percentage for token generation systems.
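To see what that signal looks like before Prometheus is involved, the metric can be read straight from the text exposition format. The sketch below sums every vllm:num_requests_waiting sample on standard input. The function name sum_waiting is hypothetical, and the sample lines (including the model_name label value) are made up for illustration, as if concatenated from two pods.

```shell
# Sum all vllm:num_requests_waiting samples from Prometheus text format on stdin.
# Comment lines (# HELP, # TYPE) and other metrics do not match the pattern.
sum_waiting() {
  awk '/^vllm:num_requests_waiting/ { sum += $NF } END { print sum + 0 }'
}

# Illustrative sample only, not real output from a cluster.
sum_waiting <<'EOF'
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="llama-8b"} 3.0
vllm:num_requests_waiting{model_name="llama-8b"} 2.0
vllm:num_requests_running{model_name="llama-8b"} 4.0
EOF
# prints 5
```

Against a live pod, the same function could be fed from the metrics endpoint, for example curl -s http://127.0.0.1:8000/metrics | sum_waiting through a port-forward.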
4. Install KEDA and scale on queue depth
KEDA will watch Prometheus and drive the Deployment replica count. One pending replica is enough to wake Karpenter and trigger GPU node creation.
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace
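apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-queue
  namespace: llm
spec:
  scaleTargetRef:
    name: vllm
  minReplicaCount: 0
  maxReplicaCount: 8
  pollingInterval: 15   # seconds between Prometheus queries
  cooldownPeriod: 180   # seconds of inactivity before scaling back to 0
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-service.monitoring.svc.cluster.local:9090
        query: 'sum(vllm:num_requests_waiting{pod=~"vllm-.*"})'
        # Target number of waiting requests per replica.
        threshold: '1'
        # KEDA activates from 0 only when the value is strictly greater than this.
        activationThreshold: '1'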
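Replace serverAddress with the DNS name of your Prometheus service. Also check the exact target labels your Prometheus scrape adds; if you do not have a pod label, change the selector in the query to whatever label uniquely identifies the vLLM target. Keep in mind that vllm:num_requests_waiting is exported by the vLLM pods themselves, so at 0 replicas the query returns no series. A wake-up from zero therefore needs a demand signal that exists while no pod is running, such as a request or queue metric from the gateway in front of the service.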
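It helps to work through what these thresholds imply for replica counts. The sketch below is a steady-state simplification, not KEDA's implementation: it assumes the trigger's default AverageValue behaviour (desired replicas is the metric divided by threshold, rounded up), treats activation as strictly greater than activationThreshold, and ignores polling, cooldown, and HPA stabilization. The function name desired_replicas is hypothetical.

```shell
# Approximate steady-state replica count for a queue-depth trigger.
# Arguments: <waiting> <threshold> <activation_threshold> <max_replicas>
desired_replicas() {
  waiting=$1
  threshold=$2
  activation=$3
  max=$4
  # Not active unless the metric is strictly above the activation threshold.
  if [ "$waiting" -le "$activation" ]; then
    echo 0
    return
  fi
  # Integer ceiling of waiting / threshold.
  want=$(( (waiting + threshold - 1) / threshold ))
  if [ "$want" -gt "$max" ]; then
    want=$max
  fi
  echo "$want"
}

desired_replicas 1 1 1 8    # prints 0: one queued request does not exceed activation
desired_replicas 3 1 1 8    # prints 3
desired_replicas 40 1 1 8   # prints 8: capped by maxReplicaCount
```

Under these assumptions a single queued request never wakes the deployment when activationThreshold is 1; set it to 0 if one waiting request should be enough.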
5. Generate concurrent load
Now trigger a burst and watch the chain reaction: KEDA raises replicas, Kubernetes marks the pod pending, Karpenter creates a GPU node, the NVIDIA plugin advertises the device, and the pod turns ready.
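# Terminal 1: the port-forward blocks, so leave it running here.
# It needs at least one running vLLM pod behind the service.
kubectl -n llm port-forward svc/vllm 8000:8000

# Terminal 2: fire 40 concurrent completion requests and wait for all of them.
for i in $(seq 1 40); do
  curl -s http://127.0.0.1:8000/v1/completions -H 'Content-Type: application/json' -d '{"model":"llama-8b","prompt":"Explain serverless GPU autoscaling in one sentence.","max_tokens":64}' > /dev/null &
done
wait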
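A burst that discards all output cannot tell you whether any request failed. As a hedged variant, the sketch below runs a command N times concurrently and reports how many copies exited non-zero. The function name burst is hypothetical, and the demo uses true and false as stand-ins for the real request.

```shell
# Run a command N times in parallel and print how many runs failed.
# Usage: burst <count> <command> [args...]
burst() {
  n=$1
  shift
  pids=""
  i=0
  while [ "$i" -lt "$n" ]; do
    "$@" > /dev/null 2>&1 &
    pids="$pids $!"
    i=$((i + 1))
  done
  failed=0
  for pid in $pids; do
    # wait returns the exit status of that background job.
    wait "$pid" || failed=$((failed + 1))
  done
  echo "$failed"
}

burst 5 true    # prints 0
burst 3 false   # prints 3
```

To use it for the load test, pass the same curl command with -f added (for example burst 40 curl -sf http://127.0.0.1:8000/v1/completions followed by the header and body flags) so that HTTP error responses count as failures.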
For a production gateway, put an internal or public load balancer in front of the service and point clients at the OpenAI-compatible endpoint exposed by vLLM.
Verification and expected output
You are looking for evidence that both scaling loops fired in the right order. These checks are faster than staring at a dashboard and guessing.
kubectl get scaledobject -n llm
kubectl get deploy,pods -n llm -w
kubectl get nodeclaims
kubectl get nodes -l karpenter.sh/nodepool=gpu-burst
Expected results:
- The ScaledObject becomes active shortly after the burst and increases desired replicas above 0.
- A new NodeClaim appears, followed by a fresh GPU node in the gpu-burst NodePool.
- The vLLM pod transitions from Pending to Running once the node is ready and the model finishes loading.
- After traffic drains and the cooldown expires, replicas fall back to 0 and Karpenter consolidates the empty GPU node.
NAME READY ACTIVE FALLBACK PAUSED TRIGGERS
vllm-queue True True False False prometheus
NAME READY STATUS RESTARTS AGE
vllm-7f8c9c9d6b-2kqxh 1/1 Running 0 3m
If your metrics path is correct, you should also see vllm:num_requests_waiting spike during the burst and decay back toward zero as the backlog clears.
Troubleshooting
- No GPU node appears. Check kubectl describe pod on the pending vLLM pod first. If the event stream shows unsatisfied quotas, bad subnet or security group selectors, or no matching AMI, Karpenter is blocked before it even gets to EC2 capacity.
- The pod schedules but never becomes ready. That is usually a model pull or VRAM problem. Verify the Hugging Face token secret, then inspect logs for out-of-memory messages and lower --max-model-len or choose a smaller model.
- KEDA never scales. Query Prometheus directly and confirm the metric exists with the labels used in your trigger. Most failed KEDA setups are not autoscaler bugs; they are bad serverAddress values or PromQL filters that match nothing.
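For the last case, a quick way to tell "query matched nothing" apart from "query returned data" is to inspect the JSON that the Prometheus HTTP API returns from /api/v1/query. The sketch below is a crude string check that assumes the compact JSON Prometheus emits; the function name prom_has_series is hypothetical, and a JSON parser such as jq is the more robust choice if you have one available.

```shell
# Succeed (exit 0) only when a Prometheus query response contains at least one series.
# Argument: the raw JSON body returned by /api/v1/query.
prom_has_series() {
  case "$1" in
    *'"result":[]'*) return 1 ;;          # query ran but matched nothing
    *'"status":"success"'*) return 0 ;;   # query ran and returned data
    *) return 1 ;;                        # error response or unreachable server
  esac
}

# Example wiring (not executed here); PROM_URL is a placeholder for your server:
#   body=$(curl -s "$PROM_URL/api/v1/query" \
#     --data-urlencode 'query=sum(vllm:num_requests_waiting{pod=~"vllm-.*"})')
#   prom_has_series "$body" || echo "query matched no series"
```

An empty result with a successful status usually points at the label selector, while a failed request points at serverAddress.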
If you need to share manifests or logs outside your org, scrub tokens, account IDs, and service names first. TechBytes' Data Masking Tool is useful for that cleanup step.
What's next
Once the baseline path is stable, improve efficiency deliberately instead of all at once.
- Add a second NodePool for spot and route only background or non-SLO traffic there.
- Split models by traffic class so small, fast models do not queue behind large, slow ones on the same GPU fleet.
- Introduce time-slicing only after measuring how it changes tail latency for your prompts and decode lengths.
- Pin AMIs, model images, and autoscaler versions in CI so scale events remain reproducible during incident response.
- Move from a single deployment to multiple pools when you need tenant isolation, regional sharding, or mixed GPU families.
The core lesson is that “serverless GPUs” are really disciplined scheduling. Treat queue depth as the demand signal, keep node creation separate from replica creation, and the cluster stays understandable even when concurrency spikes hard.
Frequently Asked Questions
Can KEDA really scale GPU inference down to zero?
Yes. KEDA controls the Deployment replica count, not the node count, so setting minReplicaCount: 0 is valid. When no pods need a GPU, Karpenter can consolidate the empty node and remove the last GPU instance.
Why use vLLM queue metrics instead of CPU for autoscaling?
How do I reduce cold-start latency in a serverless GPU cluster?
Should I enable NVIDIA time-slicing for LLM inference?