Cloud Infrastructure

Serverless GPU Clusters for LLM Inference [Deep Dive]

Dillip Chowdary
Tech Entrepreneur & Innovator · May 22, 2026 · 9 min read

Bottom Line

The clean pattern is to let KEDA wake inference replicas on queue pressure, then let Karpenter provision GPU nodes only for the pods that cannot yet schedule. You keep burst capacity for LLM traffic without paying for an always-on GPU pool.

Key Takeaways

  • KEDA can scale a vLLM deployment from 0 based on vllm:num_requests_waiting.
  • Karpenter provisions GPU nodes only after the inference pod is actually unschedulable.
  • vLLM already exposes /health and Prometheus metrics, so probes and scaling stay simple.
  • Use on-demand first, then test spot or time-slicing after you validate latency.

A serverless GPU cluster is not magic; it is two control loops joined together. Karpenter creates GPU nodes only when an inference pod cannot schedule, KEDA scales replicas from zero when demand appears, and vLLM exposes the queue and latency signals that let you drive the whole system. This walkthrough uses current docs as of May 22, 2026 and targets an EKS-style Kubernetes cluster serving high-concurrency LLM traffic.

Prerequisites

What you need before you start

  • A Kubernetes cluster that meets KEDA's current Kubernetes 1.30+ requirement.
  • Karpenter already installed with AWS permissions to create nodes.
  • Prometheus already scraping your cluster, typically through Prometheus Operator or kube-prometheus-stack.
  • A Hugging Face token for your model, and quota for at least one GPU instance family such as g5.
  • The kubectl, helm, and aws CLIs configured locally.
  • If you want to sanity-check YAML before applying it, run it through TechBytes Code Formatter.

Bottom Line

Scale the pods with KEDA and the nodes with Karpenter. That separation is what makes serverless GPU inference predictable under burst load.

The architecture in this tutorial is intentionally simple: one vLLM deployment, one GPU NodePool, one Prometheus-backed KEDA trigger, and one public or internal service in front of it. Start there. You can add spot capacity, shared GPUs, or multiple model pools after you prove the baseline SLO.

Watch out: Karpenter supports latest AMI aliases, but production clusters should pin a tested AMI release before you let it replace nodes automatically.

Build the stack

These steps assume EKS, but the pattern generalizes to any cluster with an equivalent node autoscaler. The important distinction is that KEDA decides when more replicas are needed, while Karpenter decides where those replicas can land.

1. Create a GPU-only NodeClass and NodePool

Keep the GPU pool isolated so general workloads never consume expensive nodes by accident. A taint plus explicit requirements is enough for the first pass.

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-burst
spec:
  role: KarpenterNodeRole-${CLUSTER_NAME}
  amiSelectorTerms:
    - alias: al2@latest
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: ${CLUSTER_NAME}
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: ${CLUSTER_NAME}
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-burst
spec:
  template:
    metadata:
      labels:
        workload: llm
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-burst
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: [amd64]
        - key: kubernetes.io/os
          operator: In
          values: [linux]
        - key: karpenter.sh/capacity-type
          operator: In
          values: [on-demand]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: [g5]
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: In
          values: ['1']
  limits:
    cpu: 128
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 2m

This is the moment the stack becomes “serverless.” There is still no GPU node running. Karpenter will wait until a real pod requests nvidia.com/gpu.
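To see why the taint keeps general workloads off these nodes, here is a minimal Python sketch of the toleration-matching rule as the Kubernetes documentation describes it. It is a simplified model for illustration, not the scheduler's code, and the names `tolerates` and `can_land` are made up for this example.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if one toleration matches one taint (simplified Kubernetes rule)."""
    operator = toleration.get("operator", "Equal")  # Equal is the default operator
    key = toleration.get("key", "")
    if key == "":
        # An empty key is only meaningful with Exists, where it matches every key.
        if operator != "Exists":
            return False
    elif key != taint["key"]:
        return False
    # An empty effect on the toleration matches every taint effect.
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def can_land(pod_tolerations: list, node_taints: list) -> bool:
    """A pod can schedule only if every hard taint is matched by some toleration."""
    hard = [t for t in node_taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in pod_tolerations) for taint in hard)


# The taint from the gpu-burst NodePool above.
GPU_TAINT = {"key": "nvidia.com/gpu", "effect": "NoSchedule"}
# A toleration an inference pod would need to carry to land on those nodes.
GPU_TOLERATION = {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}

print(can_land([GPU_TOLERATION], [GPU_TAINT]))  # True
print(can_land([], [GPU_TAINT]))                # False
```

A pod without the toleration is rejected by the taint, which is the isolation you want for an expensive pool.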

2. Install the NVIDIA device plugin

vLLM can request GPUs only after Kubernetes knows how to expose them. NVIDIA's official k8s-device-plugin chart is the fastest path.

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin --version 0.17.1 --namespace nvidia-device-plugin --create-namespace --set gfd.enabled=true
kubectl get nodes -L nvidia.com/gpu.present,nvidia.com/gpu.product

If you later decide to use time-slicing, do it only after you have latency numbers for single-tenant GPUs. Shared GPU capacity can improve cost efficiency, but it can also make p95 and p99 inference behavior much harder to reason about.

3. Deploy vLLM with health checks and metrics

vLLM already exposes the pieces you need: /health for probes and /metrics for Prometheus. The deployment below starts at 0 replicas so you only pay when traffic arrives.

kubectl create namespace llm
kubectl create secret generic hf-token -n llm --from-literal=HF_TOKEN="$HF_TOKEN"

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  namespace: llm
spec:
  replicas: 0
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: server
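          image: vllm/vllm-openai
          # Set the command explicitly so the args below run as `vllm serve ...`
          # regardless of the image's default entrypoint.
          command: ["vllm"]
          args: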
            - serve
            - meta-llama/Llama-3.1-8B-Instruct
            - --served-model-name
            - llama-8b
            - --host
            - 0.0.0.0
            - --port
            - '8000'
            - --gpu-memory-utilization
            - '0.92'
            - --max-model-len
            - '8192'
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: HF_TOKEN
          ports:
            - containerPort: 8000
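          # Model download and load can take minutes. The startup probe holds off
          # the liveness probe for up to 10 minutes so the pod is not restarted mid-load.
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 10
            failureThreshold: 60
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
          livenessProbe:
            httpGet:
              path: /health
              port: 8000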
          resources:
            requests:
              cpu: '6'
              memory: 24Gi
              nvidia.com/gpu: '1'
            limits:
              cpu: '6'
              memory: 24Gi
              nvidia.com/gpu: '1'
---
apiVersion: v1
kind: Service
metadata:
  name: vllm
  namespace: llm
  labels:
    app: vllm
spec:
  selector:
    app: vllm
  ports:
    - name: http
      port: 8000
      targetPort: 8000
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm
  namespace: llm
spec:
  selector:
    matchLabels:
      app: vllm
  endpoints:
    - port: http
      path: /metrics
      interval: 15s

The most important metric for burst scaling is vllm:num_requests_waiting. It directly measures queued work, which is much more useful than CPU percentage for token generation systems.
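If you want to eyeball the queue signal before wiring it into an autoscaler, a few lines of Python can pull it out of the /metrics text. This is a minimal sketch: `queue_depth` is a hypothetical helper, and the sample payload is hand-written for illustration rather than captured from a real server.

```python
def queue_depth(metrics_text: str, metric: str = "vllm:num_requests_waiting") -> float:
    """Sum every sample of `metric` found in Prometheus text-format output."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The sample name ends at the label set "{" or at the first space.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name != metric:
            continue
        # The value follows the closing "}" (or the bare name when there are no labels).
        rest = line.rsplit("}", 1)[-1] if "}" in line else line[len(name):]
        total += float(rest.split()[0])
    return total


# Hand-written sample in the Prometheus text format (illustrative only).
SAMPLE = """\
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="llama-8b"} 7.0
vllm:num_requests_running{model_name="llama-8b"} 4.0
"""

print(queue_depth(SAMPLE))  # 7.0
```

Summing across samples mirrors the `sum(...)` you would use in PromQL when several replicas each report their own queue.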

4. Install KEDA and scale on queue depth

KEDA will watch Prometheus and drive the Deployment replica count. One pending replica is enough to wake Karpenter and trigger GPU node creation.

helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-queue
  namespace: llm
spec:
  scaleTargetRef:
    name: vllm
  minReplicaCount: 0
  maxReplicaCount: 8
  pollingInterval: 15
  cooldownPeriod: 180
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-service.monitoring.svc.cluster.local:9090
        query: 'sum(vllm:num_requests_waiting{pod=~"vllm-.*"})'
        threshold: '1'
        activationThreshold: '1'

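Replace serverAddress with the DNS name of your Prometheus service. Also check the exact target labels your Prometheus scrape adds; if you do not have a pod label, change the selector in the query to whatever label uniquely identifies the vLLM target. One caveat for scale-from-zero: vllm:num_requests_waiting is exported by the vLLM pods themselves, so at 0 replicas the series does not exist and the query returns nothing. To wake from zero you need a demand signal that exists without a running pod, such as a request or queue metric from the gateway in front of the service; otherwise keep one replica running.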
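To make the trigger settings concrete, here is a rough Python model of how the replica count follows the queue. It assumes the Prometheus trigger is evaluated as an average-value target (the queue total divided by the threshold, rounded up) and that KEDA activates from zero only when the value is strictly greater than activationThreshold. It ignores HPA tolerance and stabilization windows, and `desired_replicas` is a made-up name, so treat it as a sketch rather than the autoscaler's actual algorithm.

```python
import math


def desired_replicas(queue_total: float, threshold: float = 1.0,
                     activation_threshold: float = 1.0,
                     current: int = 0, max_replicas: int = 8) -> int:
    """Approximate replica target for the ScaledObject values used above."""
    if current == 0:
        # 0 -> 1 is KEDA's decision: activate only when the value exceeds
        # the activation threshold (strictly greater, not equal).
        return 1 if queue_total > activation_threshold else 0
    # 1 -> N follows HPA average-value math: ceil(total / per-replica target).
    wanted = math.ceil(queue_total / threshold)
    # The HPA never goes below 1; the final drop to 0 is KEDA's job
    # once the trigger has been inactive for cooldownPeriod.
    return max(1, min(max_replicas, wanted))


print(desired_replicas(1, current=0))   # 0: one waiting request does not exceed 1
print(desired_replicas(2, current=0))   # 1
print(desired_replicas(5, current=1))   # 5
print(desired_replicas(40, current=1))  # 8: capped by maxReplicaCount
```

Under this model, an activationThreshold of '1' needs at least two queued requests to wake the deployment. If a single waiting request should trigger scale-up, a lower activation value is worth testing.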

Pro tip: Keep minReplicaCount at 0 in dev and non-critical APIs, but consider 1 warm replica for customer-facing latency paths where cold starts are unacceptable.
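Before generating load, it is worth checking that maxReplicaCount and the NodePool's cpu limit agree with each other. The sketch below assumes one GPU per pod and one GPU per node, uses the single-GPU g5 sizes as best recalled from AWS's published specs (confirm them against the current instance type pages), and ignores kubelet and DaemonSet overhead. Karpenter actually chooses by price and availability, so "smallest fit" is only an approximation; `smallest_fit` and `fleet_vcpus` are hypothetical helpers.

```python
# Single-GPU g5 sizes as (name, vCPUs, memory in GiB). Recalled from AWS's
# published specs; verify before relying on them.
G5_SINGLE_GPU = [
    ("g5.xlarge", 4, 16),
    ("g5.2xlarge", 8, 32),
    ("g5.4xlarge", 16, 64),
    ("g5.8xlarge", 32, 128),
    ("g5.16xlarge", 64, 256),
]


def smallest_fit(cpu_request: float, memory_gib_request: float, catalog=G5_SINGLE_GPU):
    """Return the first (smallest) size whose raw capacity covers the pod request."""
    for name, vcpus, mem_gib in catalog:
        if vcpus >= cpu_request and mem_gib >= memory_gib_request:
            return name, vcpus, mem_gib
    raise ValueError("no size in the catalog fits this request")


def fleet_vcpus(max_replicas: int, vcpus_per_node: int,
                gpus_per_pod: int = 1, gpus_per_node: int = 1) -> int:
    """vCPUs consumed by the nodes needed to host max_replicas pods."""
    pods_per_node = gpus_per_node // gpus_per_pod
    nodes = -(-max_replicas // pods_per_node)  # ceiling division
    return nodes * vcpus_per_node


# Pod requests cpu: 6 and memory: 24Gi; NodePool limit is cpu: 128.
name, vcpus, _ = smallest_fit(6, 24)
total = fleet_vcpus(max_replicas=8, vcpus_per_node=vcpus)
print(name, total, total <= 128)  # g5.2xlarge 64 True
```

With these assumptions, eight replicas stay well inside the 128-vCPU limit. If Karpenter picks larger sizes during a capacity shortage, the same limit is reached sooner.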

5. Generate concurrent load

Now trigger a burst and watch the chain reaction: KEDA raises replicas, Kubernetes marks the pod pending, Karpenter creates a GPU node, the NVIDIA plugin advertises the device, and the pod turns ready.

# Run the port-forward in a separate terminal; it blocks until interrupted.
kubectl -n llm port-forward svc/vllm 8000:8000
for i in $(seq 1 40); do
  curl -s http://127.0.0.1:8000/v1/completions -H 'Content-Type: application/json' -d '{"model":"llama-8b","prompt":"Explain serverless GPU autoscaling in one sentence.","max_tokens":64}' > /dev/null &
done
wait

For a production gateway, put an internal or public load balancer in front of the service and point clients at the OpenAI-compatible endpoint exposed by vLLM.
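If you would rather drive the burst from Python and see how many requests succeeded, the curl loop above translates roughly as follows. This is a sketch using only the standard library; it assumes the same port-forward address and request body as the curl command, and `build_request`, `send_one`, and `fire_burst` are illustrative names.

```python
import json
import urllib.request
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://127.0.0.1:8000"  # same port-forward address as the curl loop


def build_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build one POST to the OpenAI-compatible completions endpoint."""
    body = {
        "model": "llama-8b",
        "prompt": "Explain serverless GPU autoscaling in one sentence.",
        "max_tokens": 64,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_one(_: int) -> int:
    """Send one request; return the HTTP status, or -1 on URL, HTTP, or socket errors."""
    try:
        with urllib.request.urlopen(build_request(), timeout=120) as resp:
            return resp.status
    except OSError:
        return -1


def fire_burst(n: int = 40) -> dict:
    """Send n requests concurrently and count the outcomes by status code."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        return dict(Counter(pool.map(send_one, range(n))))
```

Calling `fire_burst(40)` while the port-forward is running returns a count per status code, which makes failed or refused requests visible instead of discarding them to /dev/null.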

Verification and expected output

You are looking for evidence that both scaling loops fired in the right order. These checks are faster than staring at a dashboard and guessing.

kubectl get scaledobject -n llm
kubectl get deploy,pods -n llm -w
kubectl get nodeclaims
kubectl get nodes -l karpenter.sh/nodepool=gpu-burst

Expected results:

  • The ScaledObject becomes active shortly after the burst and increases desired replicas above 0.
  • A new NodeClaim appears, followed by a fresh GPU node in the gpu-burst NodePool.
  • The vLLM pod transitions from Pending to Running once the node is ready and the model finishes loading.
  • After traffic drains and the cooldown expires, replicas fall back to 0 and Karpenter consolidates the empty GPU node.

NAME         READY   ACTIVE   FALLBACK   PAUSED   TRIGGERS
vllm-queue   True    True     False      False    prometheus

NAME                    READY   STATUS    RESTARTS   AGE
vllm-7f8c9c9d6b-2kqxh   1/1     Running   0          3m

If your metrics path is correct, you should also see vllm:num_requests_waiting spike during the burst and decay back toward zero as the backlog clears.
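The timing of these transitions is simple arithmetic, and it helps to write it down before judging whether the cluster is behaving. The sketch below uses the cooldownPeriod (180 s) and consolidateAfter (2m) values from the manifests above to get a lower bound on how long an idle GPU node lingers. The cold-start stage durations are hypothetical placeholders that you should replace with your own measurements.

```python
def idle_floor_s(cooldown_period_s: int, consolidate_after_s: int) -> int:
    """Lower bound on seconds an idle GPU node stays up after the last active trigger.

    KEDA waits cooldown_period_s before dropping to 0 replicas, then Karpenter
    waits consolidate_after_s once the node is empty. Node termination time and
    any HPA scale-down stabilization come on top of this.
    """
    return cooldown_period_s + consolidate_after_s


def cold_start_s(polling_interval_s: int, *stage_seconds: int) -> int:
    """Worst-case seconds from first queued request to a ready pod.

    One polling interval can pass before KEDA notices, then each stage
    (node provisioning, image pull, model load, ...) runs in sequence.
    """
    return polling_interval_s + sum(stage_seconds)


print(idle_floor_s(180, 120))  # 300

# Hypothetical stage durations in seconds, NOT measurements:
# node provisioning, image pull, model load.
print(cold_start_s(15, 90, 120, 60))  # 285
```

If the measured cold start is larger than your latency budget, that is the signal to keep a warm replica rather than tune the autoscaler further.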

Troubleshooting

  1. No GPU node appears. Run kubectl describe pod on the pending vLLM pod first, then check kubectl describe nodeclaim and the Karpenter controller logs. Exceeded quotas, subnet or security group selectors that match nothing, or no matching AMI all block Karpenter before it ever reaches EC2 capacity.
  2. The pod schedules but never becomes ready. That is usually a model pull or VRAM problem. Verify the Hugging Face token secret, then inspect logs for out-of-memory messages and lower --max-model-len or choose a smaller model.
  3. KEDA never scales. Query Prometheus directly and confirm the metric exists with the labels used in your trigger. Most failed KEDA setups are not autoscaler bugs; they are bad serverAddress values or PromQL filters that match nothing.
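For the third case, querying Prometheus directly is the quickest way to tell a bad address from a filter that matches nothing. Here is a minimal Python sketch against the Prometheus HTTP API's instant-query endpoint (/api/v1/query). The helper names are made up for this example, and `run_query` needs network access to your Prometheus service.

```python
import json
import urllib.parse
import urllib.request


def query_url(server_address: str, promql: str) -> str:
    """Build the instant-query URL for a PromQL expression."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{server_address.rstrip('/')}/api/v1/query?{params}"


def parse_vector(payload: dict) -> list:
    """Turn an instant-vector response into a list of (labels, value) pairs."""
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus returned an error: {payload.get('error')}")
    # Each sample looks like {"metric": {...}, "value": [timestamp, "string value"]}.
    return [(s["metric"], float(s["value"][1])) for s in payload["data"]["result"]]


def run_query(server_address: str, promql: str) -> list:
    """Execute the query over HTTP; a connection error here points at serverAddress."""
    with urllib.request.urlopen(query_url(server_address, promql), timeout=10) as resp:
        return parse_vector(json.load(resp))


# Hand-written response with no matching series (illustrative only).
EMPTY = {"status": "success", "data": {"resultType": "vector", "result": []}}
print(parse_vector(EMPTY))  # []
```

An empty list means the query ran but matched no series, so the label selector or metric name is the problem. A connection error from `run_query` means the serverAddress is wrong or unreachable from where you are running it.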

If you need to share manifests or logs outside your org, scrub tokens, account IDs, and service names first. TechBytes' Data Masking Tool is useful for that cleanup step.

What's next

Once the baseline path is stable, improve efficiency deliberately instead of all at once.

  • Add a second NodePool for spot and route only background or non-SLO traffic there.
  • Split models by traffic class so small, fast models do not queue behind large, slow ones on the same GPU fleet.
  • Introduce time-slicing only after measuring how it changes tail latency for your prompts and decode lengths.
  • Pin AMIs, model images, and autoscaler versions in CI so scale events remain reproducible during incident response.
  • Move from a single deployment to multiple pools when you need tenant isolation, regional sharding, or mixed GPU families.
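Several of these steps depend on comparing tail latency before and after a change. A minimal way to do that from a list of per-request latencies is a nearest-rank percentile, sketched below in Python. The `percentile` helper is illustrative, and the sample data is synthetic, not measured.

```python
import math


def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies of 1..100 ms, only to show the mechanics.
latencies_ms = list(range(1, 101))
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95), percentile(latencies_ms, 99))
# 50 95 99
```

Record p95 and p99 for the single-tenant, on-demand baseline first, then rerun the same load after each change so the comparison is like for like.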

The core lesson is that “serverless GPUs” are really disciplined scheduling. Treat queue depth as the demand signal, keep node creation separate from replica creation, and the cluster stays understandable even when concurrency spikes hard.

Frequently Asked Questions

Can KEDA really scale GPU inference down to zero?
Yes. KEDA scales the Deployment replica count, not the node count, so setting minReplicaCount: 0 is valid. When no pods need a GPU, Karpenter can consolidate the empty node and remove the last GPU instance.
Why use vLLM queue metrics instead of CPU for autoscaling?
CPU is a weak signal for LLM serving because token generation, batching, and KV-cache behavior can keep latency high even when CPU is not saturated. vllm:num_requests_waiting measures backlog directly, so it maps much better to user-visible contention.
How do I reduce cold-start latency in a serverless GPU cluster?
The simplest fix is to keep one warm replica during business hours or for premium traffic. You can also pre-pull images, pin smaller startup models, and reduce model load time by avoiding oversized context windows or unnecessary adapters.
Should I enable NVIDIA time-slicing for LLM inference?
Only after you have a clean single-tenant baseline. Time-slicing can improve utilization, but it also introduces more noisy-neighbor behavior and can stretch p95 and p99 latency for long decode workloads.
