[Deep Dive] Orchestrating Ephemeral GPU Clusters for AI
Bottom Line
By leveraging Terraform-driven ephemeral GPU clusters and Kubernetes spot instances, engineering teams can reduce AI training costs by up to 90% while maintaining the ability to scale to thousands of H100s on demand.
Key Takeaways
- Leverage Terraform for reproducible infrastructure as code (IaC) to avoid configuration drift in GPU nodes.
- Utilize Kubernetes Spot Instances with graceful termination handlers to slash training costs significantly.
- Deploy the NVIDIA Device Plugin and GPU Feature Discovery to automate resource allocation at the pod level.
- Implement Prometheus and the DCGM Exporter to monitor GPU utilization and prevent "zombie" nodes from inflating bills.
As AI model sizes continue to explode, the cost of maintaining 24/7 GPU availability has become the single largest line item for engineering organizations. The solution lies in ephemeral orchestration—spinning up massive compute clusters exactly when a training job starts and tearing them down the microsecond it finishes. In this guide, we will walk through the architecture of a production-grade, on-demand GPU cluster using Terraform, Kubernetes (EKS/GKE), and NVIDIA drivers to achieve maximum throughput with minimum waste.
Prerequisites & Environment
Required Stack
- Cloud Provider: AWS, GCP, or Azure account with high-limit GPU quotas (e.g., p4d or a2 series).
- Tools: Terraform v1.7+, kubectl, and Helm v3 installed locally.
- IAM Permissions: Administrative access to create VPCs, EKS/GKE clusters, and Auto Scaling Groups.
- CLI Config: Configured AWS CLI or gcloud SDK with active credentials.
Step 1: Provisioning IaC Foundations
The first step is defining the control plane. Unlike standard web clusters, GPU clusters require specific machine images (AMIs) pre-baked with CUDA and a GPU-enabled container runtime (Docker or containerd). We use Terraform to define a VPC and a Kubernetes control plane. Before committing your infrastructure code, run `terraform fmt` and `terraform validate` to keep it consistently formatted and syntactically sound.
```hcl
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "ai-training-cluster"
  cluster_version = "1.29"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_groups = {
    cpu_ops = {
      instance_types = ["m5.xlarge"]
      min_size       = 2
      max_size       = 5
    }
  }
}
```
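The EKS module above references `module.vpc`, which is not shown in this guide. A minimal sketch using the community `terraform-aws-modules/vpc/aws` module might look like the following; the name, CIDR ranges, and availability zones are illustrative assumptions, not values prescribed here:

```hcl
# Hypothetical VPC backing module.vpc above; CIDRs and AZs are placeholders.
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "ai-training-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"]

  # GPU nodes live in private subnets and reach the internet (image pulls,
  # driver downloads) through a NAT gateway.
  enable_nat_gateway = true
  single_nat_gateway = true
}
```

Placing GPU nodes in private subnets keeps them off the public internet while the NAT gateway still allows outbound pulls of CUDA base images.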
Bottom Line
Static GPU clusters are a legacy cost-sink. The modern standard is a Warm Control Plane with Cold Node Groups—only provisioning expensive H100 or A100 capacity when the scheduler detects a pending GPU-requested job.
Step 2: Configuring Auto-scaling with Spot Instances
To maximize cost savings, we must configure Cluster Autoscaler or Karpenter to recognize GPU requirements and provision Spot Instances. Spot instances can offer up to 90% savings but require handling potential preemptions.
```hcl
eks_managed_node_groups = {
  gpu_spot = {
    instance_types = ["p3.2xlarge", "p3.8xlarge"]
    capacity_type  = "SPOT"
    min_size       = 0
    max_size       = 50
    desired_size   = 0

    labels = {
      "hardware-type" = "nvidia-gpu"
    }

    taints = [
      {
        key    = "nvidia.com/gpu"
        value  = "true"
        effect = "NO_SCHEDULE"
      }
    ]
  }
}
```
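Handling preemptions means reacting to the interruption notice (roughly two minutes on AWS) with a cordon-and-drain rather than a hard kill. On EKS, one option is the AWS Node Termination Handler from the `eks-charts` Helm repository; this is a sketch assuming the chart's default values, with only the Spot-draining flag set explicitly:

```shell
# Sketch: install the AWS Node Termination Handler so Spot interruption
# notices trigger a graceful cordon-and-drain of the affected GPU node.
helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm install aws-node-termination-handler eks/aws-node-termination-handler \
  --namespace kube-system \
  --set enableSpotInterruptionDraining=true
```

Draining only buys time; training jobs should still checkpoint regularly so a preempted pod can resume on a replacement node instead of restarting from scratch.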
Step 3: The NVIDIA Software Stack
Kubernetes does not natively "see" GPUs. You must install the NVIDIA Device Plugin. This allows the Kubelet to advertise GPU resources to the API server. We recommend using Helm for this deployment to manage updates easily.
- NVIDIA Device Plugin: Exposes GPU cores to the pod scheduler.
- GPU Feature Discovery: Automatically labels nodes with CUDA version, driver version, and GPU model.
- DCGM Exporter: Necessary for Prometheus to scrape hardware-level metrics like temperature and power draw.
```shell
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --set failOnInitError=false
```
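The install above covers the Device Plugin; the DCGM Exporter from the list ships as its own chart. A sketch, assuming default chart values, followed by a check that the Device Plugin is actually advertising GPUs to the scheduler:

```shell
# Sketch: deploy the DCGM Exporter so Prometheus can scrape GPU metrics.
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace kube-system

# Verify that nodes report allocatable nvidia.com/gpu resources.
kubectl get nodes -o custom-columns="NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu"
```

A node showing `<none>` in the GPUS column either has no GPU hardware or a Device Plugin pod that failed to start; check the plugin's DaemonSet logs in `kube-system` before debugging drivers.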
Step 4: Orchestrating the Training Job
When deploying a training job, you must specify the exact resource requests. Kubernetes will see the NO_SCHEDULE taint on the GPU nodes and only place pods there if the YAML includes the corresponding toleration.
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: resnet-training
spec:
  template:
    spec:
      containers:
        - name: training-container
          image: nvcr.io/nvidia/pytorch:24.01-py3
          resources:
            limits:
              nvidia.com/gpu: 1 # Requesting 1 GPU
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      restartPolicy: OnFailure
```
Verification & Expected Output
Once the job is submitted, the Cluster Autoscaler will trigger. Expect a 3-5 minute delay as the cloud provider provisions the bare-metal or virtualized GPU node. You can verify the hardware is being utilized by exec-ing into the pod and running nvidia-smi.
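The checks described above can be scripted; these commands assume the `resnet-training` Job from Step 4 and rely on the `job-name` label that Kubernetes adds to Job pods automatically:

```shell
# Watch the pod go from Pending to Running as the autoscaler adds a GPU node.
kubectl get pods -l job-name=resnet-training -w

# Recent cluster events, useful for spotting scale-up activity or failures.
kubectl get events --sort-by=.lastTimestamp

# Once Running, confirm the GPU is visible and under load inside the pod.
POD=$(kubectl get pods -l job-name=resnet-training -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it "$POD" -- nvidia-smi
```

If `nvidia-smi` shows the GPU at 0% utilization while the job runs, the workload is likely CPU- or I/O-bound rather than misconfigured at the scheduling layer.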
Troubleshooting Top-3 Issues
- Insufficient Instance Quota: If your nodes are stuck in `Pending`, check your Cloud Console for vCPU or GPU limits. Most providers default new accounts to 0 for high-end instances.
- Driver/Kernel Mismatch: If `nvidia-smi` fails with a "Driver/library version mismatch", ensure your AMI version matches the NVIDIA Device Plugin requirements. Using Bottlerocket or Ubuntu Optimized AMIs usually solves this.
- Taint Misconfiguration: If the pod stays `Pending` even when nodes are available, verify that the `tolerations` in the Job YAML exactly match the `taints` applied to the node group in Terraform.
What's Next: Multi-Instance GPU (MIG)
For organizations running smaller models, an A100 or H100 is often overkill. Multi-Instance GPU (MIG) allows you to partition a single physical GPU into up to 7 hardware-isolated instances. In the next chapter of this series, we will look at how to automate MIG partitioning using the NVIDIA GPU Operator to squeeze even more value out of your ephemeral clusters.