[Deep Dive] Orchestrating Ephemeral GPU Clusters for AI

Dillip Chowdary · Tech Entrepreneur & Innovator · May 08, 2026 · 12 min read

Bottom Line

By leveraging Terraform-driven ephemeral GPU clusters and Kubernetes-managed spot instances, engineering teams can cut AI training costs by up to 90% while retaining the ability to scale to thousands of H100s on demand.

Key Takeaways

  • Leverage Terraform for reproducible infrastructure as code (IaC) to avoid configuration drift in GPU nodes.
  • Utilize Kubernetes Spot Instances with graceful termination handlers to slash training costs significantly.
  • Deploy the NVIDIA Device Plugin and GPU Feature Discovery to automate resource allocation at the pod level.
  • Implement Prometheus and DCGM Exporter to monitor GPU utilization and prevent 'zombie' nodes from inflating bills.

As AI model sizes continue to explode, the cost of maintaining 24/7 GPU availability has become the single largest line item for many engineering organizations. The solution lies in ephemeral orchestration: spinning up massive compute clusters exactly when a training job starts and tearing them down the moment it finishes. In this guide, we will walk through the architecture of a production-grade, on-demand GPU cluster using Terraform, Kubernetes (EKS/GKE), and the NVIDIA software stack to achieve maximum throughput with minimum waste.

Prerequisites & Environment

Required Stack

  • Cloud Provider: AWS, GCP, or Azure account with high-limit GPU quotas (e.g., p4d or a2 series).
  • Tools: Terraform v1.7+, kubectl, and Helm v3 installed locally.
  • IAM Permissions: Administrative access to create VPCs, EKS/GKE clusters, and Auto Scaling Groups.
  • CLI Config: Configured AWS CLI or gcloud SDK with active credentials.
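
Before touching Terraform, it is worth confirming the local toolchain and credentials are in place. A quick sanity check, assuming an AWS-based setup (substitute gcloud or az for other providers):

terraform version            # expect v1.7 or newer
kubectl version --client     # client binary only; the cluster does not exist yet
helm version --short         # expect v3.x
aws sts get-caller-identity  # confirms the active account and credentials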

Step 1: Provisioning IaC Foundations

The first step is defining the control plane. Unlike standard web clusters, GPU clusters require specific machine images (AMIs) pre-baked with CUDA and a GPU-enabled Docker/containerd runtime. We use Terraform to define a VPC and a Kubernetes control plane. Before committing your infrastructure code, run terraform fmt so it stays consistently formatted.

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "ai-training-cluster"
  cluster_version = "1.29"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_groups = {
    cpu_ops = {
      instance_types = ["m5.xlarge"]
      min_size     = 2
      max_size     = 5
    }
  }
}
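
With the module defined, provisioning follows the standard Terraform workflow. A typical run, assuming the module above sits in the current working directory alongside a matching VPC module:

terraform init                      # downloads the VPC and EKS modules
terraform plan -out=cluster.tfplan  # review the resources before paying for them
terraform apply cluster.tfplan      # control plane creation typically takes 10-15 minutes
aws eks update-kubeconfig --name ai-training-cluster  # point kubectl at the new cluster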

Bottom Line

Static GPU clusters are a legacy cost-sink. The modern standard is a Warm Control Plane with Cold Node Groups—only provisioning expensive H100 or A100 capacity when the scheduler detects a pending GPU-requested job.

Step 2: Configuring Auto-scaling with Spot Instances

To maximize cost savings, we must configure Cluster Autoscaler or Karpenter to recognize GPU requirements and provision Spot Instances. Spot instances can offer up to 90% savings, but they can be reclaimed with as little as two minutes' notice on AWS, so preemptions must be handled gracefully (see the termination-handler install after the node group definition below).

eks_managed_node_groups = {
  gpu_spot = {
    instance_types = ["p3.2xlarge", "p3.8xlarge"]
    capacity_type  = "SPOT"
    
    min_size     = 0
    max_size     = 50
    desired_size = 0

    labels = {
      "hardware-type" = "nvidia-gpu"
    }

    taints = [
      {
        key    = "nvidia.com/gpu"
        value  = "true"
        effect = "NO_SCHEDULE"
      }
    ]
  }
}
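
Spot capacity can be reclaimed on short notice, so the node group needs a drainer that reacts to the interruption warning. One common option on AWS is the aws-node-termination-handler chart; a minimal install sketch with near-default settings (Karpenter handles spot interruptions natively, so skip this if you use it):

helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm install aws-node-termination-handler eks/aws-node-termination-handler \
  --namespace kube-system \
  --set enableSpotInterruptionDraining=true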

Step 3: The NVIDIA Software Stack

Kubernetes does not natively "see" GPUs. You must install the NVIDIA Device Plugin. This allows the Kubelet to advertise GPU resources to the API server. We recommend using Helm for this deployment to manage updates easily.

  • NVIDIA Device Plugin: Exposes GPU cores to the pod scheduler.
  • GPU Feature Discovery: Automatically labels nodes with CUDA version, driver version, and GPU model.
  • DCGM Exporter: Necessary for Prometheus to scrape hardware-level metrics like temperature and power draw.

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --set failOnInitError=false
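
The DCGM Exporter installs the same way. A sketch assuming the chart published under NVIDIA's dcgm-exporter Helm repository (on recent device-plugin chart versions, GPU Feature Discovery can be enabled alongside the plugin rather than installed separately):

helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace kube-system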

Step 4: Orchestrating the Training Job

When deploying a training job, you must specify the exact resource requests. Kubernetes will see the NO_SCHEDULE taint on the GPU nodes and only place pods there if the YAML includes the corresponding toleration.

apiVersion: batch/v1
kind: Job
metadata:
  name: resnet-training
spec:
  template:
    spec:
      containers:
      - name: training-container
        image: nvcr.io/nvidia/pytorch:24.01-py3
        resources:
          limits:
            nvidia.com/gpu: 1 # Requesting 1 GPU
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      restartPolicy: OnFailure

Verification & Expected Output

Once the job is submitted, the Cluster Autoscaler will trigger. Expect a 3-5 minute delay as the cloud provider provisions the bare-metal or virtualized GPU node. You can verify the hardware is being utilized by exec-ing into the pod and running nvidia-smi.
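
For example, assuming the node label from Step 2 and the Job defined in Step 4:

kubectl get nodes -l hardware-type=nvidia-gpu    # the new spot node should register within a few minutes
kubectl get pods -l job-name=resnet-training     # pod moves from Pending to Running once the node is ready
kubectl exec -it $(kubectl get pods -l job-name=resnet-training -o name | head -n 1) -- nvidia-smi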

Pro tip: Always use checkpoints in your training code (e.g., PyTorch Lightning's ModelCheckpoint). Since you are running on ephemeral spot instances, a preemption can happen at any time; resuming from a checkpoint stored in an S3/GCS bucket ensures you don't lose progress.

Troubleshooting: Top 3 Issues

  1. Insufficient Instance Quota: If your nodes are stuck in Pending, check your Cloud Console for vCPU or GPU limits. Most providers default new accounts to 0 for high-end instances.
  2. Driver/Kernel Mismatch: If nvidia-smi fails with a "Driver/library version mismatch", ensure your AMI version matches the NVIDIA Device Plugin requirements. Using Bottlerocket or Ubuntu Optimized AMIs usually solves this.
  3. Taint Misconfiguration: If the pod stays Pending even when nodes are available, verify that the tolerations in your Job YAML exactly match the taints applied to the node group in Terraform; a quick way to compare them is shown below.
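
A quick way to compare the two sides, assuming the node label and Job name used earlier in this guide:

kubectl get nodes -l hardware-type=nvidia-gpu -o jsonpath='{.items[*].spec.taints}'   # taints on the GPU nodes
kubectl get job resnet-training -o jsonpath='{.spec.template.spec.tolerations}'       # tolerations on the Job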

What's Next: Multi-Instance GPU (MIG)

For organizations running smaller models, an A100 or H100 is often overkill. Multi-Instance GPU (MIG) allows you to partition a single physical GPU into up to 7 hardware-isolated instances. In the next chapter of this series, we will look at how to automate MIG partitioning using the NVIDIA GPU Operator to squeeze even more value out of your ephemeral clusters.

Frequently Asked Questions

How do I prevent data transfer bottlenecks in ephemeral GPU clusters?
Use a high-performance shared file system like Amazon FSx for Lustre or GCP Filestore. These are optimized for sub-millisecond latency and can feed data to GPUs fast enough to keep utilization above 90%.

Which cloud provider has the best GPU availability for spot instances?
Currently, GCP and Azure tend to have better availability for A100s on the spot market, while AWS is the leader for H100 capacity through its EC2 Capacity Blocks, though these are not strictly 'spot'.

Can I use Docker Compose instead of Kubernetes for ephemeral GPUs?
For single-node training, Docker Compose with the NVIDIA Container Toolkit is fine; a minimal single-node example follows below. However, for multi-node orchestration and automatic cost scaling, Kubernetes is the industry standard.
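
For that single-node case, a minimal smoke test with the NVIDIA Container Toolkit installed might look like this (the image tag is only illustrative):

docker run --rm --gpus all nvcr.io/nvidia/pytorch:24.01-py3 nvidia-smi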
