Cloud Infrastructure

Multi-Cloud Arbitrage: 40% Infra Savings [2026]

Dillip Chowdary
Tech Entrepreneur & Innovator · May 01, 2026 · 8 min read

Bottom Line

For stateless, queue-backed workloads, a small arbitrage controller can compare AWS Spot and Google Cloud Spot economics, move replicas to the cheaper cluster, and cut infrastructure spend without rewriting the app.

Key Takeaways

  • AWS Spot can be up to 90% cheaper; Google Cloud Spot discounts can reach 91%.
  • Use arbitrage only for stateless, queue-backed, interruption-tolerant workloads.
  • Add a 12% price delta threshold and a minimum hold window to stop relocation flapping.
  • Keep manifests portable and shift replicas between clusters instead of rebuilding images.
  • Verify success by checking winner/standby replica counts and rollout status in both clusters.

Multi-cloud arbitrage sounds exotic, but the mechanics are straightforward: run the same portable workload in two clouds, measure which one is cheaper right now, and move replicas only when the savings are meaningful. For interruption-tolerant services, this turns AWS Spot and Google Cloud Spot into a live market you can automate against. The key is not perfect prediction. The key is a conservative control loop that relocates only when price advantage is clear and operational risk stays low.

  • AWS Spot can deliver savings of up to 90% compared to On-Demand pricing.
  • Google Cloud Spot VMs can discount compute by up to 91%, with dynamic prices that can change up to once every 30 days.
  • Use this pattern for queue-backed workers and stateless services, not primary stateful systems.
  • Add hysteresis, a hold window, and identical manifests to keep relocations safe.

Prerequisites

  • An existing Amazon EKS cluster and an existing Google Kubernetes Engine cluster.
  • kubectl, AWS CLI, gcloud CLI, and bq authenticated against both environments.
  • A workload that tolerates interruption, restart, and cross-cloud IP changes.
  • Application state stored outside the Pod: queue, object storage, external database, or managed cache.
  • A billing export or pricing query you can safely share with teammates. If you circulate raw billing extracts, scrub account IDs first with the Data Masking Tool.

In this walkthrough, the workload is a queue worker. That is the ideal starting point because the relocation unit is simple: increase replicas in the cheaper cloud and scale the other side to zero.
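
What "tolerates interruption" means concretely: the worker should treat SIGTERM as "stop pulling new jobs, finish or requeue the current one," because Kubernetes sends SIGTERM on scale-down and then waits terminationGracePeriodSeconds before killing the Pod. Here is a minimal sketch of that loop; fetch_job, process, ack, and requeue are hypothetical stand-ins for your own queue client.

import signal

shutting_down = False

def handle_sigterm(signum, frame):
    # Kubernetes sends SIGTERM on scale-down, then waits
    # terminationGracePeriodSeconds before sending SIGKILL.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

while not shutting_down:
    job = fetch_job(timeout=5)   # hypothetical queue client call
    if job is None:
        continue
    try:
        process(job)             # should be idempotent: retries may re-run it
        ack(job)                 # acknowledge only after the work succeeded
    except Exception:
        requeue(job)             # hand the job back instead of losing it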

Step 1: Discover Price Signals

1. Connect both clusters

Start by making both Kubernetes contexts reachable from the same control host. On AWS, update-kubeconfig writes or merges the EKS context. On Google Cloud, get-credentials adds the GKE context.

aws eks update-kubeconfig \
  --name tb-eks \
  --region us-east-1 \
  --alias aws-spot

gcloud container clusters get-credentials tb-gke \
  --location=us-central1 \
  --project YOUR_PROJECT_ID

2. Pull AWS Spot history

AWS exposes recent Spot prices directly. For a controller, you usually care about one or two instance families that match your worker profile.

aws ec2 describe-spot-price-history \
  --region us-east-1 \
  --instance-types c7g.large \
  --product-descriptions Linux/UNIX \
  --start-time 2026-05-01T00:00:00Z \
  --end-time 2026-05-01T01:00:00Z \
  --output json > aws-price.json
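
The response is a list under SpotPriceHistory, with one entry per Availability Zone and timestamp. Before wiring anything into a controller, a throwaway inspection snippet like this shows the cheapest zone in the queried window:

import json
from collections import defaultdict
from pathlib import Path

data = json.loads(Path("aws-price.json").read_text())

# Track the lowest observed price per Availability Zone in the queried window.
cheapest = defaultdict(lambda: float("inf"))
for entry in data["SpotPriceHistory"]:
    zone = entry["AvailabilityZone"]
    cheapest[zone] = min(cheapest[zone], float(entry["SpotPrice"]))

for zone, price in sorted(cheapest.items(), key=lambda kv: kv[1]):
    print(f"{zone}: ${price}/hr")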

3. Pull current Google Cloud Spot pricing

Google Cloud recommends either the Cloud Billing Catalog API or Cloud Billing export. For repeatable automation, the pricing export is easier to query because it lands in a stable table named cloud_pricing_export.

bq --format=json query \
  --use_legacy_sql=false \
  'SELECT
     pricing_as_of_time,
     sku.description,
     tiered_rates.usd_amount AS usd_amount
   FROM `BILLING_EXPORT_DATASET.cloud_pricing_export`,
     UNNEST(list_price.tiered_rates) AS tiered_rates
   WHERE service.description = "Compute Engine"
     AND LOWER(sku.description) LIKE "%spot predefined instance core%"
     AND LOWER(sku.description) LIKE "%americas%"
   ORDER BY pricing_as_of_time DESC
   LIMIT 10' > gcp-price.json

This query is intentionally simple. Run it once, inspect the returned SKU descriptions, then tighten the filter for your machine family and region. The important point is architectural: use an official price feed, not a hand-maintained spreadsheet.
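
A quick way to sanity-check what matched: print the returned rows before trusting the number. This assumes the bq command above ran with --format=json, so the file contains a JSON array of row objects keyed by the selected column names:

import json
from pathlib import Path

rows = json.loads(Path("gcp-price.json").read_text())

# One line per returned row; read the descriptions, then tighten the LIKE filters.
for row in rows:
    print(row["pricing_as_of_time"], row["description"], row["usd_amount"])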

Step 2: Create a Portable Workload

The arbitrage controller only works if both clusters can run the exact same manifest. Keep the workload boring: no local disk assumptions, no cloud-specific load balancer requirement, and no session affinity.

apiVersion: v1
kind: Namespace
metadata:
  name: arbitrage-demo
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: queue-worker
  namespace: arbitrage-demo
spec:
  replicas: 0
  selector:
    matchLabels:
      app: queue-worker
  template:
    metadata:
      labels:
        app: queue-worker
    spec:
      terminationGracePeriodSeconds: 15
      containers:
      - name: worker
        image: REGISTRY/worker:TAG
        env:
        - name: QUEUE_URL
          value: https://queue.example.internal/jobs
        - name: WORKER_CONCURRENCY
          value: "8"
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "1Gi"

Apply the same manifest to both clusters.

kubectl --context=aws-spot apply -f workload.yaml
kubectl --context=gke_YOUR_PROJECT_ID_us-central1_tb-gke apply -f workload.yaml

Pro tip: Run one manifest in both clouds and let the controller change only replicas. That dramatically reduces the blast radius compared to changing image, config, and placement at the same time.

Step 3: Write the Arbitrage Controller

Now build a tiny controller. It reads both price feeds, applies a switching threshold, and pushes replicas to the winner. The example below uses a 12% savings threshold so you do not churn for pennies.

import json
import subprocess
from pathlib import Path

AWS_CONTEXT = "aws-spot"
GCP_CONTEXT = "gke_YOUR_PROJECT_ID_us-central1_tb-gke"
NAMESPACE = "arbitrage-demo"
DEPLOYMENT = "queue-worker"
DESIRED_REPLICAS = 4
SWITCH_DELTA = 0.12
STATE_FILE = Path("arbiter-state.json")


def load_aws_price():
    # Lowest recent Spot price for the instance type, in USD per instance-hour.
    data = json.loads(Path("aws-price.json").read_text())
    prices = [float(x["SpotPrice"]) for x in data["SpotPriceHistory"]]
    return min(prices)


def load_gcp_price():
    # Latest exported list price, assuming bq was run with --format=json.
    # Mind the units: this SKU is priced per core-hour, while the AWS number
    # is per instance-hour, so normalize both to the same unit (for example
    # USD per vCPU-hour) before comparing.
    rows = json.loads(Path("gcp-price.json").read_text())
    return float(rows[0]["usd_amount"])


def winner(aws_price, gcp_price, last_winner):
    if aws_price < gcp_price * (1 - SWITCH_DELTA):
        return "aws"
    if gcp_price < aws_price * (1 - SWITCH_DELTA):
        return "gcp"
    return last_winner or ("aws" if aws_price <= gcp_price else "gcp")


def scale(context, replicas):
    patch = {
        "spec": {
            "replicas": replicas
        }
    }
    subprocess.run([
        "kubectl", "--context", context,
        "-n", NAMESPACE,
        "patch", f"deployment/{DEPLOYMENT}",
        "--type=merge",
        "-p", json.dumps(patch)
    ], check=True)


def main():
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    aws_price = load_aws_price()
    gcp_price = load_gcp_price()
    target = winner(aws_price, gcp_price, state.get("winner"))

    if target == "aws":
        scale(AWS_CONTEXT, DESIRED_REPLICAS)
        scale(GCP_CONTEXT, 0)
    else:
        scale(GCP_CONTEXT, DESIRED_REPLICAS)
        scale(AWS_CONTEXT, 0)

    STATE_FILE.write_text(json.dumps({
        "winner": target,
        "aws_price": aws_price,
        "gcp_price": gcp_price
    }, indent=2))

    print(json.dumps({
        "winner": target,
        "aws_price": aws_price,
        "gcp_price": gcp_price
    }, indent=2))


if __name__ == "__main__":
    main()

This controller is intentionally minimal. It does not try to be smart about partial placement, GPU drift, or service mesh migration. It solves one valuable problem: move all replicas of a portable worker to the cheaper cloud only when the gap is large enough to matter.
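
One extension worth adding before production is the hold window from the guardrails in Step 4. A minimal sketch against the same state file: record when the winner last changed, and refuse another move inside the window.

import json
import time
from pathlib import Path

STATE_FILE = Path("arbiter-state.json")
HOLD_SECONDS = 4 * 3600  # keep the current winner for at least four hours

def allowed_to_move(new_target):
    # Permit a relocation only if the hold window has elapsed since the
    # last switch; re-asserting the current winner is always allowed.
    if not STATE_FILE.exists():
        return True
    state = json.loads(STATE_FILE.read_text())
    if state.get("winner") == new_target:
        return True
    return time.time() - state.get("switched_at", 0) >= HOLD_SECONDS

When a move does happen, write switched_at: time.time() into the state file next to winner so the next run can enforce the window.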

Step 4: Schedule and Run It

4. Run the control loop every 30 minutes

Do not reevaluate every minute. Both AWS and Google Cloud can change capacity characteristics faster than your workload should move. A 30-minute interval is a good starting point for batch workers.

*/30 * * * * cd /opt/arbitrage && /usr/bin/python3 arbiter.py >> /var/log/arbiter.log 2>&1
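
Cron will happily start a second run if a previous one stalls on a slow kubectl call, and two overlapping runs can scale the same Deployment in opposite directions. One way to serialize runs is a non-blocking file lock at the top of arbiter.py (Unix-only, via fcntl):

import fcntl
import sys

# Take an exclusive, non-blocking lock; exit quietly if a run is already active.
lock_file = open("/tmp/arbiter.lock", "w")
try:
    fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
    sys.exit(0)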

5. Add the three guardrails that stop most outages

  • A switching threshold: require at least 10-15% effective savings before moving.
  • A hold window: keep the current winner for a few hours unless the price gap becomes extreme.
  • Graceful shutdown: keep terminationGracePeriodSeconds realistic so workers finish or requeue jobs cleanly.

Watch out: If you point this at a stateful service, arbitrage can amplify failure instead of savings. Primary databases, sticky websocket sessions, and local-disk pipelines should not be your first relocation target.
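
A further refinement in the same spirit: gate the scale-down on the new side actually becoming healthy. Here is a sketch of how main() could sequence the move, reusing scale(), NAMESPACE, and DEPLOYMENT from Step 3; rollout status is a rough health proxy, not a full traffic check.

import subprocess

def wait_healthy(context, timeout="120s"):
    # True if the Deployment finished rolling out within the timeout.
    result = subprocess.run([
        "kubectl", "--context", context,
        "-n", NAMESPACE,
        "rollout", "status", f"deployment/{DEPLOYMENT}",
        f"--timeout={timeout}",
    ])
    return result.returncode == 0

def cutover(winner_ctx, loser_ctx):
    scale(winner_ctx, DESIRED_REPLICAS)
    if wait_healthy(winner_ctx):
        scale(loser_ctx, 0)   # drain only after the new side is ready
    else:
        scale(winner_ctx, 0)  # roll back and keep the incumbent serving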

Verification and Expected Output

Your first validation is simple: one cluster should hold live replicas, the other should be cold. Then confirm the active side completed rollout successfully.

python3 arbiter.py

kubectl --context=aws-spot -n arbitrage-demo get deployment queue-worker
kubectl --context=gke_YOUR_PROJECT_ID_us-central1_tb-gke -n arbitrage-demo get deployment queue-worker

kubectl --context=aws-spot -n arbitrage-demo rollout status deployment/queue-worker --timeout=120s
kubectl --context=gke_YOUR_PROJECT_ID_us-central1_tb-gke -n arbitrage-demo rollout status deployment/queue-worker --timeout=120s

Expected output looks like this:

{
  "winner": "gcp",
  "aws_price": 0.0284,
  "gcp_price": 0.0239
}

NAME           READY   UP-TO-DATE   AVAILABLE   AGE
queue-worker   0/0     0            0           12m

NAME           READY   UP-TO-DATE   AVAILABLE   AGE
queue-worker   4/4     4            4           12m

If that matches, the pattern is working. You now have a relocator that can drain one cloud and light up the other without rebuilding the app.

Troubleshooting Top 3

  • Problem: the workload flips clouds too often. Fix: raise the threshold from 12% to 15% and add a minimum hold time in the state file before another move is allowed.
  • Problem: Pods start in the target cloud but fail requests. Fix: check outbound firewall rules, queue reachability, DNS, and secrets parity before blaming the scheduler.
  • Problem: relocation succeeds but jobs are lost mid-run. Fix: make the worker idempotent, checkpoint progress, and requeue unacked work before shutting down the old side.

What's Next

  • Replace the simple price files with a real pipeline that joins price, interruption rate, and queue lag into a single placement score (a starter sketch follows this list).
  • Promote from active/cold to active/warm once you trust the economics and need faster cutover.
  • Add cloud-specific penalties instead of raw price comparison alone. AWS Spot can interrupt with two minutes of notice; Google Cloud Spot VMs get roughly 30 seconds before termination, so price is only part of the decision.
  • Clean up and standardize your scripts before handing them to platform teams. For quick formatting passes, the Code Formatter is useful for shell and Python snippets.
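
For the first bullet, the scoring function can start simple. A hedged sketch of the shape it might take; the weights and both non-price inputs are placeholders to calibrate against your own interruption and queue history:

def placement_score(price_per_vcpu_hour, interruption_rate, queue_lag_seconds):
    # Lower is better. Each point of observed interruption rate is treated
    # like a price premium; deep queues discourage moving mid-drain.
    interruption_penalty = 1.0 + 2.0 * interruption_rate
    lag_penalty = 1.0 + min(queue_lag_seconds / 3600.0, 1.0)
    return price_per_vcpu_hour * interruption_penalty * lag_penalty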

Frequently Asked Questions

Can I use multi-cloud arbitrage for databases or Kafka clusters?
Not as your first use case. Multi-cloud arbitrage works best for stateless, queue-backed, interruption-tolerant services. Stateful systems can be moved, but the operational cost of replication, quorum management, and cross-cloud latency usually eats the savings.
How often should I relocate workloads between AWS and Google Cloud?
A 15-60 minute control loop is usually enough. Faster loops tend to cause placement flapping, while slower loops leave savings on the table for bursty batch workloads. Always pair the interval with a hysteresis threshold and a minimum hold window.
What is the safest first workload for dynamic relocation?
Start with CI runners, ETL workers, media rendering jobs, or queue consumers. These workloads already assume retries and can tolerate Pod restarts. If the app can be scaled to zero in one cloud without breaking user traffic, it is a strong candidate.
How do I prevent downtime during cloud relocation?
Keep state outside the Pod, use readiness checks, and make shutdown behavior explicit. A good pattern is to let the new cloud become healthy first, then drain the old one by reducing replicas or stopping new work assignment. For long-running jobs, add checkpointing or requeue logic so interruption does not mean lost work.

Bottom Line

Dynamic relocation works when the workload is portable and the controller is conservative. Treat price as a signal, not a command, and you can save real money without turning your platform into a science project.
