Multi-Cloud Arbitrage: 40% Infra Savings [2026]
Bottom Line
For stateless, queue-backed workloads, a small arbitrage controller can compare AWS Spot and Google Cloud Spot economics, move replicas to the cheaper cluster, and cut infrastructure spend without rewriting the app.
Key Takeaways
- AWS Spot can be up to 90% cheaper; Google Cloud Spot discounts can reach 91%.
- Use arbitrage only for stateless, queue-backed, interruption-tolerant workloads.
- Add a 12% price delta threshold and a minimum hold window to stop relocation flapping.
- Keep manifests portable and shift replicas between clusters instead of rebuilding images.
- Verify success by checking winner/standby replica counts and rollout status in both clusters.
Multi-cloud arbitrage sounds exotic, but the mechanics are straightforward: run the same portable workload in two clouds, measure which one is cheaper right now, and move replicas only when the savings are meaningful. For interruption-tolerant services, this turns AWS Spot and Google Cloud Spot into a live market you can automate against. The key is not perfect prediction. The key is a conservative control loop that relocates only when price advantage is clear and operational risk stays low.
- AWS Spot can deliver savings of up to 90% compared to On-Demand pricing.
- Google Cloud Spot VMs can discount compute by up to 91%, with prices that change at most once every 30 days.
- Use this pattern for queue-backed workers and stateless services, not primary stateful systems.
- Add hysteresis, a hold window, and identical manifests to keep relocations safe.
Dynamic relocation works when the workload is portable and the controller is conservative. Treat price as a signal, not a command, and you can save real money without turning your platform into a science project.
Prerequisites
- An existing Amazon EKS cluster and an existing Google Kubernetes Engine cluster.
- kubectl, AWS CLI, gcloud CLI, and bq authenticated against both environments.
- A workload that tolerates interruption, restart, and cross-cloud IP changes.
- Application state stored outside the Pod: queue, object storage, external database, or managed cache.
- A billing export or pricing query you can safely share with teammates. If you circulate raw billing extracts, scrub account IDs first with the Data Masking Tool.
In this walkthrough, the workload is a queue worker. That is the ideal starting point because the relocation unit is simple: increase replicas in the cheaper cloud and scale the other side to zero.
Step 1: Discover Price Signals
1. Connect both clusters
Start by making both Kubernetes contexts reachable from the same control host. On AWS, update-kubeconfig writes or merges the EKS context. On Google Cloud, get-credentials adds the GKE context.
aws eks update-kubeconfig \
--name tb-eks \
--region us-east-1 \
--alias aws-spot
gcloud container clusters get-credentials tb-gke \
--location=us-central1 \
--project YOUR_PROJECT_ID
2. Pull AWS Spot history
AWS exposes recent Spot prices directly. For a controller, you usually care about one or two instance families that match your worker profile.
aws ec2 describe-spot-price-history \
--region us-east-1 \
--instance-types c7g.large \
--product-descriptions Linux/UNIX \
--start-time 2026-05-01T00:00:00Z \
--end-time 2026-05-01T01:00:00Z \
--output json > aws-price.json
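The response shape matters more than the tool: what the controller needs is the cheapest recent price across zones. A minimal sketch of that extraction, using a made-up sample in the same shape as the describe-spot-price-history JSON output:

```python
import json

# Hypothetical sample mirroring the shape of the
# `describe-spot-price-history` JSON; real files hold many more entries.
sample = {
    "SpotPriceHistory": [
        {"InstanceType": "c7g.large", "AvailabilityZone": "us-east-1a",
         "SpotPrice": "0.0301"},
        {"InstanceType": "c7g.large", "AvailabilityZone": "us-east-1b",
         "SpotPrice": "0.0284"},
    ]
}

def lowest_spot_price(history: dict) -> float:
    """Return the cheapest quoted price across all sampled zones."""
    return min(float(entry["SpotPrice"]) for entry in history["SpotPriceHistory"])

print(lowest_spot_price(sample))  # picks the cheapest zone
```

Taking the minimum across zones is a deliberate simplification; a stricter controller might average recent samples or discard zones with poor capacity history.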
3. Pull current Google Cloud Spot pricing
Google Cloud recommends either the Cloud Billing Catalog API or Cloud Billing export. For repeatable automation, the pricing export is easier to query because it lands in a stable table named cloud_pricing_export.
bq query \
--use_legacy_sql=false \
--format=json \
'SELECT
pricing_as_of_time,
sku.description,
tiered_rates.usd_amount AS usd_amount
FROM `BILLING_EXPORT_DATASET.cloud_pricing_export`,
UNNEST(list_price.tiered_rates) AS tiered_rates
WHERE service.description = "Compute Engine"
AND LOWER(sku.description) LIKE "%spot predefined instance core%"
AND LOWER(sku.description) LIKE "%americas%"
ORDER BY pricing_as_of_time DESC
LIMIT 10' > gcp-price.json
This query is intentionally simple. Run it once, inspect the returned SKU descriptions, then tighten the filter for your machine family and region. The important point is architectural: use an official price feed, not a hand-maintained spreadsheet.
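One caveat worth handling before comparing the two feeds: the AWS quote is per instance-hour, while the "spot predefined instance core" SKU is billed per vCPU-hour. A rough normalization sketch; the 2-vCPU figure is an assumption for a c7g.large-class worker, and the memory SKU is omitted for brevity:

```python
# Sketch: normalize a per-core Google Cloud Spot SKU price into a
# per-instance-hour figure comparable with the AWS Spot quote.
# VCPUS_PER_WORKER is an assumption (2 vCPUs, c7g.large-class);
# substitute your actual machine shape, and add the memory SKU
# for a real comparison.
VCPUS_PER_WORKER = 2

def gcp_instance_hourly(per_core_usd: float,
                        vcpus: int = VCPUS_PER_WORKER) -> float:
    # The core SKU is billed per vCPU-hour, so an instance costs
    # roughly vCPUs * core rate.
    return per_core_usd * vcpus

print(gcp_instance_hourly(0.0119))
```

Skipping this normalization makes the controller compare incommensurable numbers, which silently biases every placement decision.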
Step 2: Create a Portable Workload
The arbitrage controller only works if both clusters can run the exact same manifest. Keep the workload boring: no local disk assumptions, no cloud-specific load balancer requirement, and no session affinity.
apiVersion: v1
kind: Namespace
metadata:
  name: arbitrage-demo
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: queue-worker
  namespace: arbitrage-demo
spec:
  replicas: 0
  selector:
    matchLabels:
      app: queue-worker
  template:
    metadata:
      labels:
        app: queue-worker
    spec:
      terminationGracePeriodSeconds: 15
      containers:
      - name: worker
        image: REGISTRY/worker:TAG
        env:
        - name: QUEUE_URL
          value: https://queue.example.internal/jobs
        - name: WORKER_CONCURRENCY
          value: "8"
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "1Gi"
Apply the same manifest to both clusters.
kubectl --context=aws-spot apply -f workload.yaml
kubectl --context=gke_YOUR_PROJECT_ID_us-central1_tb-gke apply -f workload.yaml
With both clusters holding an identical Deployment at zero replicas, the only field the controller ever has to change is replicas. That dramatically reduces blast radius compared to changing image, config, and placement at the same time.
Step 3: Write the Arbitrage Controller
Now build a tiny controller. It reads both price feeds, applies a switching threshold, and pushes replicas to the winner. The example below uses a 12% savings threshold so you do not churn for pennies.
import json
import subprocess
from pathlib import Path

AWS_CONTEXT = "aws-spot"
GCP_CONTEXT = "gke_YOUR_PROJECT_ID_us-central1_tb-gke"
NAMESPACE = "arbitrage-demo"
DEPLOYMENT = "queue-worker"
DESIRED_REPLICAS = 4
SWITCH_DELTA = 0.12  # require a 12% price advantage before switching
STATE_FILE = Path("arbiter-state.json")

def load_aws_price():
    # Cheapest recent AWS Spot price across the sampled zones.
    data = json.loads(Path("aws-price.json").read_text())
    prices = [float(x["SpotPrice"]) for x in data["SpotPriceHistory"]]
    return min(prices)

def load_gcp_price():
    # Most recent row from the pricing-export query.
    rows = json.loads(Path("gcp-price.json").read_text())
    return float(rows[0]["usd_amount"])

def winner(aws_price, gcp_price, last_winner):
    # Hysteresis: declare a new winner only when it is cheaper by more
    # than SWITCH_DELTA; otherwise stick with the incumbent.
    if aws_price < gcp_price * (1 - SWITCH_DELTA):
        return "aws"
    if gcp_price < aws_price * (1 - SWITCH_DELTA):
        return "gcp"
    return last_winner or ("aws" if aws_price <= gcp_price else "gcp")

def scale(context, replicas):
    patch = {"spec": {"replicas": replicas}}
    subprocess.run([
        "kubectl", "--context", context,
        "-n", NAMESPACE,
        "patch", f"deployment/{DEPLOYMENT}",
        "--type=merge",
        "-p", json.dumps(patch),
    ], check=True)

def main():
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    aws_price = load_aws_price()
    gcp_price = load_gcp_price()
    target = winner(aws_price, gcp_price, state.get("winner"))
    if target == "aws":
        scale(AWS_CONTEXT, DESIRED_REPLICAS)
        scale(GCP_CONTEXT, 0)
    else:
        scale(GCP_CONTEXT, DESIRED_REPLICAS)
        scale(AWS_CONTEXT, 0)
    result = {
        "winner": target,
        "aws_price": aws_price,
        "gcp_price": gcp_price,
    }
    STATE_FILE.write_text(json.dumps(result, indent=2))
    print(json.dumps(result, indent=2))

if __name__ == "__main__":
    main()
This controller is intentionally minimal. It does not try to be smart about partial placement, GPU drift, or service mesh migration. It solves one valuable problem: move all replicas of a portable worker to the cheaper cloud only when the gap is large enough to matter.
Step 4: Schedule and Run It
4. Run the control loop every 30 minutes
Do not reevaluate every minute. Both AWS and Google Cloud can change capacity characteristics faster than your workload should move. A 30-minute interval is a good starting point for batch workers.
*/30 * * * * cd /opt/arbitrage && /usr/bin/python3 arbiter.py >> /var/log/arbiter.log 2>&1
5. Add the guardrails that stop most outages
- A switching threshold: require at least 10-15% effective savings before moving.
- A hold window: keep the current winner for a few hours unless the price gap becomes extreme.
- Graceful shutdown: keep terminationGracePeriodSeconds realistic so workers finish or requeue jobs cleanly.
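The hold window bolts onto the state file the controller already writes. A minimal sketch, assuming a moved_at timestamp is recorded on each relocation (the controller above does not record one yet, and HOLD_SECONDS and EXTREME_DELTA are illustrative values):

```python
import json
import time
from pathlib import Path

STATE_FILE = Path("arbiter-state.json")
HOLD_SECONDS = 4 * 3600   # keep the current winner for at least 4 hours
EXTREME_DELTA = 0.30      # ...unless the price gap exceeds 30%

def hold_window_allows_move(candidate, aws_price, gcp_price, now=None):
    """Return True if relocating to `candidate` respects the hold window."""
    now = time.time() if now is None else now
    if not STATE_FILE.exists():
        return True  # first run: nothing to hold
    state = json.loads(STATE_FILE.read_text())
    if state.get("winner") == candidate:
        return True  # staying put is always allowed
    if now - state.get("moved_at", 0) >= HOLD_SECONDS:
        return True  # the hold window has elapsed
    # Inside the window: move early only when the gap is extreme.
    cheap, dear = sorted([aws_price, gcp_price])
    return cheap < dear * (1 - EXTREME_DELTA)
```

Calling this check before scale() in main(), and writing moved_at alongside the winner, is enough to stop most flapping without any extra infrastructure.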
Verification and Expected Output
Your first validation is simple: one cluster should hold live replicas, the other should be cold. Then confirm the active side completed rollout successfully.
python3 arbiter.py
kubectl --context=aws-spot -n arbitrage-demo get deployment queue-worker
kubectl --context=gke_YOUR_PROJECT_ID_us-central1_tb-gke -n arbitrage-demo get deployment queue-worker
kubectl --context=aws-spot -n arbitrage-demo rollout status deployment/queue-worker --timeout=120s
kubectl --context=gke_YOUR_PROJECT_ID_us-central1_tb-gke -n arbitrage-demo rollout status deployment/queue-worker --timeout=120s
Expected output looks like this:
{
"winner": "gcp",
"aws_price": 0.0284,
"gcp_price": 0.0239
}
NAME READY UP-TO-DATE AVAILABLE AGE
queue-worker 0/0 0 0 12m
NAME READY UP-TO-DATE AVAILABLE AGE
queue-worker 4/4 4 4 12m
If that matches, the pattern is working. You now have a relocator that can drain one cloud and light up the other without rebuilding the app.
Troubleshooting Top 3
- Problem: the workload flips clouds too often. Fix: raise the threshold from 12% to 15% and add a minimum hold time in the state file before another move is allowed.
- Problem: Pods start in the target cloud but fail requests. Fix: check outbound firewall rules, queue reachability, DNS, and secrets parity before blaming the scheduler.
- Problem: relocation succeeds but jobs are lost mid-run. Fix: make the worker idempotent, checkpoint progress, and requeue unacknowledged work before shutting down the old side.
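For that last problem the fix lives in the worker, not the controller. A sketch of an interruption-tolerant loop; the queue client and its lease/ack/nack methods are hypothetical stand-ins for your real queue semantics (SQS visibility timeouts, Pub/Sub ack deadlines):

```python
import signal

# Interruption-tolerant worker loop. On SIGTERM we finish the current
# job and stop leasing; unacknowledged jobs reappear automatically when
# their lease expires, so nothing is lost mid-relocation.
shutting_down = False

def _on_sigterm(signum, frame):
    global shutting_down
    shutting_down = True  # finish the in-flight job, then exit

signal.signal(signal.SIGTERM, _on_sigterm)

def run(queue):
    while not shutting_down:
        job = queue.lease()      # job becomes invisible, not deleted
        if job is None:
            continue
        try:
            job.process()
            queue.ack(job)       # delete only after success
        except Exception:
            queue.nack(job)      # make it visible again on failure
```

Pair this with the manifest's terminationGracePeriodSeconds so the grace period comfortably covers one job's runtime.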
What's Next
- Replace the simple price files with a real pipeline that joins price, interruption rate, and queue lag into a single placement score.
- Promote from active/cold to active/warm once you trust the economics and need faster cutover.
- Add cloud-specific penalties instead of raw price comparison alone. AWS Spot can interrupt with two minutes of notice; Google Cloud Spot VMs get roughly a 30-second shutdown window after preemption notice, so price is only part of the decision.
- Clean up and standardize your scripts before handing them to platform teams. For quick formatting passes, the Code Formatter is useful for shell and Python snippets.
Frequently Asked Questions
Can I use multi-cloud arbitrage for databases or Kafka clusters?
No. Reserve this pattern for stateless, queue-backed, interruption-tolerant workloads; primary stateful systems like databases and Kafka should not be relocated this way.
How often should I relocate workloads between AWS and Google Cloud?
Re-evaluate on roughly a 30-minute cadence, and only move when the savings clear your switching threshold and the hold window has elapsed.
What is the safest first workload for dynamic relocation?
A queue-backed worker, because the relocation unit is simple: scale replicas up in the cheaper cloud and down to zero in the other.
How do I prevent downtime during cloud relocation?
Keep terminationGracePeriodSeconds realistic so workers finish or requeue jobs, verify rollout status in the target cluster, and promote to an active/warm standby once you need faster cutover.