Graviton5 Batch Orchestration on AWS [Deep Dive 2026]
Bottom Line
The reliable pattern is simple: keep AWS Batch queues architecture-pure, publish a real ARM64 image, and let Batch launch Amazon Linux 2023 ECS hosts on Graviton5-backed capacity. As of May 15, 2026, that usually means targeting preview M9g instances where your account and Region expose them, then falling back to M8g only when access blocks rollout.
Key Takeaways
- As of May 15, 2026, AWS documents M9g as a Graviton5 preview family.
- AWS Batch job queues cannot mix compute environments with different CPU architectures.
- Use ECS_AL2023 for new Batch EC2 environments; AL2 creation is blocked after June 30, 2026.
- Build and push your container explicitly for linux/arm64 to avoid exec format errors.
AWS Batch is already good at scaling embarrassingly parallel work, but many teams still leave efficiency on the table by treating CPU architecture as an afterthought. The better pattern is to make architecture an explicit scheduling boundary. On May 15, 2026, that means pairing AWS Batch with ARM64 containers and Graviton-backed EC2 capacity, then targeting M9g where your account has Graviton5 preview access.
- M9g offers up to 25% better compute performance than M8g, according to the AWS product page.
- AWS Batch requires all compute environments attached to one queue to share the same architecture.
- ECS_AL2023 is the default AMI family for new Batch EC2 environments and the right default for 2026 builds.
- If your container is not published for linux/arm64, the fleet design is correct but the job still fails.
Prerequisites
- An AWS account with AWS Batch, EC2, ECR, and CloudWatch Logs access.
- AWS CLI v2, Docker with buildx, and permissions to create an ECR repository.
- At least two subnets and one security group for the compute environment.
- An ECS instance profile such as ecsInstanceRole, or your own equivalent instance profile ARN.
- Access to M9g preview capacity in your Region, or willingness to swap to M8g as a temporary fallback.
Reference the official AWS docs for Amazon EC2 M9g instances, create-compute-environment, and ARM64 ECS workloads.
Step 1: Check Graviton5 Capacity
Confirm identity and instance availability
- Pick a Region and confirm which account you are using.
- Query EC2 for M9g offerings before you create any Batch resources.
- If the query returns nothing, keep the workflow and replace the instance family with M8g until preview access is enabled.
export AWS_REGION=us-east-1
export ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
aws ec2 describe-instance-type-offerings \
--region "$AWS_REGION" \
--filters "Name=instance-type,Values=m9g.*" \
--query 'InstanceTypeOfferings[].InstanceType' \
--output text
This step matters because ARM orchestration is only as good as the capacity pools behind it. AWS also notes that for EC2-backed ARM workloads you should verify Region support with describe-instance-type-offerings before deployment.
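The same offering check can drive the fallback decision programmatically. A minimal sketch of that policy (the helper name and the m8g fallback rule are this article's assumptions, not an AWS API):

```python
# Decide which Graviton family to target based on the instance types returned
# by `aws ec2 describe-instance-type-offerings` (or the equivalent boto3 call).
def choose_instance_family(offerings: list[str]) -> str:
    """Return 'm9g' when any M9g size is offered, otherwise fall back to 'm8g'."""
    if any(t.startswith("m9g.") for t in offerings):
        return "m9g"
    return "m8g"

# Preview access enabled in the Region:
print(choose_instance_family(["m9g.large", "m9g.xlarge"]))  # m9g
# No preview access yet:
print(choose_instance_family(["m8g.large", "c8g.large"]))   # m8g
```

Feed it the `--output text` result split on whitespace, and the rest of the walkthrough stays identical either way.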
Step 2: Build an ARM64 Image
Create a tiny batch workload
- Create a minimal program that prints the runtime architecture and current Batch job ID.
- Build for linux/arm64, not accidentally for your laptop's native architecture.
- Push to Amazon ECR so Batch nodes can pull it directly.
mkdir -p batch-arm64-demo
cd batch-arm64-demo
cat > app.py <<'PY'
import json
import os
import platform
print(json.dumps({
"arch": platform.machine(),
"job_id": os.environ.get("AWS_BATCH_JOB_ID"),
"job_attempt": os.environ.get("AWS_BATCH_JOB_ATTEMPT")
}))
PY
cat > Dockerfile <<'EOF'
FROM python:3.12-slim
WORKDIR /app
COPY app.py .
CMD ["python", "/app/app.py"]
EOF
aws ecr create-repository \
--repository-name batch-arm64-demo \
--region "$AWS_REGION"
aws ecr get-login-password --region "$AWS_REGION" | \
docker login --username AWS --password-stdin \
"$ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com"
docker buildx create --name armbuilder --use
docker buildx inspect --bootstrap
docker buildx build \
--platform linux/arm64 \
-t "$ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/batch-arm64-demo:latest" \
--push .
Step 3: Create Batch Resources
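Before creating the Batch resources, it is worth confirming that the pushed image really declares a linux/arm64 platform. One way is to dump the manifest list (for example with `docker buildx imagetools inspect --raw`) and check the platforms it lists; the parsing helper below is a sketch of that check, not a Docker or ECR API:

```python
import json

def manifest_has_arm64(raw_manifest: str) -> bool:
    """Return True when a manifest list declares a linux/arm64 platform."""
    doc = json.loads(raw_manifest)
    for entry in doc.get("manifests", []):
        platform = entry.get("platform", {})
        if platform.get("os") == "linux" and platform.get("architecture") == "arm64":
            return True
    return False

# A trimmed manifest list of the shape `imagetools inspect --raw` prints.
raw = json.dumps({
    "manifests": [
        {"platform": {"os": "linux", "architecture": "arm64"}},
    ]
})
print(manifest_has_arm64(raw))  # True
```

Catching a wrong-architecture image here is far cheaper than debugging an exec format error after the fleet is up.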
Create an ARM-only compute environment
AWS Batch states that all compute environments attached to one queue must share the same architecture, so the clean design is one ARM queue for ARM jobs. Also note the current AWS warning: new ECS compute environments should use ECS_AL2023, not AL2.
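This constraint is easy to lint for in infrastructure code before AWS rejects the queue. A sketch under a crude naming assumption (Graviton family tokens such as m9g, m8g, and c8g end in "g"; the helper functions are illustrative, not an AWS API):

```python
def family_arch(instance_type: str) -> str:
    """Crude classifier: treat instance families ending in 'g' (m9g, m8g, c8g)
    as ARM64 Graviton, everything else as x86_64."""
    family = instance_type.split(".")[0]
    return "arm64" if family.endswith("g") else "x86_64"

def queue_is_architecture_pure(instance_types: list[str]) -> bool:
    """True when every instance type across a queue's compute environments
    resolves to the same CPU architecture."""
    archs = {family_arch(t) for t in instance_types}
    return len(archs) <= 1

print(queue_is_architecture_pure(["m9g.large", "m9g.2xlarge"]))  # True
print(queue_is_architecture_pure(["m9g.large", "m7i.large"]))    # False
```

Run it over the union of instanceTypes from every compute environment a queue references, and fail the deploy when it returns False.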
cat > compute-environment.json <<'JSON'
{
"computeEnvironmentName": "graviton5-batch-ce",
"type": "MANAGED",
"state": "ENABLED",
"computeResources": {
"type": "EC2",
"allocationStrategy": "BEST_FIT_PROGRESSIVE",
"minvCpus": 0,
"maxvCpus": 128,
"instanceTypes": ["m9g.large", "m9g.xlarge", "m9g.2xlarge"],
"subnets": ["subnet-0123456789abcdef0", "subnet-0123456789abcdef1"],
"securityGroupIds": ["sg-0123456789abcdef0"],
"instanceRole": "ecsInstanceRole",
"ec2Configuration": [
{ "imageType": "ECS_AL2023" }
],
"tags": {
"Name": "graviton5-batch"
}
}
}
JSON
aws batch create-compute-environment \
--cli-input-json file://compute-environment.json \
--region "$AWS_REGION"
Create the job queue
export COMPUTE_ENV_ARN=$(aws batch describe-compute-environments \
--compute-environments graviton5-batch-ce \
--region "$AWS_REGION" \
--query 'computeEnvironments[0].computeEnvironmentArn' \
--output text)
cat > job-queue.json <<JSON
{
"jobQueueName": "graviton5-batch-queue",
"state": "ENABLED",
"priority": 10,
"computeEnvironmentOrder": [
{
"order": 1,
"computeEnvironment": "$COMPUTE_ENV_ARN"
}
],
"jobQueueType": "ECS"
}
JSON
aws batch create-job-queue \
--cli-input-json file://job-queue.json \
--region "$AWS_REGION"
If you do not have M9g preview access, change every m9g entry in the compute environment file to m8g. Do not mix architectures in the same queue as a workaround; AWS explicitly disallows that queue design.
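That fallback swap can be mechanical rather than hand-edited. A sketch that rewrites the instance families inside the compute environment payload (the function name is this article's, not an AWS tool):

```python
import json

def swap_instance_family(ce_json: str, old: str = "m9g", new: str = "m8g") -> str:
    """Replace every instance family in a compute environment payload,
    e.g. m9g.large -> m8g.large, leaving all other fields untouched."""
    doc = json.loads(ce_json)
    types = doc["computeResources"]["instanceTypes"]
    doc["computeResources"]["instanceTypes"] = [
        t.replace(old + ".", new + ".", 1) for t in types
    ]
    return json.dumps(doc, indent=2)

payload = json.dumps({
    "computeResources": {"instanceTypes": ["m9g.large", "m9g.xlarge"]}
})
print(swap_instance_family(payload))
```

Pipe compute-environment.json through it once preview access lands (or is revoked) and re-run create-compute-environment with a new environment name.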
Step 4: Submit and Verify
Register the job definition
cat > job-definition.json <<JSON
{
"jobDefinitionName": "arm64-demo",
"type": "container",
"platformCapabilities": ["EC2"],
"containerProperties": {
"image": "$ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/batch-arm64-demo:latest",
"command": ["python", "/app/app.py"],
"resourceRequirements": [
{ "type": "VCPU", "value": "1" },
{ "type": "MEMORY", "value": "2048" }
]
}
}
JSON
aws batch register-job-definition \
--cli-input-json file://job-definition.json \
--region "$AWS_REGION"
Submit the job
export JOB_ID=$(aws batch submit-job \
--job-name arm64-demo-1 \
--job-queue graviton5-batch-queue \
--job-definition arm64-demo \
--region "$AWS_REGION" \
--query jobId \
--output text)
echo "$JOB_ID"
Verification and expected output
- Wait for the compute environment to become VALID.
- Wait for the job status to move from SUBMITTED to RUNNING and then SUCCEEDED.
- Check the container output for aarch64.
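Rather than re-running the describe commands by hand, the wait can be scripted. A minimal polling sketch that takes the describe call as a parameter, so it can wrap boto3's batch.describe_jobs in real use (the helper itself is this article's, not an AWS SDK API):

```python
import time

TERMINAL = {"SUCCEEDED", "FAILED"}

def wait_for_job(describe, job_id: str, poll_seconds: float = 5.0,
                 max_polls: int = 120) -> str:
    """Poll `describe(job_id)` (which must return the current Batch status
    string) until the job reaches a terminal state."""
    for _ in range(max_polls):
        status = describe(job_id)
        if status in TERMINAL:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")

# Simulated progression like a real describe_jobs call might report.
states = iter(["SUBMITTED", "RUNNABLE", "STARTING", "RUNNING", "SUCCEEDED"])
print(wait_for_job(lambda job_id: next(states), "demo-job", poll_seconds=0.0))
# SUCCEEDED
```

In production you would also cap the total wall-clock time and surface statusReason on failure.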
aws batch describe-compute-environments \
--compute-environments graviton5-batch-ce \
--region "$AWS_REGION" \
--query 'computeEnvironments[0].[status,state,statusReason]' \
--output table
aws batch describe-jobs \
--jobs "$JOB_ID" \
--region "$AWS_REGION" \
--query 'jobs[0].[status,container.exitCode]' \
--output table
Expected results:
- The compute environment reports VALID and ENABLED.
- The job finishes with SUCCEEDED and exit code 0.
- Your application output includes "arch": "aarch64", proving the job executed on an ARM64 runtime.
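That last check can be automated against the CloudWatch log line the demo app prints. A tiny sketch (the helper name is illustrative):

```python
import json

def ran_on_arm64(log_line: str) -> bool:
    """True when the demo app's JSON output reports an ARM64 machine."""
    return json.loads(log_line).get("arch") == "aarch64"

print(ran_on_arm64('{"arch": "aarch64", "job_id": "abc123"}'))  # True
print(ran_on_arm64('{"arch": "x86_64", "job_id": "abc123"}'))   # False
```

Wiring this into CI turns the architecture guarantee into a regression test instead of a one-time manual check.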
Troubleshooting and What's Next
Troubleshooting top 3
- Compute environment is INVALID: Check statusReason, confirm subnet and security group reachability, and verify that your account actually sees M9g offerings in the target Region.
- Job is stuck in RUNNABLE: The queue may reference a compute environment that is not yet VALID, your requested vCPU or memory may not fit the allowed instance sizes, or your service quotas may be too low.
- Container fails with an architecture error: Rebuild and repush with docker buildx build --platform linux/arm64 --push. This is the most common breakage when a team migrates the fleet before migrating the image.
What's next
- Add a second ARM queue that uses Spot capacity for fault-tolerant single-node jobs. AWS recommends the SPOT_PRICE_CAPACITY_OPTIMIZED allocation strategy for Spot compute resources.
- Split job queues by latency class, not just by team. High-priority queues can map to the same ARM architecture but different capacity policies.
- Add a launch template only when you need deeper host tuning. Otherwise, let Batch keep selecting the latest supported ECS_AL2023 AMI.
- Standardize architecture in CI so every merge publishes both linux/arm64 and linux/amd64 images, even if Batch only consumes the ARM tag today.
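That CI standardization can collapse into one multi-platform publish step. A sketch that composes the buildx invocation (registry, repo, and tag values here are placeholders):

```python
def buildx_command(registry: str, repo: str, tag: str,
                   platforms: tuple[str, ...] = ("linux/arm64", "linux/amd64")) -> str:
    """Compose a multi-platform `docker buildx build --push` invocation."""
    return (
        "docker buildx build "
        f"--platform {','.join(platforms)} "
        f"-t {registry}/{repo}:{tag} --push ."
    )

print(buildx_command("123456789012.dkr.ecr.us-east-1.amazonaws.com",
                     "batch-arm64-demo", "latest"))
```

Emitting the command from one place keeps every pipeline publishing the same platform set, so the ARM tag Batch consumes can never silently drift from the amd64 one.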
The main lesson is that Graviton efficiency does not come from the instance family alone. It comes from making architecture visible in your build pipeline, queue boundaries, and AMI selection policy.
Frequently Asked Questions
Can AWS Batch mix ARM64 and x86 compute environments in one job queue?
No. All compute environments attached to a single queue must share the same CPU architecture, so run a dedicated ARM64 queue.
Do I need a custom AMI to run Graviton5 jobs on AWS Batch?
No. The default ECS_AL2023 image type boots ARM64 hosts automatically when the compute environment specifies Graviton instance types.
Why is my AWS Batch job stuck in RUNNABLE on M9g?
Check describe-instance-type-offerings for the Region, the compute environment statusReason, and your requested resourceRequirements.
Should I use Spot for Graviton-based batch processing?
Yes, for fault-tolerant jobs; pair an ARM64 Spot compute environment with the SPOT_PRICE_CAPACITY_OPTIMIZED allocation strategy.