Terraform Self-Healing Infrastructure [How-To 2026]
Bottom Line
The reliable pattern is not “let the model fix prod.” It is “let Terraform define the blast radius, let the LLM classify the incident, and only allow a tiny, audited set of remediation actions.”
Key Takeaways
- Use terraform plan -detailed-exitcode to detect actionable drift in automation.
- Parse plans with terraform show -json; never scrape human CLI text.
- Keep the LLM on a strict allowlist: noop, refresh-only, apply saved plan, or -replace.
- Use -refresh-only instead of deprecated terraform refresh.
- Ship redacted plan context to the model and log every decision with the saved plan file.
Terraform already gives you the most important primitive for self-healing: a deterministic diff between declared and real infrastructure. The LLM should not invent infrastructure changes; it should classify a narrow failure signal, choose from a small remediation menu, and hand execution back to Terraform. In this tutorial, you will build that control loop using Terraform v1.14.x, machine-readable plan output, and a policy-bounded remediation agent.
Prerequisites
Before you start
- Terraform CLI v1.14.x (any release in the 1.14 line).
- An AWS account and credentials already configured for the CLI.
- Python 3.11+ for the remediation worker.
- jq for lightweight JSON inspection.
- A remote Terraform backend for team-safe state locking.
- An LLM endpoint that supports schema-constrained JSON output.
Bottom Line
A self-healing loop is safe only when the model chooses from predefined actions and Terraform still owns the final diff. Treat the LLM as a classifier, not as an unrestricted operator.
1. Define the healing loop
The architecture is simple: detect drift, serialize the plan, classify the fix, then execute one approved action. That separation matters because Terraform is excellent at reconciliation, while the model is useful for interpreting context like alerts, runbooks, and recent changes.
- Run terraform plan -detailed-exitcode -out=tfplan on a schedule or from an alert trigger.
- If the exit code is 0, there is no change and the workflow exits.
- If the exit code is 2, convert the saved plan with terraform show -json tfplan.
- Send a redacted summary plus operational context to the model.
- Allow the model to return only one of four actions: noop, refresh_state, revert_drift, or replace_instance.
- Execute the matching Terraform command and persist logs, plan JSON, and the model decision together.
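The detection step above can be sketched in Python as a small state mapping. This is a minimal sketch, not the article's implementation: the PlanState enum and the step names returned by next_step are hypothetical, while the exit-code meanings (0 for no changes, 2 for changes present, 1 for errors) follow the documented behavior of -detailed-exitcode.

```python
from enum import Enum


class PlanState(Enum):
    CLEAN = "clean"  # exit code 0: plan succeeded, no changes
    DRIFT = "drift"  # exit code 2: plan succeeded, changes present
    ERROR = "error"  # exit code 1 or anything unexpected


def classify_plan_exit(exit_code: int) -> PlanState:
    """Map a `terraform plan -detailed-exitcode` exit code to a loop state."""
    if exit_code == 0:
        return PlanState.CLEAN
    if exit_code == 2:
        return PlanState.DRIFT
    return PlanState.ERROR


def next_step(state: PlanState) -> str:
    """Return a (hypothetical) name for the next step of the healing loop."""
    return {
        PlanState.CLEAN: "exit",
        PlanState.DRIFT: "show_json_and_classify",
        PlanState.ERROR: "alert_human",
    }[state]
```

Treating every code other than 0 and 2 as an error keeps the loop fail-closed: the model is consulted only when Terraform has produced a valid, non-empty plan.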
2. Author recoverable Terraform
For a concrete demo, we will manage a tiny EC2-based app and a security group. The healing use case is straightforward: if someone manually opens 22/tcp on the security group, Terraform detects the drift and the agent chooses to reapply the saved plan.
terraform {
  required_version = ">= 1.14.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

resource "aws_security_group" "app" {
  name        = "tb-self-healing-app"
  description = "App ingress"

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_instance" "app" {
  ami                    = "ami-xxxxxxxxxxxxxxxxx"
  instance_type          = "t3.micro"
  vpc_security_group_ids = [aws_security_group.app.id]

  tags = {
    Name = "tb-self-healing-app"
  }
}

output "instance_id" {
  value = aws_instance.app.id
}

Apply it once, then intentionally create drift in the console by adding an SSH ingress rule. The important point is not the EC2 example itself; it is the workflow: the desired state is explicit, drift is externally introduced, and Terraform can prove the exact correction.
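To see what "the exact correction" looks like in machine-readable form, the sketch below lists which top-level attributes differ between the before and after objects of one resource_changes entry in the plan JSON. The changed_attributes helper and the sample entry are hypothetical: the sample is a hand-trimmed illustration of the shape, not real terraform show -json output. Attributes that are unknown until apply are omitted from after, so they would also appear in the result.

```python
def changed_attributes(resource_change: dict) -> list[str]:
    """Return the top-level attribute names whose values differ in a planned change."""
    change = resource_change.get("change", {})
    before = change.get("before") or {}  # null for resources being created
    after = change.get("after") or {}    # null for resources being destroyed
    keys = set(before) | set(after)
    return sorted(k for k in keys if before.get(k) != after.get(k))


# Hand-written, heavily trimmed sample for illustration only.
sample = {
    "address": "aws_security_group.app",
    "change": {
        "actions": ["update"],
        "before": {
            "name": "tb-self-healing-app",
            "ingress": [{"from_port": 22}, {"from_port": 443}],
        },
        "after": {
            "name": "tb-self-healing-app",
            "ingress": [{"from_port": 443}],
        },
    },
}

print(changed_attributes(sample))
```

Returning attribute names rather than values keeps the summary safe to pass along as context, since rule contents and any secrets stay out of it.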
3. Build the remediation agent
The agent should read only what it needs: plan JSON, a short alert description, and a hardcoded allowlist. Keep business logic outside the model. The model decides among approved actions; your code validates and executes them.
#!/usr/bin/env bash
set -euo pipefail
terraform init -input=false
set +e
terraform plan -detailed-exitcode -out=tfplan
PLAN_EXIT=$?
set -e
if [ "$PLAN_EXIT" -eq 0 ]; then
  echo '{"status":"clean"}'
  exit 0
fi
if [ "$PLAN_EXIT" -ne 2 ]; then
  echo "plan failed" >&2
  exit 1
fi
terraform show -json tfplan > plan.json
python3 agent.py

import json
from pathlib import Path
ALLOWED_ACTIONS = {
    "noop",
    "refresh_state",
    "revert_drift",
    "replace_instance",
}
ALLOWED_REPLACE_TARGETS = {"aws_instance.app"}

plan = json.loads(Path("plan.json").read_text())

# Send only addresses and actions to the model, never attribute values.
changes = [
    {
        "address": rc["address"],
        "actions": rc["change"]["actions"],
    }
    for rc in plan.get("resource_changes", [])
    if rc["change"]["actions"] != ["no-op"]  # skip unchanged resources
]

prompt_payload = {
    "incident": "Security group drift detected from scheduled plan run.",
    "resource_changes": changes,
    "allowed_actions": sorted(ALLOWED_ACTIONS),
    "replace_targets": sorted(ALLOWED_REPLACE_TARGETS),
    "policy": [
        "Prefer noop if confidence is low.",
        "Use revert_drift for config drift that matches declared Terraform.",
        "Use refresh_state only for accepted out-of-band changes.",
        "Use replace_instance only for the allowed target and only when health is degraded.",
    ],
}

# Replace this with your provider call. Require strict JSON output.
decision = {
    "action": "revert_drift",
    "target": None,
    "reason": "Ingress drift exists on a managed security group.",
}

# Validate in code; never trust the model's output as-is.
if decision["action"] not in ALLOWED_ACTIONS:
    raise SystemExit("Rejected: action not allowed")
if decision["action"] == "replace_instance" and decision.get("target") not in ALLOWED_REPLACE_TARGETS:
    raise SystemExit("Rejected: replace target not allowed")

Path("decision.json").write_text(json.dumps(decision, indent=2))
print(json.dumps(decision))

If you use OpenAI, pair the worker with Structured Outputs so the response must match your decision schema. That removes most of the brittle parsing work. If you want to normalize the mixed Python and shell snippets before committing them, the TechBytes Code Formatter is a practical cleanup step.
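Even when the provider enforces a schema, the worker can re-validate the raw reply before anything is written to decision.json. The following is a minimal sketch using only the standard library. The parse_decision and safe_decision helpers, the exact three-key shape, and the fallback to noop on any validation failure are assumptions layered on the worker above, chosen to match its "prefer noop if confidence is low" policy.

```python
import json

ALLOWED_ACTIONS = {"noop", "refresh_state", "revert_drift", "replace_instance"}
ALLOWED_REPLACE_TARGETS = {"aws_instance.app"}
EXPECTED_KEYS = {"action", "target", "reason"}
NOOP = {"action": "noop", "target": None, "reason": "Model reply failed validation."}


def parse_decision(raw: str) -> dict:
    """Parse the model's reply and raise ValueError on any deviation from the schema."""
    decision = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(decision, dict) or set(decision) != EXPECTED_KEYS:
        raise ValueError("decision must have exactly: action, target, reason")
    action, target = decision["action"], decision["target"]
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ValueError("action not allowed")
    if not isinstance(decision["reason"], str):
        raise ValueError("reason must be a string")
    if action == "replace_instance":
        if not isinstance(target, str) or target not in ALLOWED_REPLACE_TARGETS:
            raise ValueError("replace target not allowed")
    elif target is not None:
        raise ValueError("target is only valid for replace_instance")
    return decision


def safe_decision(raw: str) -> dict:
    """Fail closed: any invalid reply becomes a noop decision (an assumed policy)."""
    try:
        return parse_decision(raw)
    except ValueError:
        return dict(NOOP)
```

Rejecting extra keys as well as unknown actions means a reply that smuggles in a command string or an unexpected field never reaches the executor.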
4. Run and verify
Now add a tiny executor that maps each approved action to one Terraform command. Note that the model never constructs raw shell commands.
#!/usr/bin/env bash
set -euo pipefail
ACTION=$(jq -r '.action' decision.json)
TARGET=$(jq -r '.target // empty' decision.json)
case "$ACTION" in
  noop)
    echo "No action taken"
    ;;
  refresh_state)
    terraform apply -refresh-only -auto-approve
    ;;
  revert_drift)
    terraform apply -auto-approve tfplan
    ;;
  replace_instance)
    terraform apply -replace="$TARGET" -auto-approve
    ;;
  *)
    echo "Rejected unknown action" >&2
    exit 1
    ;;
esac

Verification and expected output
- Introduce drift by manually adding an SSH ingress rule.
- Run the detector script.
- Confirm that decision.json contains revert_drift.
- Run the executor and inspect the apply summary.
Terraform will perform the following actions:
  # aws_security_group.app will be updated in-place
  ~ resource "aws_security_group" "app" {
        ...
    }

Plan: 0 to add, 1 to change, 0 to destroy.

Apply complete! Resources: 0 added, 1 changed, 0 destroyed.

A healthy run leaves three useful artifacts: the saved plan file, the JSON decision, and the apply log. Together they make post-incident review much easier than a free-form chatbot transcript.
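One way to keep those three artifacts together is a single audit record that stores the decision alongside hashes of the plan file and the apply log. This is a sketch under assumptions: the write_audit_record helper, the record's field names, and the apply.log and audit.json file names are all hypothetical, and your incident tooling may want a different shape.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file so the record can later prove which artifact was used."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def write_audit_record(plan_file: Path, decision_file: Path,
                       log_file: Path, out_file: Path) -> dict:
    """Bundle the decision with hashes of the saved plan and the apply log."""
    record = {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "plan_file": plan_file.name,
        "plan_sha256": sha256_of(plan_file),
        "decision": json.loads(decision_file.read_text()),
        "apply_log_sha256": sha256_of(log_file),
    }
    out_file.write_text(json.dumps(record, indent=2))
    return record


if __name__ == "__main__":
    # Assumes the detector and executor already produced these files,
    # and that the executor's output was captured to apply.log.
    write_audit_record(Path("tfplan"), Path("decision.json"),
                       Path("apply.log"), Path("audit.json"))
```

Hashing the binary plan instead of embedding it keeps the record small while still tying the decision to the exact plan that was applied.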
Troubleshooting: top 3 issues
- The workflow keeps choosing refresh_state: your prompt is probably treating out-of-band changes as acceptable by default. Tighten policy text so declared configuration wins unless a human-approved exception exists.
- The plan says changes exist, but the wrong resource gets targeted: do not let the model choose arbitrary addresses. Validate against a static allowlist such as aws_instance.app.
- State updates feel risky: avoid deprecated terraform refresh. Use -refresh-only so the run is explicit and reviewable.
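For the second issue, it can help to re-check the target at execution time as well, and to build the command as an argument list so no model-supplied text passes through a shell. The sketch below mirrors the bash executor's four actions in Python. The build_command helper is hypothetical, and the Terraform flags are the same ones used in the executor script.

```python
ALLOWED_REPLACE_TARGETS = {"aws_instance.app"}


def build_command(decision: dict, plan_file: str = "tfplan") -> list[str]:
    """Translate an approved decision into a fixed argv list (empty for noop)."""
    action = decision.get("action")
    if action == "noop":
        return []
    if action == "refresh_state":
        return ["terraform", "apply", "-refresh-only", "-auto-approve"]
    if action == "revert_drift":
        return ["terraform", "apply", "-auto-approve", plan_file]
    if action == "replace_instance":
        target = decision.get("target")
        # Re-validate here even if the agent already checked the target.
        if not isinstance(target, str) or target not in ALLOWED_REPLACE_TARGETS:
            raise ValueError("replace target not allowed")
        return ["terraform", "apply", f"-replace={target}", "-auto-approve"]
    raise ValueError("unknown action")
```

A caller could hand the non-empty result to subprocess.run(argv, check=True). Because the list is never joined into a shell string, a hostile target value cannot inject extra commands, and the allowlist check rejects it before that point anyway.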
What's next
Once this pattern works for a single service, expand it carefully rather than making the model more autonomous.
- Add a policy tier that requires manual approval for destructive plans or changes touching iam, kms, or data stores.
- Feed in alert context from CloudWatch, Datadog, or PagerDuty so the model can distinguish drift from runtime faults.
- Store decision records in your incident system and measure false positives, mean time to remediation, and rollback rate.
- Graduate from single-instance demos to stateless fleets, where -replace is safer and more operationally realistic.
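The manual-approval tier from the first item can start as a small gate over the plan's resource_changes list. This is a rough sketch: the requires_manual_approval helper is hypothetical, and the prefix list is an illustrative assumption about which AWS resource types count as IAM, KMS, or data stores in your estate, not a complete inventory.

```python
# Illustrative only: extend this to match the resource types you actually manage.
SENSITIVE_TYPE_PREFIXES = (
    "aws_iam_",
    "aws_kms_",
    "aws_db_",
    "aws_rds_",
    "aws_dynamodb_",
    "aws_s3_",
)


def requires_manual_approval(resource_changes: list[dict]) -> bool:
    """Return True if any planned change is destructive or touches a sensitive type."""
    for rc in resource_changes:
        actions = rc.get("change", {}).get("actions", [])
        # "delete" appears both for destroys and for replacements.
        if "delete" in actions:
            return True
        touches_sensitive = rc.get("type", "").startswith(SENSITIVE_TYPE_PREFIXES)
        if touches_sensitive and actions != ["no-op"]:
            return True
    return False
```

Running the gate before the model is consulted means a risky plan is routed to a human regardless of what the model would have chosen.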
The key design choice never changes: Terraform remains the source of truth, and the LLM remains a bounded decision layer. That is how you get infrastructure that feels self-healing without turning production into an improvisation engine.
Frequently Asked Questions
Can Terraform really be used for self-healing infrastructure?
Why use terraform show -json instead of parsing CLI text?
terraform show -json gives you a machine-readable representation of the saved plan, which is far safer for automation and LLM context building.
Should I let an LLM run terraform apply directly?
Is terraform refresh still the right command for drift workflows?
No. Terraform's documentation describes terraform refresh as deprecated and recommends refresh-only mode on terraform plan or terraform apply instead. That makes state reconciliation explicit and easier to review.