
Terraform Self-Healing Infrastructure [How-To 2026]

Dillip Chowdary
Tech Entrepreneur & Innovator · May 14, 2026 · 9 min read

Bottom Line

The reliable pattern is not “let the model fix prod.” It is “let Terraform define the blast radius, let the LLM classify the incident, and only allow a tiny, audited set of remediation actions.”

Key Takeaways

  • Use terraform plan -detailed-exitcode to detect actionable drift in automation.
  • Parse plans with terraform show -json; never scrape human CLI text.
  • Keep the LLM on a strict allowlist: noop, refresh-only, apply saved plan, or -replace.
  • Use -refresh-only instead of deprecated terraform refresh.
  • Ship redacted plan context to the model and log every decision with the saved plan file.

Terraform already gives you the most important primitive for self-healing: a deterministic diff between declared and real infrastructure. The LLM should not invent infrastructure changes; it should classify a narrow failure signal, choose from a small remediation menu, and hand execution back to Terraform. In this tutorial, you will build that control loop using Terraform v1.14.x, machine-readable plan output, and a policy-bounded remediation agent.

Prerequisites

Before you start

  • Terraform CLI v1.14.0 or newer.
  • An AWS account and credentials already configured for the CLI.
  • Python 3.11+ for the remediation worker.
  • jq for lightweight JSON inspection.
  • A remote Terraform backend for team-safe state locking.
  • An LLM endpoint that supports schema-constrained JSON output.

Bottom Line

A self-healing loop is safe only when the model chooses from predefined actions and Terraform still owns the final diff. Treat the LLM as a classifier, not as an unrestricted operator.

1. Define the healing loop

The architecture is simple: detect drift, serialize the plan, classify the fix, then execute one approved action. That separation matters because Terraform is excellent at reconciliation, while the model is useful for interpreting context like alerts, runbooks, and recent changes.

  1. Run terraform plan -detailed-exitcode -out=tfplan on a schedule or from an alert trigger.
  2. If the exit code is 0, there are no changes and the workflow exits; an exit code of 1 means the plan itself failed and should page a human instead of triggering remediation.
  3. If the exit code is 2, convert the saved plan with terraform show -json tfplan.
  4. Send a redacted summary plus operational context to the model.
  5. Allow the model to return only one of four actions: noop, refresh_state, revert_drift, or replace_instance.
  6. Execute the matching Terraform command and persist logs, plan JSON, and the model decision together.
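The exit-code contract behind steps 1 through 3 is worth encoding explicitly so the orchestrator never confuses "no changes" with "plan failed". A minimal sketch (the function name is illustrative, not part of the Terraform CLI):

```python
# Sketch: map `terraform plan -detailed-exitcode` results to workflow outcomes.
# Per the Terraform CLI docs: 0 = no changes, 1 = error, 2 = changes present.

def classify_plan_exit(code: int) -> str:
    """Translate a -detailed-exitcode result into a workflow decision."""
    if code == 0:
        return "clean"   # nothing to do; the loop exits
    if code == 2:
        return "drift"   # serialize the saved plan and consult the agent
    return "error"       # the plan itself failed; page a human, never auto-heal
```

Keeping this a pure function makes the branch trivially testable before any Terraform command runs.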
Watch out: Do not send full state, secrets, or customer identifiers to the model. Redact first. If you need a quick browser-side scrubber for identifiers and payload samples, the TechBytes Data Masking Tool fits this workflow well.
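A first redaction pass can run entirely in your worker before anything leaves the network. The patterns below are a sketch covering the obvious AWS identifiers (ARNs and 12-digit account IDs); extend the list with your own customer identifiers:

```python
import json
import re

# Sketch: scrub common AWS identifiers from plan context before it reaches an LLM.
# The pattern list is illustrative; add your own customer/tenant identifiers.
PATTERNS = [
    (re.compile(r"arn:aws:[^\"\s]+"), "ARN_REDACTED"),
    (re.compile(r"\b\d{12}\b"), "ACCOUNT_REDACTED"),  # 12-digit AWS account IDs
]

def redact(text: str) -> str:
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text

def redact_payload(payload: dict) -> dict:
    """Round-trip through JSON so nested strings are covered too."""
    return json.loads(redact(json.dumps(payload)))
```

Run `redact_payload` over the plan summary immediately before building the prompt, so the unredacted form never appears in the agent's logs either.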

2. Author recoverable Terraform

For a concrete demo, we will manage a tiny EC2-based app and a security group. The healing use case is straightforward: if someone manually opens 22/tcp on the security group, Terraform detects the drift and the agent chooses to reapply the saved plan.

terraform {
  required_version = ">= 1.14.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

resource "aws_security_group" "app" {
  name        = "tb-self-healing-app"
  description = "App ingress"

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_instance" "app" {
  ami                    = "ami-xxxxxxxxxxxxxxxxx" # placeholder: substitute a current AMI ID for your region
  instance_type          = "t3.micro"
  vpc_security_group_ids = [aws_security_group.app.id]

  tags = {
    Name = "tb-self-healing-app"
  }
}

output "instance_id" {
  value = aws_instance.app.id
}

Apply it once, then intentionally create drift in the console by adding an SSH ingress rule. The important point is not the EC2 example itself; it is the workflow: the desired state is explicit, drift is externally introduced, and Terraform can prove the exact correction.

3. Build the remediation agent

The agent should read only what it needs: plan JSON, a short alert description, and a hardcoded allowlist. Keep business logic outside the model. The model decides among approved actions; your code validates and executes them.

#!/usr/bin/env bash
set -euo pipefail

terraform init -input=false
set +e
terraform plan -detailed-exitcode -out=tfplan
PLAN_EXIT=$?
set -e

if [ "$PLAN_EXIT" -eq 0 ]; then
  echo '{"status":"clean"}'
  exit 0
fi

if [ "$PLAN_EXIT" -ne 2 ]; then
  echo "plan failed" >&2
  exit 1
fi

terraform show -json tfplan > plan.json
python3 agent.py

The agent itself lives in agent.py, which validates the decision before anything executes:

import json
from pathlib import Path

ALLOWED_ACTIONS = {
    "noop",
    "refresh_state",
    "revert_drift",
    "replace_instance",
}
ALLOWED_REPLACE_TARGETS = {"aws_instance.app"}

plan = json.loads(Path("plan.json").read_text())
changes = [
    {
        "address": rc["address"],
        "actions": rc["change"]["actions"],
    }
    for rc in plan.get("resource_changes", [])
]

prompt_payload = {
    "incident": "Security group drift detected from scheduled plan run.",
    "resource_changes": changes,
    "allowed_actions": sorted(ALLOWED_ACTIONS),
    "replace_targets": sorted(ALLOWED_REPLACE_TARGETS),
    "policy": [
        "Prefer noop if confidence is low.",
        "Use revert_drift for config drift that matches declared Terraform.",
        "Use refresh_state only for accepted out-of-band changes.",
        "Use replace_instance only for the allowed target and only when health is degraded.",
    ],
}

# Replace this with your provider call. Require strict JSON output.
decision = {
    "action": "revert_drift",
    "target": None,
    "reason": "Ingress drift exists on a managed security group.",
}

if decision["action"] not in ALLOWED_ACTIONS:
    raise SystemExit("Rejected: action not allowed")

if decision["action"] == "replace_instance" and decision.get("target") not in ALLOWED_REPLACE_TARGETS:
    raise SystemExit("Rejected: replace target not allowed")

Path("decision.json").write_text(json.dumps(decision, indent=2))
print(json.dumps(decision))

If you use OpenAI, pair the worker with Structured Outputs so the response must match your decision schema. That removes most of the brittle parsing work. If you want to normalize the mixed Python and shell snippets before committing them, the TechBytes Code Formatter is a practical cleanup step.
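As a sketch, the decision contract might look like the following. The schema object mirrors the shape of decision.json above; the stdlib-only validator is a fallback so a malformed response is rejected even when your provider lacks schema-constrained output (exact schema dialect support varies by provider):

```python
# Sketch: a JSON Schema for the agent's decision object, plus a stdlib-only check.
DECISION_SCHEMA = {
    "type": "object",
    "properties": {
        "action": {"enum": ["noop", "refresh_state", "revert_drift", "replace_instance"]},
        "target": {"type": ["string", "null"]},
        "reason": {"type": "string"},
    },
    "required": ["action", "target", "reason"],
    "additionalProperties": False,
}

def is_valid_decision(decision: dict) -> bool:
    """Reject anything that does not match the decision shape exactly."""
    allowed = set(DECISION_SCHEMA["properties"]["action"]["enum"])
    return (
        set(decision) == {"action", "target", "reason"}
        and decision["action"] in allowed
        and (decision["target"] is None or isinstance(decision["target"], str))
        and isinstance(decision["reason"], str)
    )
```

Validate first, then apply the allowlist checks from agent.py; the two layers catch different failure modes.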

4. Run and verify

Now add a tiny executor that maps each approved action to one Terraform command. Note that the model never constructs raw shell commands.

#!/usr/bin/env bash
set -euo pipefail
ACTION=$(jq -r '.action' decision.json)
TARGET=$(jq -r '.target // empty' decision.json)

case "$ACTION" in
  noop)
    echo "No action taken"
    ;;
  refresh_state)
    terraform apply -refresh-only -auto-approve
    ;;
  revert_drift)
    # applying a saved plan needs no approval flag; the plan file itself is the approval
    terraform apply -input=false tfplan
    ;;
  replace_instance)
    terraform apply -replace="$TARGET" -auto-approve
    ;;
  *)
    echo "Rejected unknown action" >&2
    exit 1
    ;;
esac

Verification and expected output

  1. Introduce drift by manually adding an SSH ingress rule.
  2. Run the detector script.
  3. Confirm that decision.json contains revert_drift.
  4. Run the executor and inspect the apply summary.
Terraform will perform the following actions:
  # aws_security_group.app will be updated in-place
  ~ resource "aws_security_group" "app" {
      ...
    }

Plan: 0 to add, 1 to change, 0 to destroy.

Apply complete! Resources: 0 added, 1 changed, 0 destroyed.

A healthy run leaves three useful artifacts: the saved plan file, the JSON decision, and the apply log. Together they make post-incident review much easier than a free-form chatbot transcript.
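Bundling those three artifacts into one record keeps the audit trail in a single place. A sketch (file names match the scripts above; the apply-log name and output field names are assumptions):

```python
import json
import time
from pathlib import Path

# Sketch: fold the saved-plan JSON, the model decision, and the apply log into
# a single audit record for post-incident review.
def write_audit_record(workdir: str = ".", out_name: str = "audit.json") -> Path:
    base = Path(workdir)
    apply_log = base / "apply.log"  # assumed name; tee your apply output here
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "plan": json.loads((base / "plan.json").read_text()),
        "decision": json.loads((base / "decision.json").read_text()),
        "apply_log": apply_log.read_text() if apply_log.exists() else None,
    }
    out = base / out_name
    out.write_text(json.dumps(record, indent=2))
    return out
```

Shipping audit.json to your incident system is usually easier than shipping three loose files.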

Troubleshooting: top 3 issues

  • The workflow keeps choosing refresh_state: your prompt is probably treating out-of-band changes as acceptable by default. Tighten policy text so declared configuration wins unless a human-approved exception exists.
  • The plan says changes exist, but the wrong resource gets targeted: do not let the model choose arbitrary addresses. Validate against a static allowlist such as aws_instance.app.
  • State updates feel risky: avoid deprecated terraform refresh. Use -refresh-only so the run is explicit and reviewable.
Pro tip: Start with low-impact remediations like reverting security-group drift or replacing a stateless node. Leave databases, IAM, and network topology changes behind a manual approval gate.

What's next

Once this pattern works for a single service, expand it carefully rather than making the model more autonomous.

  • Add a policy tier that requires manual approval for destructive plans or changes touching iam, kms, or data stores.
  • Feed in alert context from CloudWatch, Datadog, or PagerDuty so the model can distinguish drift from runtime faults.
  • Store decision records in your incident system and measure false positives, mean time to remediation, and rollback rate.
  • Graduate from single-instance demos to stateless fleets, where -replace is safer and more operationally realistic.
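The manual-approval tier can start as a simple predicate over the plan JSON. A sketch, where the sensitive-prefix list and the blanket delete check are assumptions to tune for your organization:

```python
# Sketch: decide whether a plan must go to a human before the agent may act.
SENSITIVE_PREFIXES = ("aws_iam_", "aws_kms_", "aws_db_", "aws_rds_")

def needs_manual_approval(plan: dict) -> bool:
    """Gate destructive changes and anything touching IAM, KMS, or data stores."""
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions:
            return True  # destructive or replace plans always gate
        if rc.get("type", "").startswith(SENSITIVE_PREFIXES):
            return True  # sensitive resource classes always gate
    return False
```

Call this on plan.json before the model is consulted at all; if it returns True, route the run to your approval workflow and skip the agent entirely.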

The key design choice never changes: Terraform remains the source of truth, and the LLM remains a bounded decision layer. That is how you get infrastructure that feels self-healing without turning production into an improvisation engine.

Frequently Asked Questions

Can Terraform really be used for self-healing infrastructure?
Yes, but mostly as a reconciliation engine, not as a real-time process supervisor. Terraform is strong when the failure mode is drift, missing resources, or an approved replacement path that can be expressed as a plan.
Why use terraform show -json instead of parsing CLI text?
The terminal output is designed for humans and can change formatting over time. terraform show -json gives you a machine-readable representation of the saved plan, which is far safer for automation and LLM context building.
Should I let an LLM run terraform apply directly?
Not directly. Put the model behind a strict schema, an action allowlist, and a command mapper so it can choose only from preapproved remediation paths. The execution layer should still validate targets and keep an audit trail.
Is terraform refresh still the right command for drift workflows?
No. HashiCorp documents terraform refresh as deprecated and recommends refresh-only mode on terraform plan or terraform apply instead. That makes state reconciliation explicit and easier to review.
