AI Engineering

LLM Eval Harness for Daily Content Automation [2026]

Dillip Chowdary
Tech Entrepreneur & Innovator · June 11, 2026 · 7 min read

Bottom Line

A repo-owned eval harness catches factual drift and style regressions before daily AI-generated content reaches production. Start with deterministic checks, then add an LLM judge only for subjective editorial quality.

Key Takeaways

  • Use deterministic checks for dates, prices, versions, headings, and banned phrases.
  • Keep fixtures as contracts, not full word-for-word snapshots.
  • Use structured JSON from an LLM judge for auditable style scoring.
  • Fail CI before publishing when factual or brand-voice regressions appear.

Daily content automation only works when yesterday's generator cannot silently become today's liability. A small LLM eval harness gives editors and engineers a repeatable way to catch factual drift, tone changes, missing citations, and schema breaks before posts ship. This tutorial builds a lightweight Python harness that runs against saved fixtures, checks deterministic rules first, uses an LLM judge only where judgment is needed, and emits a CI-friendly report for every scheduled content run.

Prerequisites

Bottom Line

Own the eval harness in your repo, treat test cases like production content, and fail the build on factual or brand-voice regressions. Do not depend on a hosted eval product that may change faster than your publishing workflow.

Start with a narrow harness that tests the claims your automation is most likely to damage: names, dates, product limits, required disclaimers, and house style. Keep the first suite boring and explicit.

Watch out: OpenAI's hosted Evals platform is documented as becoming read-only on October 31, 2026 and shutting down on November 30, 2026. Use provider eval concepts, but keep this harness portable.

Prerequisites checklist

  • Python 3.11+ installed locally or in CI.
  • An OPENAI_API_KEY environment variable if you use the optional LLM judge.
  • A content generator that can write candidate output to disk as Markdown, HTML, or JSON.
  • A small set of golden briefs, source facts, and approved sample outputs.
  • A private fixture workflow. When fixtures contain customer data, drafts, or analytics, mask them first with TechBytes' Data Masking Tool.

1. Create the fixture set

Each fixture should represent one recurring content job: a daily product roundup, a release note, a newsletter intro, or a social summary. Store the source facts beside the expected style constraints so reviewers can update both in one pull request.

content-evals/
  fixtures/
    ai-roundup.json
    devtools-release.json
  eval_harness.py
  requirements.txt

Create one fixture:

{
  "id": "ai-roundup-2026-06-11",
  "prompt": "Write a 500 word daily AI engineering roundup.",
  "facts": {
    "date": "June 11, 2026",
    "company": "ExampleDB",
    "version": "2.8",
    "pricing": "$29 per seat"
  },
  "style": {
    "tone": "precise, neutral, engineering-led",
    "forbidden_phrases": ["game changer", "revolutionary", "unlock your potential"],
    "required_sections": ["What changed", "Why it matters", "Migration notes"]
  },
  "candidate_path": "outputs/ai-roundup.md"
}

The fixture is intentionally not a snapshot of every word. It is a contract for what must stay true while the generator evolves.
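Because the fixture acts as a contract, it can help to reject malformed fixtures before any content checks run. The following is a minimal sketch, assuming the field names from the sample fixture above. The validate_fixture function and the two key sets are hypothetical names, not part of the harness itself.

```python
# Hypothetical key sets, taken from the sample fixture in this tutorial.
REQUIRED_TOP_LEVEL = {"id", "prompt", "facts", "style", "candidate_path"}
REQUIRED_STYLE_KEYS = {"tone", "forbidden_phrases", "required_sections"}


def validate_fixture(fixture):
    """Return a list of problems; an empty list means the fixture is usable."""
    problems = []

    # Report every missing top-level key, sorted for stable output.
    missing = REQUIRED_TOP_LEVEL - set(fixture)
    problems += [f"missing key: {key}" for key in sorted(missing)]

    # Only inspect the style block when it is present and is a JSON object.
    style = fixture.get("style")
    if isinstance(style, dict):
        missing_style = REQUIRED_STYLE_KEYS - set(style)
        problems += [f"missing style key: {key}" for key in sorted(missing_style)]
    elif style is not None:
        problems.append("style must be an object")

    # An empty facts block would make the factual checks pass vacuously.
    if "facts" in fixture and not fixture["facts"]:
        problems.append("facts must contain at least one entry")

    return problems
```

Running this before the content checks keeps a typo in a fixture from looking like a passing eval.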

2. Add deterministic checks

Run cheap checks before calling an LLM judge. Deterministic checks are faster, easier to debug, and better for hard requirements.

import json
import re
from pathlib import Path


def load_fixture(path):
    """Read one fixture file and return it as a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def check_required_facts(text, facts):
    """Fail when a fact value does not appear verbatim in the candidate."""
    failures = []
    for key, value in facts.items():
        if str(value) not in text:
            failures.append(f"missing fact {key}: {value}")
    return failures


def check_forbidden_phrases(text, phrases):
    """Fail on any banned phrase, ignoring case."""
    lowered = text.lower()
    return [f"forbidden phrase: {p}" for p in phrases if p.lower() in lowered]


def check_required_sections(text, sections):
    """Fail when a required Markdown heading (any level) is missing."""
    failures = []
    for section in sections:
        # Match lines such as "## What changed" anywhere in the document.
        pattern = rf"^#+\s+{re.escape(section)}"
        if not re.search(pattern, text, re.IGNORECASE | re.MULTILINE):
            failures.append(f"missing section: {section}")
    return failures

Use deterministic checks for:

  • Exact dates, version numbers, prices, legal notices, and product names.
  • Required headings or JSON keys.
  • Forbidden phrases, unsupported claims, and banned comparative language.
  • Length limits, link counts, and required source attribution patterns.
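The last bullet can be covered with a few more small functions. This is a minimal sketch, assuming candidates are Markdown and that links use the inline [text](url) form. The names check_word_count and check_link_count, and the limits passed to them, are hypothetical and not part of the harness.

```python
import re

# Assumption: links are written as inline Markdown, e.g. [docs](https://example.com).
LINK_PATTERN = re.compile(r"\[[^\]]+\]\((https?://[^)\s]+)\)")


def check_word_count(text, max_words):
    """Fail when the candidate exceeds a whitespace-delimited word limit."""
    count = len(text.split())
    if count > max_words:
        return [f"too long: {count} words (limit {max_words})"]
    return []


def check_link_count(text, min_links, max_links):
    """Fail when the number of inline links falls outside the allowed range."""
    links = LINK_PATTERN.findall(text)
    if not min_links <= len(links) <= max_links:
        return [f"link count {len(links)} outside {min_links}-{max_links}"]
    return []
```

Both functions return a list of failure strings, so they can be appended to the same failures list as the other deterministic checks.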

3. Add an LLM judge for style

Style is harder to grade with regular expressions. Use an LLM judge for subjective checks, but force the judge to return structured JSON so the harness can compare scores and reasons consistently. OpenAI's Structured Outputs documentation describes JSON Schema-constrained responses, which is the right shape for this job. The sample uses gpt-5-mini, a documented lower-latency model option suitable for this kind of focused review.

import json

from openai import OpenAI

JUDGE_MODEL = "gpt-5-mini"

# JSON Schema the judge must follow; strict mode rejects extra keys.
STYLE_SCHEMA = {
    "type": "object",
    "properties": {
        "passes": {"type": "boolean"},
        "score": {"type": "number", "minimum": 0, "maximum": 1},
        "reason": {"type": "string"}
    },
    "required": ["passes", "score", "reason"],
    "additionalProperties": False
}

_client = None


def get_client():
    # Create the client lazily so runs without --judge do not need OPENAI_API_KEY.
    global _client
    if _client is None:
        _client = OpenAI()
    return _client


def judge_style(text, style):
    response = get_client().responses.create(
        model=JUDGE_MODEL,
        input=[{
            "role": "user",
            "content": "Evaluate whether this article follows the requested style.\n"
                       f"Style: {json.dumps(style)}\n\nArticle:\n{text}"
        }],
        text={
            "format": {
                "type": "json_schema",
                "name": "style_eval",
                "schema": STYLE_SCHEMA,
                "strict": True
            }
        }
    )
    # output_text holds the schema-constrained JSON string.
    return json.loads(response.output_text)

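Treat the judge_style function as a reviewer, not an oracle. Keep its decision auditable by storing the prompt, model name, score, and reason in the report. The sample returns only the judge's JSON, so record the prompt and model name alongside it when you build the report.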
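One way to make a judge decision auditable is to bundle it with the inputs that produced it. The sketch below assumes the judge result has the passes, score, and reason keys from the schema above. The build_audit_record function and its field names are hypothetical, not something the harness or the OpenAI SDK provides.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_audit_record(prompt, model, judge_result):
    """Bundle what a reviewer needs to reproduce or question a judge decision."""
    return {
        "model": model,
        # The digest makes it cheap to spot when the judge prompt changed.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt": prompt,
        "passes": judge_result["passes"],
        "score": judge_result["score"],
        "reason": judge_result["reason"],
        # UTC timestamp so records from different CI runners sort consistently.
        "judged_at": datetime.now(timezone.utc).isoformat(),
    }


record = build_audit_record(
    prompt="Evaluate whether this article follows the requested style.",
    model="gpt-5-mini",
    judge_result={"passes": True, "score": 0.91, "reason": "Neutral and specific."},
)
print(json.dumps(record, indent=2))
```

Storing the full prompt costs a little space, but it lets an editor see exactly what the judge was asked when a score looks wrong.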

4. Wire the runner and verify output

The runner loads each fixture, reads the candidate output, executes deterministic checks, optionally runs the judge, and exits non-zero when the content fails. That makes it usable in cron, GitHub Actions, Buildkite, or any other daily automation.

import argparse
import sys

# Minimum judge score required to pass; tune per publication.
STYLE_THRESHOLD = 0.85


def run_fixture(path, use_judge):
    fixture = load_fixture(path)
    candidate = Path(fixture["candidate_path"])
    if not candidate.exists():
        # Report a missing candidate as a failure instead of crashing the run.
        return {
            "id": fixture["id"],
            "passed": False,
            "failures": [f"candidate not found: {candidate}"],
            "judge": None
        }
    text = candidate.read_text(encoding="utf-8")

    # Cheap deterministic checks run first.
    failures = []
    failures += check_required_facts(text, fixture["facts"])
    failures += check_forbidden_phrases(text, fixture["style"]["forbidden_phrases"])
    failures += check_required_sections(text, fixture["style"]["required_sections"])

    # Only pay for the judge when the hard requirements already pass.
    judge_result = None
    if use_judge and not failures:
        judge_result = judge_style(text, fixture["style"])
        if not judge_result["passes"] or judge_result["score"] < STYLE_THRESHOLD:
            failures.append(f"style judge failed: {judge_result['reason']}")

    return {
        "id": fixture["id"],
        "passed": not failures,
        "failures": failures,
        "judge": judge_result
    }


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("fixtures", nargs="+")
    parser.add_argument("--judge", action="store_true")
    args = parser.parse_args()

    results = [run_fixture(path, args.judge) for path in args.fixtures]
    print(json.dumps({"results": results}, indent=2))
    # A non-zero exit status fails the CI job.
    return 0 if all(r["passed"] for r in results) else 1


if __name__ == "__main__":
    sys.exit(main())

Run it locally with the --judge flag:

python eval_harness.py fixtures/ai-roundup.json --judge

Verification and expected output

A passing run should produce a compact JSON report and exit with status 0:

{
  "results": [
    {
      "id": "ai-roundup-2026-06-11",
      "passed": true,
      "failures": [],
      "judge": {
        "passes": true,
        "score": 0.91,
        "reason": "The article is neutral, specific, and avoids promotional phrasing."
      }
    }
  ]
}

A failing run should be blunt enough for an editor or engineer to fix without reading the whole candidate:

{
  "results": [
    {
      "id": "ai-roundup-2026-06-11",
      "passed": false,
      "failures": [
        "missing fact version: 2.8",
        "forbidden phrase: game changer"
      ],
      "judge": null
    }
  ]
}

Before enabling the harness in CI, intentionally break one fixture and confirm three things:

  • The process exits with a non-zero status.
  • The report names the exact failed rule.
  • The content job stops before publishing or queuing downstream assets.
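The third point depends on how the publish step is wired to the harness. The following sketch shows one possible gate, assuming the harness is invoked as a subprocess and signals failure through its exit status. The publish_if_clean function is a hypothetical name, and the two stand-in commands replace the real harness invocation so the example runs on its own.

```python
import subprocess
import sys


def harness_passed(command):
    """Run the harness command; return (exit status was 0, captured stdout)."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode == 0, completed.stdout


def publish_if_clean(command, publish):
    """Call publish() only when the harness succeeds; return True if published."""
    ok, report = harness_passed(command)
    if not ok:
        # Surface the report so the failed rule is visible in the job log.
        print(report, file=sys.stderr)
        return False
    publish()
    return True


# Stand-in commands (assumptions) so the sketch runs without the real harness.
FAILING = [sys.executable, "-c", "import sys; print('blocked'); sys.exit(1)"]
PASSING = [sys.executable, "-c", "print('clean')"]

published = []
blocked = publish_if_clean(FAILING, lambda: published.append("failing run"))
shipped = publish_if_clean(PASSING, lambda: published.append("passing run"))
```

In a real job the command would be the harness invocation shown earlier, and publish would be whatever call queues the post.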

Troubleshooting and what's next

Top 3 troubleshooting issues

  1. The judge is flaky: Lower temperature if your wrapper exposes it, tighten the rubric, and require the judge to cite a specific sentence for every failure.
  2. Fact checks miss paraphrases: Keep exact checks for values that must appear verbatim, then add separate semantic checks only for claims that can be expressed multiple ways.
  3. CI is too slow: Run deterministic checks on every commit, run the LLM judge on scheduled builds, and cache candidate outputs by fixture hash.
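Caching by fixture hash can be sketched in a few lines. This example assumes the generator is a Python callable that takes a fixture dict and returns text, and that a local directory is an acceptable cache. The names fixture_hash, cached_candidate, and fake_generate are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def fixture_hash(fixture):
    """Stable digest of a fixture dict, independent of key order."""
    canonical = json.dumps(fixture, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cached_candidate(fixture, cache_dir, generate):
    """Return the cached candidate for this fixture, generating it on a miss."""
    path = Path(cache_dir) / f"{fixture_hash(fixture)}.md"
    if path.exists():
        return path.read_text(encoding="utf-8")
    text = generate(fixture)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return text


# Demo with a fake generator that records how often it is called.
calls = []


def fake_generate(fixture):
    calls.append(fixture["id"])
    return "# What changed\n"


with tempfile.TemporaryDirectory() as tmp:
    demo = {"id": "demo", "prompt": "Write a roundup."}
    first = cached_candidate(demo, tmp, fake_generate)
    second = cached_candidate(demo, tmp, fake_generate)
```

Any edit to the fixture changes the hash, so a changed prompt or fact set forces a fresh generation instead of reusing stale output.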

What's next

  • Add severity levels such as blocker, warning, and editorial note.
  • Track pass rate over time so prompt changes show measurable quality impact.
  • Add source freshness checks for volatile facts such as prices, API limits, and release dates.
  • Export reports to your content dashboard so editors can review failures without opening CI logs.
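The first two items can be prototyped without changing the report format much. The sketch below is one possible shape, assuming findings carry a rule name and a severity and that each result has the passed key from the runner's report. Severity, Finding, should_block, and pass_rate are hypothetical names.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    NOTE = 1      # editorial note, informational only
    WARNING = 2   # worth a look, does not stop publishing
    BLOCKER = 3   # must be fixed before the post ships


@dataclass(frozen=True)
class Finding:
    rule: str
    severity: Severity


def should_block(findings):
    """Only blocker-level findings stop the publish step."""
    return any(f.severity >= Severity.BLOCKER for f in findings)


def pass_rate(results):
    """Fraction of fixtures that passed; 0.0 when there are no results."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["passed"]) / len(results)


findings = [
    Finding("forbidden phrase: game changer", Severity.WARNING),
    Finding("missing fact version: 2.8", Severity.BLOCKER),
]
history = [{"passed": True}, {"passed": False}, {"passed": True}, {"passed": True}]
print(should_block(findings), pass_rate(history))
```

Using an ordered enum keeps the blocking rule to a single comparison, which makes it easy to relax or tighten later.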

Frequently Asked Questions

How do I evaluate AI-generated content before publishing?
Build a fixture-based eval harness that runs before the publish step. Use deterministic checks for required facts and structure, then add an LLM judge for style or editorial judgment.
Should LLM evals use exact matches or an LLM judge?
Use exact matches for values that must not change, such as dates, prices, and version numbers. Use an LLM judge for tone, clarity, and usefulness, and require structured output so failures are machine-readable.
What should go into a content automation eval fixture?
Include the prompt, source facts, required sections, forbidden phrases, and the path to the generated candidate. Keep fixtures small enough that an editor can review changes in a pull request.
How do I stop flaky LLM judge results?
Make the rubric narrower, force JSON output, and ask the judge to cite the sentence that caused a failure. For release-critical checks, prefer deterministic rules over subjective grading.
