LLM Eval Harness for Daily Content Automation [2026]
Bottom Line
A repo-owned eval harness catches factual drift and style regressions before daily AI-generated content reaches production. Start with deterministic checks, then add an LLM judge only for subjective editorial quality.
Key Takeaways
- Use deterministic checks for dates, prices, versions, headings, and banned phrases.
- Keep fixtures as contracts, not full word-for-word snapshots.
- Use structured JSON from an LLM judge for auditable style scoring.
- Fail CI before publishing when factual or brand-voice regressions appear.
Daily content automation only works when yesterday's generator cannot silently become today's liability. A small LLM eval harness gives editors and engineers a repeatable way to catch factual drift, tone changes, missing citations, and schema breaks before posts ship. This tutorial builds a lightweight Python harness that runs against saved fixtures, checks deterministic rules first, uses an LLM judge only where judgment is needed, and emits a CI-friendly report for every scheduled content run.
Prerequisites
Bottom Line
Own the eval harness in your repo, treat test cases like production content, and fail the build on factual or brand-voice regressions. Do not depend on a hosted eval product that may change faster than your publishing workflow.
Start with a narrow harness that tests the claims your automation is most likely to damage: names, dates, product limits, required disclaimers, and house style. Keep the first suite boring and explicit.
Prerequisites box
- Python 3.11+ installed locally or in CI.
- An OPENAI_API_KEY environment variable if you use the optional LLM judge.
- A content generator that can write candidate output to disk as Markdown, HTML, or JSON.
- A small set of golden briefs, source facts, and approved sample outputs.
- A private fixture workflow. When fixtures contain customers, drafts, or analytics, mask them first with TechBytes' Data Masking Tool.
1. Create the fixture set
Each fixture should represent one recurring content job: a daily product roundup, a release note, a newsletter intro, or a social summary. Store the source facts beside the expected style constraints so reviewers can update both in one pull request.
content-evals/
  fixtures/
    ai-roundup.json
    devtools-release.json
  eval_harness.py
  requirements.txt
Create one fixture:
{
  "id": "ai-roundup-2026-06-11",
  "prompt": "Write a 500 word daily AI engineering roundup.",
  "facts": {
    "date": "June 11, 2026",
    "company": "ExampleDB",
    "version": "2.8",
    "pricing": "$29 per seat"
  },
  "style": {
    "tone": "precise, neutral, engineering-led",
    "forbidden_phrases": ["game changer", "revolutionary", "unlock your potential"],
    "required_sections": ["What changed", "Why it matters", "Migration notes"]
  },
  "candidate_path": "outputs/ai-roundup.md"
}
The fixture is intentionally not a snapshot of every word. It is a contract for what must stay true while the generator evolves.
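Because the fixture is a contract, it can help to reject a malformed one before any check runs. The following is a minimal sketch, not part of the harness itself: `validate_fixture` and the two key tuples are hypothetical names, and the required keys are assumed from the sample fixture above and from the keys the runner reads.

```python
# Hypothetical helper, not part of the tutorial's harness.
# Required keys are assumed from the sample fixture shown above.
REQUIRED_KEYS = ("id", "prompt", "facts", "style", "candidate_path")
REQUIRED_STYLE_KEYS = ("forbidden_phrases", "required_sections")


def validate_fixture(fixture):
    """Return a list of problems; an empty list means the contract is usable."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in fixture]

    # Facts drive the exact-match checks, so an empty object is treated as an error.
    facts = fixture.get("facts")
    if facts is not None and not (isinstance(facts, dict) and facts):
        problems.append("facts must be a non-empty object")

    # Only the style keys that deterministic checks index directly are required.
    style = fixture.get("style")
    if isinstance(style, dict):
        for key in REQUIRED_STYLE_KEYS:
            if key not in style:
                problems.append(f"missing style key: {key}")
    elif style is not None:
        problems.append("style must be an object")

    return problems
```

Running this in the same pull request check that reviews fixture edits would surface a missing key as a readable message rather than a `KeyError` during the daily run.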
2. Add deterministic checks
Run cheap checks before calling an LLM judge. Deterministic checks are faster, easier to debug, and better for hard requirements.
import json
import re
from pathlib import Path


def load_fixture(path):
    return json.loads(Path(path).read_text(encoding="utf-8"))


def check_required_facts(text, facts):
    # Every fact value must appear verbatim in the candidate text.
    failures = []
    for key, value in facts.items():
        if str(value) not in text:
            failures.append(f"missing fact {key}: {value}")
    return failures


def check_forbidden_phrases(text, phrases):
    # Case-insensitive substring match.
    lowered = text.lower()
    return [f"forbidden phrase: {p}" for p in phrases if p.lower() in lowered]


def check_required_sections(text, sections):
    # Each section must appear as a Markdown heading at the start of a line.
    failures = []
    for section in sections:
        pattern = rf"(^|\n)#+\s+{re.escape(section)}"
        if not re.search(pattern, text, re.IGNORECASE):
            failures.append(f"missing section: {section}")
    return failures
Use deterministic checks for:
- Exact dates, version numbers, prices, legal notices, and product names.
- Required headings or JSON keys.
- Forbidden phrases, unsupported claims, and banned comparative language.
- Length limits, link counts, and required source attribution patterns.
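The last item in the list above, length limits and link counts, can follow the same pattern as the other checks: return a list of failure strings. This is a rough sketch under assumptions. The function names and the thresholds passed to them are hypothetical, and the link pattern only recognizes Markdown inline links, so HTML candidates would need a different pattern.

```python
import re

# Matches Markdown inline links such as [text](https://example.com).
# Assumption: candidates are Markdown; HTML output would need another pattern.
MARKDOWN_LINK = re.compile(r"\[[^\]]+\]\((https?://[^)\s]+)\)")


def check_word_count(text, min_words, max_words):
    """Fail when the whitespace-delimited word count leaves the allowed range."""
    count = len(text.split())
    if count < min_words or count > max_words:
        return [f"word count {count} outside {min_words}-{max_words}"]
    return []


def check_link_count(text, min_links):
    """Fail when the candidate has fewer inline links than required."""
    count = len(MARKDOWN_LINK.findall(text))
    if count < min_links:
        return [f"found {count} links, expected at least {min_links}"]
    return []
```

The limits themselves could live in the fixture's style object so that reviewers change them in the same pull request as the facts.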
3. Add an LLM judge for style
Style is harder to grade with regular expressions. Use an LLM judge for subjective checks, but force the judge to return structured JSON so the harness can compare scores and reasons consistently. OpenAI's Structured Outputs documentation describes JSON Schema-constrained responses, which is the right shape for this job. The sample uses gpt-5-mini, a documented lower-latency model option suitable for this kind of focused review.
from openai import OpenAI

# Reads OPENAI_API_KEY from the environment.
client = OpenAI()

STYLE_SCHEMA = {
    "type": "object",
    "properties": {
        "passes": {"type": "boolean"},
        "score": {"type": "number", "minimum": 0, "maximum": 1},
        "reason": {"type": "string"}
    },
    "required": ["passes", "score", "reason"],
    "additionalProperties": False
}


def judge_style(text, style):
    # Uses the json module imported earlier in eval_harness.py.
    response = client.responses.create(
        model="gpt-5-mini",
        input=[{
            "role": "user",
            "content": "Evaluate whether this article follows the requested style.\n"
                       f"Style: {json.dumps(style)}\n\nArticle:\n{text}"
        }],
        text={
            "format": {
                "type": "json_schema",
                "name": "style_eval",
                "schema": STYLE_SCHEMA,
                "strict": True
            }
        }
    )
    return json.loads(response.output_text)
Treat the judge_style function as a reviewer, not an oracle. Keep its decision auditable by storing the prompt, model name, score, and reason in the report.
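One way to make the judge auditable is to bundle those fields into a single record per decision. The sketch below is illustrative only: `build_judge_audit` is a hypothetical helper, and it assumes the caller has the exact prompt string that was sent to the judge. The hash and timestamp fields are additions for traceability, not something the harness above produces.

```python
import hashlib
from datetime import datetime, timezone


def build_judge_audit(prompt, model, judge_result):
    """Bundle what a reviewer needs to reproduce or question a judge decision."""
    return {
        "model": model,
        # The hash makes it cheap to spot when the judge prompt changed between runs.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt": prompt,
        "passes": judge_result["passes"],
        "score": judge_result["score"],
        "reason": judge_result["reason"],
        # Assumption: a UTC timestamp is useful when comparing daily runs.
        "judged_at": datetime.now(timezone.utc).isoformat(),
    }
```

The returned dictionary contains only strings, numbers, and booleans, so it can be placed in the report under the existing judge key and serialized with `json.dumps`.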
4. Wire the runner and verify output
The runner loads each fixture, reads the candidate output, executes deterministic checks, optionally runs the judge, and exits non-zero when the content fails. That makes it usable in cron, GitHub Actions, Buildkite, or any other daily automation.
import argparse
import sys


def run_fixture(path, use_judge):
    fixture = load_fixture(path)
    text = Path(fixture["candidate_path"]).read_text()

    failures = []
    failures += check_required_facts(text, fixture["facts"])
    failures += check_forbidden_phrases(text, fixture["style"]["forbidden_phrases"])
    failures += check_required_sections(text, fixture["style"]["required_sections"])

    judge_result = None
    if use_judge and not failures:
        judge_result = judge_style(text, fixture["style"])
        if not judge_result["passes"] or judge_result["score"] < 0.85:
            failures.append(f"style judge failed: {judge_result['reason']}")

    return {
        "id": fixture["id"],
        "passed": not failures,
        "failures": failures,
        "judge": judge_result
    }


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("fixtures", nargs="+")
    parser.add_argument("--judge", action="store_true")
    args = parser.parse_args()

    results = [run_fixture(path, args.judge) for path in args.fixtures]
    print(json.dumps({"results": results}, indent=2))
    return 0 if all(r["passed"] for r in results) else 1


if __name__ == "__main__":
    sys.exit(main())
Run it locally with the --judge flag:
python eval_harness.py fixtures/ai-roundup.json --judge
Verification and expected output
A passing run should produce a compact JSON report and exit with status 0:
{
  "results": [
    {
      "id": "ai-roundup-2026-06-11",
      "passed": true,
      "failures": [],
      "judge": {
        "passes": true,
        "score": 0.91,
        "reason": "The article is neutral, specific, and avoids promotional phrasing."
      }
    }
  ]
}
A failing run should be blunt enough for an editor or engineer to fix without reading the whole candidate:
{
  "results": [
    {
      "id": "ai-roundup-2026-06-11",
      "passed": false,
      "failures": [
        "missing fact version: 2.8",
        "forbidden phrase: game changer"
      ],
      "judge": null
    }
  ]
}
Before enabling the harness in CI, intentionally break one fixture and confirm three things:
- The process exits with a non-zero status.
- The report names the exact failed rule.
- The content job stops before publishing or queuing downstream assets.
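The first two confirmations in that list can be scripted. The following is a sketch, assuming the harness prints its JSON report to stdout as shown earlier. `run_and_inspect` is a hypothetical helper, and the broken fixture path in the comment is a placeholder for whichever fixture you break on purpose.

```python
import json
import subprocess
import sys

# Assumption: the harness is started with the current interpreter from the repo root.
HARNESS_CMD = [sys.executable, "eval_harness.py"]


def run_and_inspect(command):
    """Run a harness command; return (exit_code, failure messages from the report)."""
    completed = subprocess.run(command, capture_output=True, text=True)
    report = json.loads(completed.stdout)
    failures = [msg for result in report["results"] for msg in result["failures"]]
    return completed.returncode, failures


# Example call against a deliberately broken fixture (hypothetical path):
# code, failures = run_and_inspect(HARNESS_CMD + ["fixtures/broken.json"])
# A non-zero code and a failure message naming the broken rule are what you want to see.
```

The third confirmation, that publishing actually stops, depends on how your scheduler reacts to a non-zero exit status, so it still needs a manual check in the real pipeline.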
Troubleshooting and what's next
Top 3 troubleshooting issues
- The judge is flaky: Lower temperature if your wrapper exposes it, tighten the rubric, and require the judge to cite a specific sentence for every failure.
- Fact checks miss paraphrases: Keep exact checks for values that must appear verbatim, then add separate semantic checks only for claims that can be expressed multiple ways.
- CI is too slow: Run deterministic checks on every commit, run the LLM judge on scheduled builds, and cache candidate outputs by fixture hash.
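For the caching suggestion in the last item, one possible approach is to derive a stable hash from the fixture contents and use it in the cache file name. This is a sketch under assumptions: `fixture_hash` and `cached_candidate_path` are hypothetical helpers, and the file naming scheme is only an illustration.

```python
import hashlib
import json
from pathlib import Path


def fixture_hash(fixture):
    """Stable SHA-256 over the fixture contents, independent of key order."""
    canonical = json.dumps(fixture, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cached_candidate_path(cache_dir, fixture):
    """Cache file name that changes whenever any part of the fixture changes."""
    return Path(cache_dir) / f"{fixture['id']}-{fixture_hash(fixture)[:16]}.md"
```

Because the hash covers the prompt, facts, and style together, editing any of them produces a new cache path and forces the generator to run again.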
What's next
- Add severity levels such as blocker, warning, and editorial note.
- Track pass rate over time so prompt changes show measurable quality impact.
- Add source freshness checks for volatile facts such as prices, API limits, and release dates.
- Export reports to your content dashboard so editors can review failures without opening CI logs.
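As a starting point for the severity levels mentioned in the first item, findings could carry a level, with only blockers failing the build. This is a sketch, not a design the harness prescribes: the `Severity` enum and `exit_code` function are hypothetical names, and the policy that warnings and editorial notes do not fail the build is an assumption.

```python
from enum import Enum


class Severity(Enum):
    BLOCKER = "blocker"
    WARNING = "warning"
    EDITORIAL_NOTE = "editorial note"


def exit_code(findings):
    """findings is a list of (Severity, message) pairs; only blockers fail the build."""
    return 1 if any(severity is Severity.BLOCKER for severity, _ in findings) else 0
```

Under this policy a missing fact would be reported as a blocker, while a borderline style score could be downgraded to a warning that editors review later.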