Agent Observability Checklist: Traces, Logs, Replay
Bottom Line
Instrument the agent loop as a workflow, not a transcript. Correlate spans, tool logs, cost counters, and replay-safe fixtures before production traffic scales.
Key Takeaways
- One root span per agent run keeps traces readable and replayable.
- Tool logs need status, duration, error class, argument hash, and side-effect marker.
- Cost meters should include model, tool, retry, and total workflow spend.
- Replay bundles must preserve versions and hashes without storing raw secrets.
- GenAI semantic conventions are useful, but pin fields while they remain development-stage.
Production agents need more than a trace ID. They need span context for every reasoning step, structured tool logs, cost meters that survive retries, and replay fixtures that let engineers reproduce failures without exposing user data. This checklist is a breadth-first reference for instrumenting agent systems in June 2026, using OpenTelemetry-compatible signals where possible and keeping vendor-specific tracing as an optional layer.
Observability Checklist
Bottom Line
Instrument the agent loop as a workflow, not a chat transcript. The minimum viable setup is correlated traces, sanitized tool logs, token and money counters, and replayable failure bundles.
Minimum signal set
- Trace one root span per user task, then child spans for planning, retrieval, model calls, tool calls, validation, and final response.
- Log every tool invocation with stable tool name, arguments hash, status, duration, error class, and side-effect marker.
- Meter input tokens, output tokens, cached tokens, tool spend, retries, and total cost per completed workflow.
- Attach replay metadata: prompt template version, model identifier, retrieval corpus version, feature flags, and tool contract version.
- Keep raw prompts, tool arguments, and outputs opt-in; sanitize them before long-lived storage with a privacy pass such as the TechBytes Data Masking Tool.
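As a rough illustration of the tool-log item above, the sketch below wraps a tool call and emits a record with the listed fields while keeping raw arguments out of the log. The helper names (`hash_arguments`, `call_tool_logged`) and the `lookup_order` tool are hypothetical, and the record layout is an assumption rather than a standard schema.

```python
import hashlib
import json
import time


def hash_arguments(arguments: dict) -> str:
    """Hash canonical JSON so equal arguments always produce the same digest."""
    canonical = json.dumps(arguments, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def call_tool_logged(tool_name: str, tool, arguments: dict, side_effect: bool):
    """Run a tool and return (result, log_record); the record holds no raw arguments."""
    record = {
        "tool_name": tool_name,
        "arguments_hash": hash_arguments(arguments),
        "side_effect": side_effect,
        "status": "ok",
        "error_class": None,
    }
    result = None
    started = time.perf_counter()
    try:
        result = tool(**arguments)
    except Exception as exc:
        # Record the exception class only; messages can echo user data.
        record["status"] = "error"
        record["error_class"] = type(exc).__name__
    record["duration_ms"] = round((time.perf_counter() - started) * 1000, 3)
    return result, record


def lookup_order(order_id: str) -> dict:
    # Hypothetical read-only tool used only for this example.
    if not order_id.startswith("ord-"):
        raise ValueError("malformed order id")
    return {"order_id": order_id, "state": "shipped"}


ok_result, ok_record = call_tool_logged(
    "lookup_order", lookup_order, {"order_id": "ord-42"}, side_effect=False
)
bad_result, bad_record = call_tool_logged(
    "lookup_order", lookup_order, {"order_id": "42"}, side_effect=False
)
print(json.dumps(ok_record))
print(json.dumps(bad_record))
```

A plain hash is enough for grouping identical calls, but low-entropy arguments can be recovered by guessing, so a keyed hash (for example `hmac` with a secret) may be the safer choice where that matters.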
Span naming checklist
- agent.run: root span for one user-visible task.
- agent.plan: model or policy step that chooses the next action.
- agent.tool: wrapper span around each external or internal tool call.
- agent.validate: deterministic checks, guardrails, schema validation, or policy review.
- agent.repair: retry or self-correction loop with attempt number and stop reason.
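To make the span hierarchy concrete, here is a minimal sketch that records the names above as children of one root span. `RecordingTracer` is a hypothetical in-memory stand-in written for this example, not the OpenTelemetry SDK; a real setup would use the tracer API of whichever SDK you run. The attribute values are illustrative.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional
import itertools


@dataclass
class Span:
    name: str
    span_id: int
    parent_id: Optional[int]
    attributes: dict = field(default_factory=dict)


class RecordingTracer:
    """In-memory stand-in for a tracer; it only records parent/child links."""

    def __init__(self) -> None:
        self.spans: list[Span] = []
        self._stack: list[Span] = []
        self._ids = itertools.count(1)

    @contextmanager
    def span(self, name: str, **attributes):
        # The innermost open span, if any, becomes the parent.
        parent_id = self._stack[-1].span_id if self._stack else None
        current = Span(name, next(self._ids), parent_id, dict(attributes))
        self.spans.append(current)
        self._stack.append(current)
        try:
            yield current
        finally:
            self._stack.pop()


tracer = RecordingTracer()
with tracer.span("agent.run", run_id="run-0187") as root:
    with tracer.span("agent.plan"):
        pass
    with tracer.span("agent.tool", tool_name="lookup_order"):
        pass
    with tracer.span("agent.validate"):
        pass
    with tracer.span("agent.repair", attempt=1, stop_reason="schema_valid"):
        pass

for s in tracer.spans:
    print(s.span_id, s.parent_id, s.name, s.attributes)
```

Every child span carries the root's ID as its parent, which is what lets a backend render the whole task as one tree.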
Live Search JS Filter
Use this tiny filter on internal runbooks, incident notes, or trace field catalogs. It is intentionally dependency-free so it works inside docs sites, static pages, and local replay reports.
<input id='obs-filter' type='search' placeholder='Filter checklist...' aria-label='Filter checklist'>
<ul id='obs-list'>
<li data-tags='trace span root agent'>Create one root span per agent run</li>
<li data-tags='tool log retry error'>Log tool name, duration, status, and error class</li>
<li data-tags='cost token meter budget'>Meter token cost and tool cost per workflow</li>
<li data-tags='replay fixture prompt model'>Store replay-safe prompt and config versions</li>
</ul>
<script>
  // Filter checklist rows by visible text or data-tags as the user types.
  const filter = document.querySelector('#obs-filter');
  const rows = [...document.querySelectorAll('#obs-list li')];
  filter.addEventListener('input', () => {
    const q = filter.value.trim().toLowerCase();
    rows.forEach((row) => {
      const text = `${row.textContent} ${row.dataset.tags}`.toLowerCase();
      // Always assign a boolean: show every row when the query is empty.
      row.hidden = q !== '' && !text.includes(q);
    });
  });
</script>
Keyboard shortcuts for review mode
| Shortcut | Action | Use when |
|---|---|---|
| / | Focus the trace or checklist filter | Jumping between fields during incident review |
| j | Move to next span, log row, or replay step | Scanning a long agent run chronologically |
| k | Move to previous span, log row, or replay step | Backtracking after a failed tool call |
| r | Open the replay panel for the selected failure | Reproducing a planner, retrieval, or validation miss |
| c | Copy trace ID, run ID, or replay fixture path | Sharing evidence in an issue or incident channel |
| Esc | Clear search or close the active panel | Returning to the full run timeline |
Commands Grouped By Purpose
Start local OTLP collection
The OpenTelemetry Collector accepts OTLP traffic and can fan out traces, metrics, and logs to your backend. Replace the image tag with the version your platform has approved.
docker run --rm \
  -p 4317:4317 \
  -p 4318:4318 \
  -v "$PWD/otel-collector.yaml:/etc/otelcol/config.yaml" \
  otel/opentelemetry-collector:latest \
  --config=/etc/otelcol/config.yaml
Export agent telemetry settings
export OTEL_SERVICE_NAME='checkout-agent'
export OTEL_EXPORTER_OTLP_ENDPOINT='http://localhost:4318'
export OTEL_EXPORTER_OTLP_PROTOCOL='http/protobuf'
export OTEL_RESOURCE_ATTRIBUTES='deployment.environment=dev,service.namespace=agents'
Capture a replay bundle
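mkdir -p replay/failed-runs
cp traces/run-0187.json replay/failed-runs/
cp configs/agent.runtime.json replay/failed-runs/
cp fixtures/retrieval-snapshot.json replay/failed-runs/
# Hash only the JSON files, from inside the bundle, so the manifest never lists itself
# and `sha256sum -c MANIFEST.sha256` works wherever the bundle is copied.
(cd replay/failed-runs && sha256sum *.json > MANIFEST.sha256)
Inspect a run quickly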
jq '.spans[] | {name, status, duration_ms, tool: .attributes.tool_name}' traces/run-0187.json
jq '.cost | {input_tokens, output_tokens, tool_usd, total_usd}' traces/run-0187.json
Configuration
Collector pipeline
This skeleton keeps the shape simple: receive OTLP, batch records, and export to a backend. Add processors for redaction, sampling, and routing before production.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch: {}
exporters:
  debug: {}
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
Agent run schema
{
  "run_id": "run-0187",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "user_task_hash": "sha256:...",
  "agent_version": "2026-06-04.1",
  "model": "provider-model-id",
  "prompt_template_version": "support-router:v12",
  "retrieval_snapshot": "kb-index:2026-06-04",
  "cost": {
    "input_tokens": 1840,
    "output_tokens": 612,
    "tool_usd": 0.03,
    "total_usd": 0.08
  }
}
Cost meter fields
- input_tokens: tokens sent to the model after prompt assembly and retrieval.
- output_tokens: tokens returned by the model before post-processing.
- cached_tokens: provider-reported cache hits when available.
- tool_usd: billable cost from search, browser, database, code execution, or third-party APIs.
- retry_usd: incremental cost caused by repair loops, validation failures, or transient tool errors.
- total_usd: workflow-level total, not just the final model call.
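One way to keep these fields consistent across retries is a small accumulator, sketched below. `CostMeter` and its method names are hypothetical. The sketch assumes that prices arrive already converted to integer micro-dollars and that `retry_usd` is the portion of `total_usd` spent on retry attempts, not an extra amount added on top.

```python
from dataclasses import dataclass

MICRO = 1_000_000  # micro-dollars per dollar


@dataclass
class CostMeter:
    """Accumulates workflow-level cost; money is held as integer micro-dollars."""
    input_tokens: int = 0
    output_tokens: int = 0
    cached_tokens: int = 0
    model_micro_usd: int = 0
    tool_micro_usd: int = 0
    retry_micro_usd: int = 0  # assumed to be a subset of model + tool spend

    def add_model_call(self, input_tokens: int, output_tokens: int, micro_usd: int,
                       cached_tokens: int = 0, is_retry: bool = False) -> None:
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens
        self.cached_tokens += cached_tokens
        self.model_micro_usd += micro_usd
        if is_retry:
            self.retry_micro_usd += micro_usd

    def add_tool_call(self, micro_usd: int, is_retry: bool = False) -> None:
        self.tool_micro_usd += micro_usd
        if is_retry:
            self.retry_micro_usd += micro_usd

    def snapshot(self) -> dict:
        # The total covers every call in the workflow, not only the last one.
        total = self.model_micro_usd + self.tool_micro_usd
        return {
            "input_tokens": self.input_tokens,
            "output_tokens": self.output_tokens,
            "cached_tokens": self.cached_tokens,
            "tool_usd": self.tool_micro_usd / MICRO,
            "retry_usd": self.retry_micro_usd / MICRO,
            "total_usd": total / MICRO,
        }


meter = CostMeter()
meter.add_model_call(input_tokens=1840, output_tokens=612, micro_usd=50_000)
meter.add_tool_call(micro_usd=30_000)
first = meter.snapshot()

# A repair loop triggers one more model call, flagged as a retry.
meter.add_model_call(input_tokens=200, output_tokens=40, micro_usd=10_000, is_retry=True)
second = meter.snapshot()
print(first)
print(second)
```

Integer micro-dollars avoid float drift while accumulating; conversion to dollars happens only when the snapshot is taken.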
Advanced Usage
Failure replay workflow
- Start from the failed root span and confirm the user-visible failure, not just an internal warning.
- Load the replay bundle with sanitized prompt inputs, prompt template version, model settings, feature flags, and retrieval snapshot.
- Run tools in dry-run mode when side effects are possible.
- Compare original and replay spans by status, duration, selected action, validation result, and cost delta.
- Promote the failure into a regression fixture when it represents a product risk, not a one-off outage.
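The comparison step in the workflow above can be sketched as a field-by-field diff between the original and replayed spans. The span dictionaries, field names, and the `diff_runs` and `cost_delta_usd` helpers are assumptions made for this example. Durations are left out of the equality check because they normally differ between runs and are better compared against a tolerance.

```python
def diff_runs(original, replay, fields=("status", "action", "validation")):
    """Pair spans by position; report changed fields and stop where the runs diverge."""
    diffs = []
    for index, (before, after) in enumerate(zip(original, replay)):
        if before["name"] != after["name"]:
            diffs.append({"index": index, "field": "name",
                          "original": before["name"], "replay": after["name"]})
            return diffs  # later spans no longer line up
        for name in fields:
            if before.get(name) != after.get(name):
                diffs.append({"index": index, "field": name,
                              "original": before.get(name), "replay": after.get(name)})
    if len(original) != len(replay):
        diffs.append({"index": min(len(original), len(replay)), "field": "span_count",
                      "original": len(original), "replay": len(replay)})
    return diffs


def cost_delta_usd(original_cost: dict, replay_cost: dict) -> float:
    """Positive means the replay cost more than the original run."""
    return round(replay_cost["total_usd"] - original_cost["total_usd"], 6)


original = [
    {"name": "agent.plan", "status": "ok", "action": "lookup_order"},
    {"name": "agent.tool", "status": "error", "action": "lookup_order"},
    {"name": "agent.validate", "status": "error", "validation": "schema_mismatch"},
]
replay = [
    {"name": "agent.plan", "status": "ok", "action": "lookup_order"},
    {"name": "agent.tool", "status": "ok", "action": "lookup_order"},
    {"name": "agent.validate", "status": "ok", "validation": "passed"},
]

diffs = diff_runs(original, replay)
for d in diffs:
    print(d)
print(cost_delta_usd({"total_usd": 0.08}, {"total_usd": 0.07}))
```

If the replay reproduces the failure, the diff comes back empty, which is a reasonable signal that the bundle is complete enough to promote into a regression fixture.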
Sampling strategy
- Keep all failed runs, policy escalations, high-cost runs, and user-visible incidents.
- Sample routine successful runs by tenant, route, model, and agent version.
- Use tail sampling when possible so the decision can see latency, status, and error attributes from the whole trace.
- Cap prompt and output capture separately from span sampling; telemetry volume and privacy risk are different controls.
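The keep-or-sample rules above might look like the following when written as a decision made after the run finishes. This is a simplified sketch of the idea behind tail sampling, not a collector configuration. The run fields, the `keep_trace` name, and the default thresholds are hypothetical, and a per-tenant or per-route rate could be looked up and passed in as `success_rate`.

```python
import hashlib


def keep_trace(run: dict, success_rate: float = 0.05, cost_ceiling_usd: float = 0.50) -> bool:
    """Decide after the run completes, when status and total cost are known."""
    # Always keep failures, policy escalations, and user-visible incidents.
    if run["status"] != "ok" or run.get("policy_escalated") or run.get("user_visible_incident"):
        return True
    # Always keep high-cost runs.
    if run["total_usd"] >= cost_ceiling_usd:
        return True
    # Deterministic sampling: the same trace ID always lands in the same bucket.
    digest = hashlib.sha256(run["trace_id"].encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < success_rate * 10_000


failed = {"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "status": "error", "total_usd": 0.01}
costly = {"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "status": "ok", "total_usd": 0.75}
routine = {"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "status": "ok", "total_usd": 0.08}

print(keep_trace(failed), keep_trace(costly), keep_trace(routine))
```

Hashing the trace ID instead of calling a random generator keeps the verdict stable across services and re-evaluations of the same trace.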
OpenTelemetry notes for GenAI agents
- OpenTelemetry Semantic Conventions 1.41.0 include GenAI conventions, but the GenAI pages are still marked development, so pin your emitted field set in code review.
- Use stable service and deployment resource attributes first; agent-specific attributes should not break standard trace correlation.
- Prefer low-cardinality attributes for dashboards and high-cardinality values as events, logs, or sampled payload references.
- Document any opt-in prompt or output capture with retention, access control, and redaction rules.
Quick Reference
Sticky ToC fields
- run_id: application-level workflow identifier.
- trace_id: distributed trace identifier used by the observability backend.
- span_id: single operation inside the trace.
- tool_name: stable name from the approved tool registry.
- attempt: retry or repair loop count.
- failure_class: normalized class such as validation_error, timeout, permission_denied, or policy_blocked.
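A normalized failure class usually comes from a single mapping function, so that dashboards see a small, fixed set of labels. The sketch below shows one possible mapping. The `PolicyBlocked` exception, the choice of which Python exceptions map to which class, and the `unknown` fallback are all assumptions for illustration.

```python
class PolicyBlocked(Exception):
    """Hypothetical exception raised by a guardrail; not from any real library."""


def classify_failure(exc: BaseException) -> str:
    """Map an exception to a low-cardinality failure_class label."""
    if isinstance(exc, PolicyBlocked):
        return "policy_blocked"
    if isinstance(exc, TimeoutError):
        return "timeout"
    if isinstance(exc, PermissionError):
        return "permission_denied"
    if isinstance(exc, (ValueError, TypeError)):
        return "validation_error"
    return "unknown"


for error in (PolicyBlocked("refund over limit"), TimeoutError(), PermissionError(),
              ValueError("bad schema"), RuntimeError("boom")):
    print(type(error).__name__, "->", classify_failure(error))
```

The order of checks matters because `TimeoutError` and `PermissionError` are both subclasses of `OSError`, so a broader `OSError` branch would have to come after them.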
Review checklist before launch
- Every agent run creates one root span and one durable run ID.
- Every tool call has timeout, status, duration, arguments hash, and side-effect marker.
- Every cost meter reports model, tool, retry, and total workflow cost.
- Every replay bundle excludes raw secrets and unnecessary personal data.
- Every dashboard can segment by agent version, route, tenant tier, and deployment environment.
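For the replay-bundle item in this checklist, a key-based redaction pass is one simple starting point. The sketch below walks a nested structure and replaces values stored under sensitive keys. The `SENSITIVE_KEYS` set and the `sanitize` helper are hypothetical, and key matching alone will miss secrets embedded in free text, so treat it as a first filter, not a complete privacy control.

```python
SENSITIVE_KEYS = {"api_key", "authorization", "password", "email"}  # assumed list
REDACTED = "[redacted]"


def sanitize(value):
    """Recursively replace values stored under sensitive keys."""
    if isinstance(value, dict):
        return {
            key: REDACTED if str(key).lower() in SENSITIVE_KEYS else sanitize(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [sanitize(item) for item in value]
    return value


bundle = {
    "run_id": "run-0187",
    "prompt_template_version": "support-router:v12",
    "tool_calls": [
        {"tool_name": "lookup_order", "arguments": {"email": "user@example.com"}},
        {"tool_name": "charge_card", "arguments": {"Authorization": "Bearer abc123"}},
    ],
}

clean = sanitize(bundle)
print(clean)
```

The function returns a new structure and leaves the input untouched, so the raw bundle can still be discarded or stored under stricter access rules.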
Frequently Asked Questions
What should I trace in an AI agent run?
How do I log agent tool calls without leaking sensitive data?
Log tool_name, status, duration, error class, side-effect marker, and a hash of arguments. Store raw arguments only through an explicit opt-in path with redaction, retention limits, and access controls.
What cost metrics matter for agent observability?
How do failure replay fixtures differ from traces?
Can I use OpenTelemetry for AI agent observability?