Developer Reference

Agent Observability Checklist: Traces, Logs, Replay

Dillip Chowdary
Tech Entrepreneur & Innovator · June 04, 2026 · 11 min read

Bottom Line

Instrument the agent loop as a workflow, not a transcript. Correlate spans, tool logs, cost counters, and replay-safe fixtures before production traffic scales.

Key Takeaways

  • One root span per agent run keeps traces readable and replayable.
  • Tool logs need status, duration, error class, argument hash, and side-effect marker.
  • Cost meters should include model, tool, retry, and total workflow spend.
  • Replay bundles must preserve versions and hashes without storing raw secrets.
  • GenAI semantic conventions are useful, but pin fields while they remain development-stage.

Production agents need more than a trace ID. They need span context for every reasoning step, structured tool logs, cost meters that survive retries, and replay fixtures that let engineers reproduce failures without exposing user data. This checklist is a breadth-first reference for instrumenting agent systems in June 2026, using OpenTelemetry-compatible signals where possible and keeping vendor-specific tracing as an optional layer.

Observability Checklist

The minimum viable setup is correlated traces, sanitized tool logs, token and cost counters, and replayable failure bundles.

Minimum signal set

  • Trace one root span per user task, then child spans for planning, retrieval, model calls, tool calls, validation, and final response.
  • Log every tool invocation with stable tool name, arguments hash, status, duration, error class, and side-effect marker.
  • Meter input tokens, output tokens, cached tokens, tool spend, retries, and total cost per completed workflow.
  • Attach replay metadata: prompt template version, model identifier, retrieval corpus version, feature flags, and tool contract version.
  • Keep raw prompts, tool arguments, and outputs opt-in; sanitize them before long-lived storage with a privacy pass such as the TechBytes Data Masking Tool.
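
The tool-log bullet above can be sketched as a small record builder. This is a minimal Node.js sketch, not a prescribed schema: `buildToolLog`, `stableStringify`, the field names, and the `orders.lookup` tool are illustrative assumptions. It hashes a key-sorted serialization of the arguments so the same call always yields the same hash, and the raw arguments never enter the log.

```javascript
const { createHash } = require('node:crypto');

// Serialize with sorted object keys so equal arguments always hash the same.
function stableStringify(value) {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const entries = Object.keys(value).sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(value[key])}`);
    return `{${entries.join(',')}}`;
  }
  return JSON.stringify(value);
}

// Build a structured tool log entry that carries a hash, never the raw arguments.
function buildToolLog({ toolName, args, status, durationMs, errorClass = null, sideEffect = false }) {
  const argsHash = createHash('sha256').update(stableStringify(args)).digest('hex');
  return {
    tool_name: toolName,
    arguments_hash: `sha256:${argsHash}`,
    status,
    duration_ms: durationMs,
    error_class: errorClass,
    side_effect: sideEffect,
  };
}

const entry = buildToolLog({
  toolName: 'orders.lookup', // hypothetical tool name
  args: { order_id: 'A-1001', include_items: true },
  status: 'ok',
  durationMs: 42,
});
console.log(entry);
```

A plain hash of low-entropy arguments (for example, short numeric IDs) can be guessed by brute force, so a keyed HMAC may be a better fit where the arguments themselves are sensitive.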

Span naming checklist

  • agent.run: root span for one user-visible task.
  • agent.plan: model or policy step that chooses the next action.
  • agent.tool: wrapper span around each external or internal tool call.
  • agent.validate: deterministic checks, guardrails, schema validation, or policy review.
  • agent.repair: retry or self-correction loop with attempt number and stop reason.
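
To make the hierarchy concrete, here is a dependency-free sketch of the span tree these names imply. `RunTrace` is a hypothetical in-memory recorder used only for illustration, not the OpenTelemetry SDK; the attribute values are made up.

```javascript
// Minimal in-memory span recorder; a stand-in for a real tracer, not the OpenTelemetry SDK.
class RunTrace {
  constructor(traceId) {
    this.traceId = traceId;
    this.spans = [];
  }

  // Record one span. parent is a span returned by an earlier call, or null for the root.
  start(name, parent = null, attributes = {}) {
    const span = {
      span_id: `span-${this.spans.length + 1}`,
      parent_id: parent ? parent.span_id : null,
      name,
      attributes,
    };
    this.spans.push(span);
    return span;
  }
}

const trace = new RunTrace('trace-demo');
const root = trace.start('agent.run', null, { run_id: 'run-0187' });
trace.start('agent.plan', root);
trace.start('agent.tool', root, { tool_name: 'orders.lookup' });
trace.start('agent.validate', root);
trace.start('agent.repair', root, { attempt: 1, stop_reason: 'schema_valid' });

// Exactly one span should have no parent: the agent.run root.
const roots = trace.spans.filter((span) => span.parent_id === null);
console.log(roots.length, trace.spans.map((span) => span.name));
```

In a real service the same shape would come from your tracer's parent/child context propagation; the point here is only that every step hangs off a single `agent.run` root.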

Live Search JS Filter

Use this tiny filter on internal runbooks, incident notes, or trace field catalogs. It is intentionally dependency-free so it works inside docs sites, static pages, and local replay reports.

<input id='obs-filter' type='search' placeholder='Filter checklist...' aria-label='Filter checklist'>
<ul id='obs-list'>
  <li data-tags='trace span root agent'>Create one root span per agent run</li>
  <li data-tags='tool log retry error'>Log tool name, duration, status, and error class</li>
  <li data-tags='cost token meter budget'>Meter token cost and tool cost per workflow</li>
  <li data-tags='replay fixture prompt model'>Store replay-safe prompt and config versions</li>
</ul>
<script>
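// Cache the input and rows once; the list is static.
const filter = document.querySelector('#obs-filter');
const rows = [...document.querySelectorAll('#obs-list li')];
filter.addEventListener('input', () => {
  const q = filter.value.trim().toLowerCase();
  rows.forEach((row) => {
    // Match against the visible text plus the data-tags keywords.
    const text = `${row.textContent} ${row.dataset.tags}`.toLowerCase();
    // Show every row when the query is empty; otherwise hide non-matches.
    row.hidden = q !== '' && !text.includes(q);
  });
});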
</script>

Keyboard shortcuts for review mode

Shortcut | Action | Use when
/ | Focus the trace or checklist filter | Jumping between fields during incident review
j | Move to next span, log row, or replay step | Scanning a long agent run chronologically
k | Move to previous span, log row, or replay step | Backtracking after a failed tool call
r | Open the replay panel for the selected failure | Reproducing a planner, retrieval, or validation miss
c | Copy trace ID, run ID, or replay fixture path | Sharing evidence in an issue or incident channel
Esc | Clear search or close the active panel | Returning to the full run timeline
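
If you build your own review UI, the shortcut table can be kept as data and resolved by one small function. This is a sketch under assumptions: the action names and `resolveShortcut` are hypothetical, and the browser wiring is shown only as a comment.

```javascript
// Shortcut table from the review-mode reference; action names are hypothetical.
const SHORTCUTS = {
  '/': 'focus-filter',
  j: 'next-item',
  k: 'previous-item',
  r: 'open-replay',
  c: 'copy-reference',
  Escape: 'clear-or-close',
};

// Resolve a key press to an action name, or null when the key should be ignored.
// While the user is typing in a field, only Escape stays active.
function resolveShortcut(key, isTypingInField = false) {
  if (isTypingInField && key !== 'Escape') return null;
  return Object.prototype.hasOwnProperty.call(SHORTCUTS, key) ? SHORTCUTS[key] : null;
}

// In a browser, the wiring could look roughly like this (runAction is hypothetical):
// document.addEventListener('keydown', (event) => {
//   const typing = event.target.matches('input, textarea');
//   const action = resolveShortcut(event.key, typing);
//   if (action) { event.preventDefault(); runAction(action); }
// });

console.log(resolveShortcut('j'), resolveShortcut('j', true), resolveShortcut('Escape', true));
```

Suppressing single-letter shortcuts while a field has focus keeps `j`, `k`, `r`, and `c` from firing when someone types them into the filter.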

Commands Grouped By Purpose

Start local OTLP collection

The OpenTelemetry Collector accepts OTLP traffic and can fan out traces, metrics, and logs to your backend. Replace the image tag with the version your platform has approved.

docker run --rm \
  -p 4317:4317 \
  -p 4318:4318 \
  -v "$PWD/otel-collector.yaml:/etc/otelcol/config.yaml" \
  otel/opentelemetry-collector:latest \
  --config=/etc/otelcol/config.yaml

Export agent telemetry settings

export OTEL_SERVICE_NAME='checkout-agent'
export OTEL_EXPORTER_OTLP_ENDPOINT='http://localhost:4318'
export OTEL_EXPORTER_OTLP_PROTOCOL='http/protobuf'
export OTEL_RESOURCE_ATTRIBUTES='deployment.environment=dev,service.namespace=agents'

Capture a replay bundle

mkdir -p replay/failed-runs
cp traces/run-0187.json replay/failed-runs/
cp configs/agent.runtime.json replay/failed-runs/
cp fixtures/retrieval-snapshot.json replay/failed-runs/
# Hash only the bundled JSON files so a rerun does not hash the old manifest itself
sha256sum replay/failed-runs/*.json > replay/failed-runs/MANIFEST.sha256
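
On the shell, `sha256sum -c replay/failed-runs/MANIFEST.sha256` re-checks the bundle. If a replay harness needs the same check in process, a Node.js sketch could look like the following; `verifyManifest` and its return shape are assumptions for illustration.

```javascript
const { createHash } = require('node:crypto');
const fs = require('node:fs');
const path = require('node:path');

// Recompute SHA-256 for each manifest entry and report files that changed or went missing.
// baseDir is the directory sha256sum was run from, since manifest paths are relative to it.
function verifyManifest(manifestPath, baseDir) {
  const failures = [];
  const lines = fs.readFileSync(manifestPath, 'utf8').split('\n').filter(Boolean);
  for (const line of lines) {
    // sha256sum writes "<hex digest>  <path>"; a "*" before the path marks binary mode.
    const match = line.match(/^([0-9a-f]{64}) [ *](.+)$/);
    if (!match) {
      failures.push({ file: line, reason: 'unparseable line' });
      continue;
    }
    const [, expected, file] = match;
    const fullPath = path.join(baseDir, file);
    if (!fs.existsSync(fullPath)) {
      failures.push({ file, reason: 'missing' });
      continue;
    }
    const actual = createHash('sha256').update(fs.readFileSync(fullPath)).digest('hex');
    if (actual !== expected) failures.push({ file, reason: 'hash mismatch' });
  }
  return failures;
}

// Usage, run from the repository root:
// const failures = verifyManifest('replay/failed-runs/MANIFEST.sha256', '.');
// if (failures.length > 0) throw new Error(JSON.stringify(failures));
```

An empty result means every listed file still matches its recorded hash; the sketch does not handle the escaped file names that `sha256sum` emits for paths containing backslashes or newlines.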

Inspect a run quickly

jq '.spans[] | {name, status, duration_ms, tool: .attributes.tool_name}' traces/run-0187.json
jq '.cost | {input_tokens, output_tokens, tool_usd, total_usd}' traces/run-0187.json

Watch out: Do not store raw user prompts or tool arguments just to make replay easier. Store sanitized payloads, hashes, versions, and explicit opt-in samples.
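
One way to apply this before a payload reaches a replay bundle is a key-based redaction pass. The sketch below is deliberately blunt and rests on assumptions: the `SENSITIVE_KEYS` deny-list and the `sanitize` helper are hypothetical, and matching on field names will not catch secrets embedded in free text.

```javascript
// Hypothetical deny-list of field names that must never reach long-lived storage.
const SENSITIVE_KEYS = new Set(['password', 'api_key', 'authorization', 'email', 'card_number']);

// Replace values stored under sensitive keys, recursing through nested objects and arrays.
function sanitize(value, key = '') {
  if (SENSITIVE_KEYS.has(key.toLowerCase())) return '[redacted]';
  if (Array.isArray(value)) return value.map((item) => sanitize(item, key));
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([childKey, child]) => [childKey, sanitize(child, childKey)]),
    );
  }
  return value;
}

const cleaned = sanitize({
  order_id: 'A-1001',
  customer: { email: 'user@example.com', tier: 'pro' },
  headers: [{ Authorization: 'Bearer abc123' }],
});
console.log(JSON.stringify(cleaned));
```

A dedicated masking tool or value-based detector is still needed for prompts and other free-form text, where sensitive data does not sit under a predictable key.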

Configuration

Collector pipeline

This skeleton keeps the shape simple: receive OTLP, batch records, and export to a backend. Add processors for redaction, sampling, and routing before production.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch: {}
exporters:
  debug: {}
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]

Agent run schema

{
  "run_id": "run-0187",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "user_task_hash": "sha256:...",
  "agent_version": "2026-06-04.1",
  "model": "provider-model-id",
  "prompt_template_version": "support-router:v12",
  "retrieval_snapshot": "kb-index:2026-06-04",
  "cost": {
    "input_tokens": 1840,
    "output_tokens": 612,
    "tool_usd": 0.03,
    "total_usd": 0.08
  }
}
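
A run record like this is only useful if every producer fills it in consistently. The sketch below shows one possible check at write time; `validateRunRecord` and the required-field list are assumptions based on the example above, not a published schema.

```javascript
// Fields the example record treats as mandatory; adjust to your own schema.
const REQUIRED_FIELDS = [
  'run_id', 'trace_id', 'agent_version', 'model',
  'prompt_template_version', 'retrieval_snapshot', 'cost',
];

// Return a list of human-readable problems; an empty list means the record passed.
function validateRunRecord(record) {
  const problems = REQUIRED_FIELDS
    .filter((field) => record[field] === undefined || record[field] === null)
    .map((field) => `missing ${field}`);
  // W3C Trace Context trace IDs are 32 lowercase hex characters and must not be all zeros.
  if (typeof record.trace_id === 'string') {
    const wellFormed = /^[0-9a-f]{32}$/.test(record.trace_id) && !/^0+$/.test(record.trace_id);
    if (!wellFormed) problems.push('trace_id is not a valid 32-character hex trace ID');
  }
  return problems;
}

const sampleRecord = {
  run_id: 'run-0187',
  trace_id: '4bf92f3577b34da6a3ce929d0e0e4736',
  agent_version: '2026-06-04.1',
  model: 'provider-model-id',
  prompt_template_version: 'support-router:v12',
  retrieval_snapshot: 'kb-index:2026-06-04',
  cost: { input_tokens: 1840, output_tokens: 612, tool_usd: 0.03, total_usd: 0.08 },
};
console.log(validateRunRecord(sampleRecord));
```

Rejecting incomplete records at write time is cheaper than discovering during an incident that a run cannot be replayed because its template or snapshot version was never recorded.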

Cost meter fields

  • input_tokens: tokens sent to the model after prompt assembly and retrieval.
  • output_tokens: tokens returned by the model before post-processing.
  • cached_tokens: provider-reported cache hits when available.
  • tool_usd: billable cost from search, browser, database, code execution, or third-party APIs.
  • retry_usd: incremental cost caused by repair loops, validation failures, or transient tool errors.
  • total_usd: workflow-level total, not just the final model call.
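
These fields can be rolled up from per-call records. The sketch below is one way to do it under loudly stated assumptions: the prices in `USD_PER_MTOK` are invented placeholders, `cached_tokens` is assumed to be reported separately from `input_tokens` (providers differ on this), and `retry_usd` is treated as a breakdown of the total rather than an extra charge.

```javascript
// Hypothetical prices in USD per million tokens; substitute your provider's real rates.
const USD_PER_MTOK = { input: 3.0, output: 15.0, cached: 0.3 };

// Price one model call. Assumes cached_tokens are reported separately from input_tokens.
function modelCallUsd(call) {
  return (
    call.input_tokens * USD_PER_MTOK.input +
    call.output_tokens * USD_PER_MTOK.output +
    (call.cached_tokens || 0) * USD_PER_MTOK.cached
  ) / 1e6;
}

// Roll model calls and tool charges up to workflow-level meters.
// Anything with attempt > 1 also counts toward retry_usd.
function summarizeCost(modelCalls, toolCharges) {
  const sum = (items, pick) => items.reduce((total, item) => total + pick(item), 0);
  const isRetry = (item) => item.attempt > 1;
  const modelUsd = sum(modelCalls, modelCallUsd);
  const toolUsd = sum(toolCharges, (charge) => charge.usd);
  const retryUsd =
    sum(modelCalls.filter(isRetry), modelCallUsd) +
    sum(toolCharges.filter(isRetry), (charge) => charge.usd);
  return {
    input_tokens: sum(modelCalls, (call) => call.input_tokens),
    output_tokens: sum(modelCalls, (call) => call.output_tokens),
    cached_tokens: sum(modelCalls, (call) => call.cached_tokens || 0),
    tool_usd: toolUsd,
    retry_usd: retryUsd,
    total_usd: modelUsd + toolUsd,
  };
}

const cost = summarizeCost(
  [
    { input_tokens: 1000, output_tokens: 200, cached_tokens: 0, attempt: 1 },
    { input_tokens: 1000, output_tokens: 100, cached_tokens: 0, attempt: 2 },
  ],
  [
    { usd: 0.01, attempt: 1 },
    { usd: 0.01, attempt: 2 },
  ],
);
console.log(cost);
```

In this example the second attempt accounts for nearly half of the workflow total, which is the kind of signal a final-call-only meter would hide.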

Advanced Usage

Failure replay workflow

  1. Start from the failed root span and confirm the user-visible failure, not just an internal warning.
  2. Load the replay bundle with sanitized prompt inputs, prompt template version, model settings, feature flags, and retrieval snapshot.
  3. Run tools in dry-run mode when side effects are possible.
  4. Compare original and replay spans by status, duration, selected action, validation result, and cost delta.
  5. Promote the failure into a regression fixture when it represents a product risk, not a one-off outage.
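
Step 4 can be sketched as a small diff over two runs. The run shape, the `diffRuns` helper, and the compared field names are assumptions for illustration; spans are paired by name and order of occurrence, and spans that appear only in the replay are not reported.

```javascript
// Key each span by name plus occurrence number, e.g. "agent.tool#2".
function indexSpans(spans) {
  const counts = {};
  const byKey = new Map();
  for (const span of spans) {
    counts[span.name] = (counts[span.name] || 0) + 1;
    byKey.set(`${span.name}#${counts[span.name]}`, span);
  }
  return byKey;
}

const COMPARED_FIELDS = ['status', 'selected_action', 'validation_result'];
const totalDuration = (spans) => spans.reduce((sum, span) => sum + (span.duration_ms || 0), 0);

// Compare an original run with its replay and list the fields that diverged.
function diffRuns(original, replay) {
  const replayed = indexSpans(replay.spans);
  const differences = [];
  for (const [key, span] of indexSpans(original.spans)) {
    const other = replayed.get(key);
    if (!other) {
      differences.push({ span: key, field: 'presence', original: 'present', replay: 'missing' });
      continue;
    }
    for (const field of COMPARED_FIELDS) {
      if (span[field] !== other[field]) {
        differences.push({ span: key, field, original: span[field], replay: other[field] });
      }
    }
  }
  return {
    differences,
    duration_delta_ms: totalDuration(replay.spans) - totalDuration(original.spans),
    cost_delta_usd: replay.total_usd - original.total_usd,
  };
}

const originalRun = {
  total_usd: 0.5,
  spans: [
    { name: 'agent.plan', status: 'ok', selected_action: 'orders.lookup', duration_ms: 120 },
    { name: 'agent.validate', status: 'error', validation_result: 'schema_fail', duration_ms: 8 },
  ],
};
const replayRun = {
  total_usd: 0.25,
  spans: [
    { name: 'agent.plan', status: 'ok', selected_action: 'orders.lookup', duration_ms: 100 },
    { name: 'agent.validate', status: 'ok', validation_result: 'pass', duration_ms: 8 },
  ],
};
console.log(diffRuns(originalRun, replayRun));
```

A replay that no longer reproduces the failure, as in this example, is itself a finding: it suggests the cause lies in something the bundle did not capture.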

Sampling strategy

  • Keep all failed runs, policy escalations, high-cost runs, and user-visible incidents.
  • Sample routine successful runs by tenant, route, model, and agent version.
  • Use tail sampling when possible so the decision can see latency, status, and error attributes from the whole trace.
  • Cap prompt and output capture separately from span sampling; telemetry volume and privacy risk are different controls.
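
The keep rules above can be expressed as a decision made after a run finishes. This is a sketch, not a collector configuration: `shouldKeep`, the run fields, and the thresholds are hypothetical, and the hash-based fraction is just one way to make the decision repeatable for a given trace ID.

```javascript
const { createHash } = require('node:crypto');

// Map a trace ID to a stable fraction in [0, 1) so repeated decisions for one trace agree.
function traceFraction(traceId) {
  const digest = createHash('sha256').update(traceId).digest();
  return digest.readUInt32BE(0) / 2 ** 32;
}

// Decide after the run finishes (tail-style), when status and cost are known.
function shouldKeep(run, { defaultRate = 0.05, rateByRoute = {}, highCostUsd = 0.5 } = {}) {
  if (run.status !== 'ok') return true; // keep all failed runs
  if (run.policy_escalated || run.user_visible_incident) return true;
  if (run.total_usd >= highCostUsd) return true; // keep high-cost runs
  const rate = rateByRoute[run.route] ?? defaultRate; // sample routine successes per route
  return traceFraction(run.trace_id) < rate;
}

const routineRun = {
  trace_id: '4bf92f3577b34da6a3ce929d0e0e4736',
  status: 'ok',
  route: 'support',
  total_usd: 0.08,
};
console.log(shouldKeep({ ...routineRun, status: 'error' }));
console.log(shouldKeep(routineRun, { rateByRoute: { support: 1 } }));
```

The same rate table could be keyed by tenant, model, or agent version instead of route; what matters is that the sampling decision is separate from the decision to capture prompt and output payloads.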

OpenTelemetry notes for GenAI agents

  • OpenTelemetry Semantic Conventions 1.41.0 include GenAI conventions, but the GenAI pages are still marked as Development rather than Stable, so pin your emitted field set in code review.
  • Use stable service and deployment resource attributes first; agent-specific attributes should not break standard trace correlation.
  • Prefer low-cardinality attributes for dashboards and high-cardinality values as events, logs, or sampled payload references.
  • Document any opt-in prompt or output capture with retention, access control, and redaction rules.

Pro tip: Treat replay fixtures like tests: version them, hash them, redact them, and delete them when the product risk is gone.
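
Pinning the emitted field set, as the OpenTelemetry notes suggest, can be as simple as an allow-list applied before attributes are attached to a span. In this sketch the `gen_ai.*` names are examples drawn from the GenAI semantic conventions and should be confirmed against the version you pin; the `agent.*` names and `pinAttributes` are hypothetical.

```javascript
// Pinned attribute set, changed only through code review.
const PINNED_ATTRIBUTES = new Set([
  'gen_ai.operation.name',
  'gen_ai.request.model',
  'gen_ai.usage.input_tokens',
  'gen_ai.usage.output_tokens',
  'agent.run_id',  // hypothetical application-specific attribute
  'agent.attempt', // hypothetical application-specific attribute
]);

// Split attributes into those allowed on the span and those dropped for review.
function pinAttributes(attributes) {
  const kept = {};
  const dropped = [];
  for (const [name, value] of Object.entries(attributes)) {
    if (PINNED_ATTRIBUTES.has(name)) kept[name] = value;
    else dropped.push(name);
  }
  return { kept, dropped };
}

const result = pinAttributes({
  'gen_ai.request.model': 'provider-model-id',
  'gen_ai.usage.input_tokens': 1840,
  'agent.run_id': 'run-0187',
  'user.prompt_text': 'raw prompt that should not be emitted',
});
console.log(result.kept, result.dropped);
```

Logging the dropped names (not their values) makes it visible when a library upgrade starts emitting fields nobody reviewed.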

Quick Reference

Core correlation fields

  • run_id: application-level workflow identifier.
  • trace_id: distributed trace identifier used by the observability backend.
  • span_id: single operation inside the trace.
  • tool_name: stable name from the approved tool registry.
  • attempt: retry or repair loop count.
  • failure_class: normalized class such as validation_error, timeout, permission_denied, or policy_blocked.
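
Normalizing `failure_class` usually means mapping many raw error shapes onto a short, fixed list of snake_case values. The sketch below assumes errors expose `name`, `code`, or `status` properties; those shapes, the `POLICY_BLOCKED` code, and the `unknown` fallback are assumptions about your tool clients, not a standard.

```javascript
// Map raw errors onto a small, fixed set of failure_class values.
function normalizeFailureClass(error) {
  if (error.name === 'ValidationError') return 'validation_error';
  if (error.name === 'AbortError' || error.code === 'ETIMEDOUT') return 'timeout';
  if (error.status === 401 || error.status === 403) return 'permission_denied';
  if (error.code === 'POLICY_BLOCKED') return 'policy_blocked'; // hypothetical guardrail code
  return 'unknown';
}

console.log(normalizeFailureClass({ code: 'ETIMEDOUT' }));
console.log(normalizeFailureClass({ status: 403 }));
```

Keeping the set small keeps the attribute low-cardinality, so it stays usable as a dashboard dimension.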

Review checklist before launch

  • Every agent run creates one root span and one durable run ID.
  • Every tool call has timeout, status, duration, arguments hash, and side-effect marker.
  • Every cost meter reports model, tool, retry, and total workflow cost.
  • Every replay bundle excludes raw secrets and unnecessary personal data.
  • Every dashboard can segment by agent version, route, tenant tier, and deployment environment.

Frequently Asked Questions

What should I trace in an AI agent run?
Trace the whole workflow, not only the model call. Use one root span for the user task, then child spans for planning, retrieval, tool execution, validation, repair loops, and final response.

How do I log agent tool calls without leaking sensitive data?
Log stable metadata such as tool_name, status, duration, error class, side-effect marker, and a hash of arguments. Store raw arguments only through an explicit opt-in path with redaction, retention limits, and access controls.

What cost metrics matter for agent observability?
Track input tokens, output tokens, cached tokens, tool spend, retry cost, and total workflow cost. The key number is cost per completed task because retries and tools can dominate the final model call.

How do failure replay fixtures differ from traces?
A trace explains what happened during a run. A replay fixture contains the sanitized inputs, config versions, model settings, retrieval snapshot, and tool contracts needed to reproduce the failure later.

Can I use OpenTelemetry for AI agent observability?
Yes, OpenTelemetry is a practical base for traces, metrics, logs, and OTLP export. For GenAI-specific fields, treat current semantic conventions as evolving and pin the exact attributes your services emit.
