Agent Observability Checklist [Developer Cheat Sheet]
Bottom Line
Agent observability is complete only when traces, tool logs, cost meters, and replay bundles connect to the same workflow ID. That shared ID turns vague failures into reproducible engineering work.
Key Takeaways
- One trace per user task; spans for model calls, tools, guardrails, handoffs, and retries.
- Cost attribution belongs on model spans, not only monthly provider invoices.
- Replay bundles need sanitized inputs, fixed fixtures, prompt versions, and assertions.
- Use OTLP for portable export, then add vendor-specific search and eval features.
Agent observability is the operating checklist for systems that reason, call tools, spend tokens, and fail in ways ordinary HTTP logs cannot explain. This reference, current as of June 18, 2026, maps what a production setup needs: trace every agent step, capture structured tool logs, meter cost per request, and save replay bundles that reproduce failures without exposing private data.
- Capture one trace per user task, with spans for model calls, tool calls, guardrails, and handoffs.
- Keep prompts, tool inputs, and outputs redacted by default; store replay fixtures separately from secrets.
- Meter tokens and wall time at the span level so routing, retries, and tools can be cost-attributed.
- Use OTLP for portable telemetry, then add vendor features for search, evals, and replay.
Agent Observability Checklist
Bottom Line
An observable agent has a trace tree, correlated tool logs, cost meters, and a sanitized replay bundle for every important failure path. If any one is missing, debugging turns into guesswork.
Minimum Signals
- Trace root: one workflow trace per user request, job, conversation turn, or scheduled task.
- Span taxonomy: create spans for planning, model calls, retrieval, tool calls, guardrails, handoffs, retries, and final response formatting.
- Tool log: record tool name, arguments hash, permission scope, exit status, duration, result size, and sanitized error text.
- Cost meter: attach input tokens, output tokens, cached tokens when available, model name, provider, retry count, and route choice to the model span.
- Replay bundle: preserve prompt template version, normalized inputs, retrieved document IDs, tool fixtures, model settings, and seed or deterministic controls when supported.
- Privacy gate: redact secrets, credentials, personal data, raw customer payloads, and proprietary context before telemetry leaves the process.
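As a rough illustration of the privacy gate, the sketch below masks sensitive keys in an event before it is handed to an exporter. The key list, the `redact` helper, and the sample `lookup_order` event are assumptions made for this example, not part of any SDK. The pattern is anchored on whole key names so that metering fields such as `input_tokens` are not masked by accident.

```javascript
// Hypothetical list of sensitive key names; extend it to match your own payloads.
// Anchored with ^...$ so that keys like "input_tokens" are left alone.
const FORBIDDEN_KEY =
  /^(password|passwd|secret|client_secret|access_token|refresh_token|api[_-]?key|authorization|cookie)$/i;

// Return a deep copy of `value` with the values of sensitive keys masked.
// The input object is never mutated.
function redact(value) {
  if (Array.isArray(value)) {
    return value.map(redact);
  }
  if (value !== null && typeof value === 'object') {
    const out = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = FORBIDDEN_KEY.test(key) ? '[REDACTED]' : redact(inner);
    }
    return out;
  }
  return value;
}

// Hypothetical tool event, shaped like the JSONL events used elsewhere in this sheet.
const toolEvent = {
  type: 'tool_call',
  tool: 'lookup_order',
  args: { order_id: 'o_123', api_key: 'sk-live-example' },
  headers: [{ Authorization: 'Bearer abc' }],
  input_tokens: 42,
};

console.log(JSON.stringify(redact(toolEvent), null, 2));
```

Key-name matching only catches secrets that sit under predictable keys. Free-text fields such as prompts and error messages need separate value-level scanning.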
Use portable telemetry first. The OpenTelemetry OTLP exporter configuration defines endpoints, headers, timeouts, and protocols for traces, metrics, and logs. Framework-specific tracing can sit on top: OpenAI Agents SDK tracing records generations, tool calls, handoffs, guardrails, and custom events, while LangSmith tracing provides project-based trace inspection.
Live Search JS Filter
Cheat sheets work best when engineers can narrow the list fast. Drop this filter into an internal runbook page and tag each row with the signal, owner, and severity.
<label for='obs-filter'>Filter checklist</label>
<input id='obs-filter' type='search' placeholder='trace, cost, replay, tool log' autocomplete='off'>
<ul id='obs-list'>
  <li data-tags='trace model cost'>Model span includes provider, model, tokens, latency, retry count</li>
  <li data-tags='tool log replay'>Tool call stores sanitized args hash, exit status, fixture pointer</li>
  <li data-tags='privacy security'>Exporter redacts secrets before transport</li>
</ul>
<script>
  const input = document.querySelector('#obs-filter');
  const items = [...document.querySelectorAll('#obs-list li')];
  // Hide rows whose visible text and tags do not contain the query.
  // An empty query shows every row.
  input.addEventListener('input', () => {
    const q = input.value.trim().toLowerCase();
    for (const item of items) {
      const text = `${item.textContent} ${item.dataset.tags}`.toLowerCase();
      item.hidden = q !== '' && !text.includes(q);
    }
  });
</script>
Filterable Fields
- Signal: trace, metric, log, event, replay, eval, alert.
- Owner: platform, AI engineering, security, data, product, support.
- Severity: blocker, high, medium, low, hygiene.
- Runtime: Node.js, Python, browser, worker, batch, queue consumer.
Keyboard Shortcuts Table
For an internal trace explorer, wire shortcuts to navigation and replay actions. Keep destructive actions behind confirmation and respect focused form fields.
| Shortcut | Action | Use when |
|---|---|---|
| / | Focus trace search | Jump from a failure report to a trace ID, user ID hash, or workflow name. |
| j / k | Next or previous span | Move through a trace tree without losing detail-pane focus. |
| e | Open error span | Skip straight to the first failed tool, model, or guardrail span. |
| c | Copy trace link | Paste a stable permalink into an incident, pull request, or support ticket. |
| r | Open replay bundle | Inspect the sanitized fixture that reproduces the failure. |
| ? | Show shortcut help | Expose shortcuts without putting instructional text in the main workflow. |
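One way to wire the table above while respecting focused form fields is to keep the key-to-action lookup as a pure function and attach it to the page separately. The action names and the `trace-shortcut` event name below are assumptions for illustration; a real trace explorer would map them to its own handlers.

```javascript
// Keys come from the shortcut table; the action names are hypothetical.
const SHORTCUTS = new Map([
  ['/', 'focusSearch'],
  ['j', 'nextSpan'],
  ['k', 'previousSpan'],
  ['e', 'openErrorSpan'],
  ['c', 'copyTraceLink'],
  ['r', 'openReplayBundle'],
  ['?', 'showHelp'],
]);

const EDITABLE_TAGS = new Set(['INPUT', 'TEXTAREA', 'SELECT']);

// Pure lookup: returns an action name, or null when the key press should be ignored.
function resolveShortcut(event) {
  const target = event.target || {};
  const typing = EDITABLE_TAGS.has(target.tagName) || target.isContentEditable === true;
  // Ignore keys typed into form fields and browser/OS chords such as Ctrl+R.
  if (typing || event.ctrlKey || event.metaKey || event.altKey) {
    return null;
  }
  return SHORTCUTS.get(event.key) ?? null;
}

// Browser wiring; skipped when no DOM is present (for example under Node.js).
if (typeof document !== 'undefined') {
  document.addEventListener('keydown', (event) => {
    const action = resolveShortcut(event);
    if (action === null) return;
    event.preventDefault();
    // Hypothetical event name; the explorer UI would listen for it and run the action.
    document.dispatchEvent(new CustomEvent('trace-shortcut', { detail: { action } }));
  });
}
```

Keeping `resolveShortcut` free of DOM calls means the mapping can be unit-tested without a browser. Any destructive action should still ask for confirmation in its own handler.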
Commands Grouped By Purpose
Install And Bootstrap
These commands use the documented OpenTelemetry JavaScript setup and standard shell tools. Replace placeholder endpoints and keys with your own values.
npm install --save @opentelemetry/api
npm install --save @opentelemetry/auto-instrumentations-node
export OTEL_SERVICE_NAME='agent-worker'
export OTEL_TRACES_EXPORTER='otlp'
export OTEL_EXPORTER_OTLP_ENDPOINT='http://localhost:4318'
export OTEL_EXPORTER_OTLP_PROTOCOL='http/protobuf'
export NODE_OPTIONS='--require @opentelemetry/auto-instrumentations-node/register'
node app.js
Trace Lookup
- By trace ID: search the trace backend for the root trace, then inspect model and tool child spans.
- By request ID: correlate application logs to the trace root using a shared request identifier.
- By workflow: group traces by workflow name to compare planner, retrieval, and tool latency across releases.
TRACE_ID='trace_abc123'
rg "$TRACE_ID" ./logs ./replays
Tool Log Triage
jq 'select(.type == "tool_call" and .status != "ok") | {trace_id, tool, duration_ms, status, error}' agent-events.jsonl
Cost Rollups
jq -s '
  map(select(.type == "model_call"))
  | group_by(.model)
  | map({
      model: .[0].model,
      calls: length,
      input_tokens: (map(.input_tokens // 0) | add),
      output_tokens: (map(.output_tokens // 0) | add)
    })
' agent-events.jsonl
Replay Bundle Creation
mkdir -p "replays/$TRACE_ID"
jq --arg trace "$TRACE_ID" 'select(.trace_id == $trace)' agent-events.jsonl > "replays/$TRACE_ID/events.jsonl"
cp prompts/customer-support.current.md "replays/$TRACE_ID/prompt.md"
cp fixtures/retrieval-results.json "replays/$TRACE_ID/retrieval.json"
Configuration
Environment Variables
| Variable | Purpose | Notes |
|---|---|---|
| OTEL_SERVICE_NAME | Names the emitting service | Use stable names such as agent-api, agent-worker, or agent-evals. |
| OTEL_EXPORTER_OTLP_ENDPOINT | Sets the base OTLP endpoint | Defaults vary by protocol and SDK (commonly http://localhost:4318 for HTTP and http://localhost:4317 for gRPC); set it explicitly per environment. |
| OTEL_EXPORTER_OTLP_TRACES_ENDPOINT | Overrides the trace export endpoint | Use when traces, metrics, and logs route to different collectors. |
| OTEL_EXPORTER_OTLP_HEADERS | Adds exporter headers | Keep tokens in a secret manager, not in committed config. |
| OTEL_EXPORTER_OTLP_TIMEOUT | Controls exporter timeout in milliseconds | Set low enough that telemetry cannot stall the agent hot path. |
| OTEL_LOG_LEVEL | Controls OpenTelemetry diagnostic logging | Use debug briefly during instrumentation work; production should stay quiet. |
| LANGSMITH_TRACING | Enables LangSmith tracing | Set to true when using LangSmith projects. |
| OPENAI_AGENTS_DISABLE_TRACING | Disables OpenAI Agents SDK tracing | Set to 1 only when policy or environment requires it. |
Span Attribute Checklist
- Identity: trace ID, span ID, parent span ID, workflow name, tenant hash, request hash.
- Model: provider, model name, temperature, max output setting, input tokens, output tokens, finish reason.
- Tool: tool name, version, permission scope, argument schema version, exit code, result bytes.
- Retriever: index name, query hash, top-k setting, returned document IDs, score range.
- Policy: guardrail name, decision, blocked reason, human review status.
- Release: app version, prompt version, routing policy version, deployment region.
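To make the model and release rows of the checklist concrete, here is a minimal sketch that builds a flat attribute object for one model span and attaches an estimated cost. The attribute names, the `example-small` model, and the price table are all assumptions for this example. They are not official semantic conventions or real provider rates.

```javascript
// Hypothetical prices in USD per one million tokens; substitute your provider's real rates.
const PRICE_PER_MILLION = new Map([
  ['example-small', { input: 0.5, output: 1.5 }],
]);

// Build a flat attribute object for one model span.
// Attribute names are illustrative; align them with your backend's conventions.
function modelSpanAttributes(call) {
  const attrs = {
    'agent.workflow': call.workflow,
    'agent.prompt_version': call.promptVersion,
    'llm.provider': call.provider,
    'llm.model': call.model,
    'llm.input_tokens': call.inputTokens,
    'llm.output_tokens': call.outputTokens,
    'llm.retry_count': call.retryCount ?? 0,
    'llm.finish_reason': call.finishReason,
  };
  const price = PRICE_PER_MILLION.get(call.model);
  if (price) {
    // Cost is only attached when a price is known, so missing prices stay visible.
    attrs['llm.cost_usd'] =
      (call.inputTokens * price.input + call.outputTokens * price.output) / 1_000_000;
  }
  return attrs;
}

const attrs = modelSpanAttributes({
  workflow: 'support_refund_agent',
  promptVersion: 'customer-support.current',
  provider: 'example-provider',
  model: 'example-small',
  inputTokens: 1000,
  outputTokens: 500,
  finishReason: 'stop',
});
console.log(attrs);
```

Because the result is a flat map of primitive values, it can be passed to a span's `setAttributes` call in the OpenTelemetry API. Computing cost at span time keeps attribution next to the retry and routing context that caused it.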
Advanced Usage
Failure Replay Contract
A replay bundle should be deterministic enough for debugging and sanitized enough for broad engineering access. Treat it as a contract between production, CI, and incident review.
- Manifest: include trace ID, created time, app version, prompt version, model settings, and owners.
- Inputs: store normalized user input after masking and policy classification.
- Context: save retrieval document IDs and fixed snippets, not a live query that can drift.
- Tools: capture fixture responses for external APIs, file reads, database rows, and permission checks.
- Assertions: define expected failure, expected fix behavior, or expected guardrail decision.
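A contract like this can be enforced in CI with a small manifest check. The sketch below is one possible shape, not a standard. The `checkReplayManifest` helper is hypothetical, and the field names mirror the sample manifest in this section, so adjust the required list to whatever your bundles actually carry.

```javascript
// Returns a list of problems; an empty list means the manifest passes this sketch's checks.
function checkReplayManifest(manifest) {
  const problems = [];

  // Required top-level string fields (assumed from the sample manifest).
  for (const field of ['trace_id', 'workflow', 'prompt_version', 'model']) {
    if (typeof manifest[field] !== 'string' || manifest[field] === '') {
      problems.push(`missing ${field}`);
    }
  }

  // Fixture pointers keep the replay independent of live retrieval and live tools.
  const fixtures = manifest.fixtures ?? {};
  for (const name of ['retrieval', 'tools']) {
    if (typeof fixtures[name] !== 'string' || fixtures[name] === '') {
      problems.push(`missing fixtures.${name}`);
    }
  }

  // A bundle without assertions reproduces a run but cannot fail a build.
  if (!Array.isArray(manifest.assertions) || manifest.assertions.length === 0) {
    problems.push('no assertions defined');
  }
  return problems;
}

const problems = checkReplayManifest({
  trace_id: 'trace_abc123',
  workflow: 'support_refund_agent',
  prompt_version: 'customer-support.current',
  model: 'configured-by-runtime',
  fixtures: { retrieval: 'retrieval.json' },
  assertions: [],
});
console.log(problems);
```

Returning every problem at once, rather than throwing on the first, lets an incident reviewer fix a bundle in one pass.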
{
  "trace_id": "trace_abc123",
  "workflow": "support_refund_agent",
  "release_channel": "current",
  "prompt_version": "customer-support.current",
  "model": "configured-by-runtime",
  "fixtures": {
    "retrieval": "retrieval.json",
    "tools": "tools.jsonl"
  },
  "assertions": [
    "refund_tool is not called without order ownership",
    "final_response includes escalation path"
  ]
}
Alerting Rules
- Cost spike: page when cost per successful workflow exceeds the rolling budget threshold.
- Tool failure: alert on a rising error rate for tools that mutate state or affect money movement.
- Replay gap: open a ticket when a failed production trace has no replay bundle after the retention delay.
- Trace break: alert when root traces lack model or tool child spans after a deploy.
- Privacy violation: block export and notify security when telemetry contains forbidden key patterns.
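The cost-spike rule can be sketched as a pure function over a window of finished workflows. The record shape (`status`, `costUsd`) and the `shouldPageOnCost` name are assumptions for illustration. The sketch also makes one explicit design choice: spend from failed workflows counts toward the numerator, so failures and retries push the metric up instead of hiding in it.

```javascript
// Decide whether to page, given workflows in the current window and a budget in USD.
// Cost per successful workflow = total spend (including failures) / number of successes.
function shouldPageOnCost(workflows, budgetUsd) {
  const totalUsd = workflows.reduce((sum, w) => sum + w.costUsd, 0);
  const successes = workflows.filter((w) => w.status === 'ok').length;
  if (successes === 0) {
    // Money spent with nothing to show for it is treated as a spike.
    return totalUsd > 0;
  }
  return totalUsd / successes > budgetUsd;
}

// Hypothetical window: two successes and one failed run.
const windowRuns = [
  { status: 'ok', costUsd: 0.02 },
  { status: 'error', costUsd: 0.04 },
  { status: 'ok', costUsd: 0.03 },
];

console.log(shouldPageOnCost(windowRuns, 0.04));
console.log(shouldPageOnCost(windowRuns, 0.05));
```

In practice the budget would come from a rolling baseline rather than a constant, and the window would be sized so that a single expensive run cannot page on its own.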
Review Cadence
- Review top cost traces weekly and decide whether routing, caching, or prompt shape should change.
- Sample failed tool traces daily until tool contracts and permissions stabilize.
- Run replay bundles before model, prompt, retriever, and tool schema changes.
- Audit redaction rules monthly with security and support examples.
- Delete expired telemetry and replay fixtures according to retention policy.
Frequently Asked Questions
What should I trace in an AI agent?
How do I log agent tool calls safely?
Where should token cost tracking live?
What is failure replay for agents?
Should I use OpenTelemetry or a dedicated LLM observability tool?