Agentic Observability Platforms [2026 Cheat Sheet]
Bottom Line
All three platforms now cover tracing, evaluation, and prompt iteration, so the real decision is workflow fit: LangSmith is the most integrated app-dev stack, Phoenix is the most OTEL-native and self-hosting-friendly, and W&B Weave is the cleanest bridge for teams already living in W&B.
Key Takeaways
- As of May 08, 2026, all three platforms support tracing plus evaluation workflows.
- Phoenix stands out for OTEL/OpenInference-first instrumentation and open-source self-hosting.
- LangSmith has the tightest built-in loop across tracing, datasets, prompts, and experiments.
- W&B Weave is strongest when your org already uses W&B projects, entities, and evaluation habits.
- If trace payloads may include secrets or PII, sanitize them before ingestion.
As of May 08, 2026, the practical gap between agentic observability platforms is no longer "who has tracing?" but "which operating model fits your team?" LangSmith, Arize Phoenix, and W&B Weave all cover traces and evals; the differences show up in instrumentation style, workflow shape, and how naturally each product fits your existing stack. This cheat sheet focuses on those decision points, plus copy-ready setup references.
The decision matrix below is inferred from official docs, not vendor marketing copy.
| Dimension | LangSmith | Arize Phoenix | W&B Weave | Edge |
|---|---|---|---|---|
| Core posture | Integrated agent app dev platform | OTEL-native observability and evals | Observability plus evals in W&B workflow | Depends |
| Tracing model | Projects, traces, runs, threads | OTLP spans, projects, sessions | Ops, calls, traces inside W&B projects | Phoenix |
| Prompt iteration | Playground, prompt engineering, Studio | Prompt management, playground, replay | Playground and version tracking | LangSmith |
| Evaluation loop | Offline and online evaluation workflows | Evals plus datasets and experiments | Scorers, judges, and production feedback | Depends |
| Self-host posture | Cloud, hybrid, self-hosted options documented | Strong OSS and self-host story | Docs center on W&B account and project workflow | Phoenix |
| Best fit | App teams shipping agent products fast | Infra-minded teams wanting open instrumentation | Teams already standardized on W&B | Depends |
At a Glance
What the official docs make clear
- LangSmith organizes work around observability, evaluation, prompt engineering, and deployment.
- Phoenix centers tracing, evaluation, prompt engineering, and datasets/experiments on top of OpenTelemetry.
- Weave centers tracing, evaluations, version tracking, feedback, and production monitoring for LLM apps.
One naming trap to fix early
- In 2026, “Arize” is ambiguous in public docs: Phoenix is the open-source observability product, while Arize AX is the enterprise AI engineering platform.
- If your team says “we use Arize,” pin down whether they mean Phoenix or AX before you wire instrumentation.
When to Choose Each
Choose LangSmith when:
- You want tracing, datasets, evaluators, prompts, and experiments to live in one workflow.
- You are already building with LangChain or LangGraph and want fast instrumentation via environment variables and wrappers.
- You care about both offline evaluation and online evaluation in a single product model.
- You want a more app-centric workflow than a raw telemetry-centric one.
Choose Phoenix when:
- You want the most explicit OTLP, OpenTelemetry, and OpenInference path.
- You want open-source posture and documented self-hosting on Docker, Kubernetes, or your own cloud.
- You want prompt replay, datasets, and experiments without giving up an infra-friendly tracing model.
- Your team prefers instrumentation that is easy to move or standardize across providers.
Choose Weave when:
- Your org already uses Weights & Biases teams, entities, or evaluation workflows.
- You want function-level tracking with `weave.op` and project-centric logging via `weave.init()`.
- You want prompts, versions, traces, scorers, and feedback to sit next to the rest of your W&B work.
- You need a lighter bridge from LLM app tracing into an existing W&B culture.
Setup Reference
Install and authenticate
LangSmith
```bash
pip install -U langsmith openai

export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="<your-langsmith-api-key>"
export OPENAI_API_KEY="<your-openai-api-key>"
export LANGSMITH_PROJECT="my-app"
```
Phoenix
```bash
pip install arize-phoenix openinference-instrumentation-openai openai
```
W&B Weave
```bash
pip install weave openai

export WANDB_API_KEY="<your_api_key>"
```
Basic tracing
LangSmith
```python
from openai import OpenAI
from langsmith.wrappers import wrap_openai
from langsmith import traceable

# Wrap the client so every completion call is captured as a child run
client = wrap_openai(OpenAI())

@traceable
def assistant(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content
```
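Assuming the environment variables from the install step are set, calling the wrapped function should land as a single trace in the configured project:
```python
# Each invocation is recorded as one trace under LANGSMITH_PROJECT
print(assistant("What does the LANGSMITH_PROJECT variable control?"))
```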
Phoenix
```python
from phoenix.otel import register

# Registers a tracer provider and auto-enables installed OpenInference instrumentors
tracer_provider = register(
    project_name="my-llm-app",
    auto_instrument=True,
)
```
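With the OpenAI instrumentor installed and auto_instrument=True, ordinary client calls should be traced without wrapper code; a minimal sketch:
```python
from openai import OpenAI

# No wrapping needed: the OpenInference instrumentor patches the client globally
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{"role": "user", "content": "Why use OTLP for traces?"}],
)
print(response.choices[0].message.content)
```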
W&B Weave
```python
import weave
from openai import OpenAI

client = OpenAI()
weave.init('your-team/traces-quickstart')

# Decorated functions are logged as ops with inputs, outputs, and timing
@weave.op()
def ask_model(prompt: str):
    return client.chat.completions.create(
        model='gpt-4o',
        messages=[{'role': 'user', 'content': prompt}],
    )
```
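Once weave.init() has run, each call to the decorated op should be logged with its inputs, output, and latency; for example:
```python
# The call, its arguments, and the returned completion are captured as one traced op
result = ask_model("What does weave.op capture?")
print(result.choices[0].message.content)
```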
CLI and query workflows
LangSmith CLI
```bash
langsmith project list
langsmith trace list --project my-app --limit 5
langsmith run list --project my-app --run-type llm --include-metadata
langsmith dataset list
langsmith experiment list --dataset my-eval-set
```
Phoenix TypeScript instrumentation
```bash
npm install @arizeai/openinference-instrumentation-openai
```
Weave TypeScript install
```bash
npm install weave openai
```
Configuration
LangSmith configuration notes
- `LANGSMITH_TRACING` enables tracing.
- `LANGSMITH_PROJECT` routes traces into a named project.
- `LANGSMITH_WORKSPACE_ID` is relevant when an API key is linked to multiple workspaces.
- `LANGSMITH_ENDPOINT` matters for self-hosted or hybrid deployments.
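A minimal Python equivalent of that configuration, with placeholder values, assuming it runs before any traced code is imported:
```python
import os

# Placeholders; LANGSMITH_WORKSPACE_ID and LANGSMITH_ENDPOINT are only needed
# for multi-workspace keys and self-hosted/hybrid deployments respectively
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGSMITH_PROJECT"] = "my-app"
os.environ["LANGSMITH_WORKSPACE_ID"] = "<workspace-id>"
os.environ["LANGSMITH_ENDPOINT"] = "https://api.smith.langchain.com"
```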
Phoenix configuration notes
- Phoenix docs now strongly emphasize using `phoenix.otel` and OTEL-aware defaults.
- `auto_instrument=True` activates installed OpenInference instrumentors automatically.
- For cloud setups, docs show `PHOENIX_API_KEY` and `PHOENIX_COLLECTOR_ENDPOINT` as the core connection values.
- If you batch spans, flush on shutdown so data is not left in the exporter queue.
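A hedged sketch of the cloud connection plus a clean shutdown, assuming register() returns an OTEL SDK tracer provider (so force_flush() and shutdown() are available):
```python
import os
from phoenix.otel import register

# Placeholder connection values for a Phoenix cloud instance
os.environ["PHOENIX_API_KEY"] = "<your-phoenix-api-key>"
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"

tracer_provider = register(project_name="my-llm-app", auto_instrument=True)

# ... run your app ...

# Drain batched spans before the process exits
tracer_provider.force_flush()
tracer_provider.shutdown()
```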
Weave configuration notes
- `WANDB_API_KEY` can log you in non-interactively.
- `weave.init('entity/project')` is the project-routing primitive that matters first.
- `WEAVE_PARALLELISM` controls worker parallelism.
- `WEAVE_PRINT_CALL_LINK=false` disables terminal call-link output.
```bash
# Weave runtime environment variables
export WEAVE_PARALLELISM=10
export WEAVE_PRINT_CALL_LINK=false
```
A `.env.example` for tracing, eval, and provider keys removes a lot of false platform friction during trials.
Advanced Usage
Evaluation strategy that travels across all three
- Start with a small hand-built dataset of failures, not a giant synthetic benchmark.
- Separate offline evaluation from online monitoring so regressions and live drift are not mixed together.
- Track retrieval quality, tool selection, formatting validity, and final answer quality as separate signals.
- Promote bad production traces into eval datasets quickly; all three platforms support some version of that feedback loop.
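As one concrete version of that promotion loop, here is a minimal LangSmith-flavored sketch; the dataset name and run ID are placeholders, and Phoenix and Weave have analogous dataset APIs:
```python
from langsmith import Client

client = Client()

# Create (or reuse) a dataset that collects real production failures
dataset = client.create_dataset(dataset_name="agent-failures")

# Copy a failing production run's inputs and outputs into the dataset
run = client.read_run("<failing-run-id>")
client.create_example(
    inputs=run.inputs,
    outputs=run.outputs,
    dataset_id=dataset.id,
)
```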
Where each platform pulls ahead in advanced workflows
- LangSmith: strongest when you want traces, evaluators, prompts, and experiment comparison in one application workflow.
- Phoenix: strongest when you want open instrumentation and portability around OTLP and OpenInference.
- Weave: strongest when you want LLM observability to feel like an extension of existing W&B evaluation and project habits.
Security and retention checklist
- Decide whether prompts, retrieved chunks, and tool arguments are safe to store before you enable tracing broadly.
- Mask secrets and identifiers at the edge, not after ingestion; a minimal sketch follows this list.
- Keep one documented policy for trace retention, evaluator retention, and dataset retention.
- If you share snippets across teams, run them through the Code Formatter before pasting into internal docs or runbooks.
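A deliberately minimal masking sketch, assuming regex patterns are acceptable for your data; a production deployment should use a vetted masking tool and audited patterns:
```python
import re

# Illustrative patterns only; extend and audit for your own data shapes
PATTERNS = [
    (re.compile(r"sk-[A-Za-z0-9]{20,}"), "<API_KEY>"),    # provider-style secret keys
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "<EMAIL>"),  # email addresses
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<SSN>"),      # US SSN-shaped strings
]

def scrub(text: str) -> str:
    """Mask known-sensitive shapes before a span payload leaves the process."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```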
Frequently Asked Questions
Is LangSmith or Arize Phoenix better for OpenTelemetry-first teams?
Phoenix: its docs center on `phoenix.otel` and OpenInference, while LangSmith is more application-workflow-centric.
Does W&B Weave replace LangSmith for agent evaluation?
Not necessarily. Weave covers tracing, scorers, and feedback well, but its main pull is for teams already operating inside W&B; LangSmith keeps the tighter built-in loop across traces, datasets, prompts, and experiments.
What does "Arize" mean in 2026: Phoenix or AX?
It depends on context: Phoenix is the open-source observability product, while Arize AX is the enterprise AI engineering platform. Pin down which one your team means before wiring instrumentation.
Which platform is easiest to self-host for agent observability?
Phoenix, with its open-source posture and documented Docker and Kubernetes deployments; LangSmith documents cloud, hybrid, and self-hosted options, while Weave's docs center on the hosted W&B workflow.
What data should I avoid sending into observability traces?
Secrets and PII. Decide whether prompts, retrieved chunks, and tool arguments are safe to store, and mask sensitive fields at the edge before ingestion.