AI & Cloud Infrastructure

Agent Observability: Tracing Decisions and Tool Calls

By Technspire Team · May 5, 2026

Standard application performance monitoring was built for HTTP requests, database queries, and external API calls. An agent run produces a different signal shape: a tree of decisions, with tool calls nested inside model calls, token accounting per branch, and hallucination signals that look nothing like ordinary errors. By 2026 the observability stack for agents has matured: the OpenTelemetry GenAI semantic conventions are stable, and tools like Langfuse, Helicone, and the Application Insights GenAI integration capture what matters.

What Agent Observability Has to Capture

  • Trace structure. The user request as the root span; each model call as a child span; each tool call as a grandchild. Visualised, the trace is a tree.
  • Token accounting per span. Input, output, cached. Aggregated to per-task and per-tenant totals.
  • Tool call detail. Tool name, arguments, result, duration. Failure mode classification.
  • Decision points. Which tools the model considered, which it picked, the confidence implied by stop reasons.
  • Hallucination signals. When the model claims information not present in the retrieved context, when citation references documents the retrieval did not return, when numeric values diverge from source content.
  • Cost attribution. Per task, per feature, per tenant. Critical for unit economics.

OpenTelemetry GenAI Semantic Conventions

The OpenTelemetry community has standardised attribute names for generative AI workloads. The conventions are stable as of late 2025 and supported by the major observability backends. Using them rather than vendor-specific naming makes the telemetry portable.

// Standard GenAI span attributes
{
  'gen_ai.system': 'anthropic', // or 'openai', 'azure-openai', etc.
  'gen_ai.request.model': 'claude-opus-4-7',
  'gen_ai.request.max_tokens': 1024,
  'gen_ai.response.model': 'claude-opus-4-7',
  'gen_ai.response.finish_reasons': ['tool_use'],
  'gen_ai.usage.input_tokens': 2143,
  'gen_ai.usage.output_tokens': 287,
  'gen_ai.usage.cached_tokens': 1856, // when prompt caching applies
  'gen_ai.operation.name': 'chat', // 'chat', 'text_completion', 'embeddings'
  // Application-specific extensions
  'agent.task_type': 'invoice-processing',
  'agent.step_index': 4,
  'agent.tenant_id': 'cust-419',
}

Instrumenting an Agent Loop

The pattern: one parent span for the agent run, child spans per model call, grandchild spans per tool invocation. Use the OpenTelemetry SDK to create spans manually since automatic instrumentation does not yet cover the full agent shape.

import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('app.agent');

// Assumes `anthropic` (SDK client), `model`, `tools`, `MAX_STEPS`, and
// `execute` (tool dispatcher) are defined elsewhere in the application.
export async function runAgent(taskType: string, userInput: string) {
  return tracer.startActiveSpan(
    'agent.run',
    { attributes: { 'agent.task_type': taskType } },
    async (parentSpan) => {
      try {
        const messages: any[] = [{ role: 'user', content: userInput }];
        for (let step = 0; step < MAX_STEPS; step++) {
          // Child span: one per model call, tagged with the step index.
          const response = await tracer.startActiveSpan(
            'llm.chat',
            { attributes: { 'gen_ai.system': 'anthropic', 'agent.step_index': step } },
            async (llmSpan) => {
              try {
                const r = await anthropic.messages.create({ model, max_tokens: 1024, tools, messages });
                llmSpan.setAttributes({
                  'gen_ai.response.finish_reasons': [r.stop_reason],
                  'gen_ai.usage.input_tokens': r.usage.input_tokens,
                  'gen_ai.usage.output_tokens': r.usage.output_tokens,
                  'gen_ai.usage.cached_tokens': r.usage.cache_read_input_tokens ?? 0,
                });
                return r;
              } finally {
                llmSpan.end(); // spans created via startActiveSpan must be ended explicitly
              }
            },
          );

          if (response.stop_reason === 'end_turn') break;
          messages.push({ role: 'assistant', content: response.content });

          for (const block of response.content) {
            if (block.type !== 'tool_use') continue;
            // Grandchild span: one per tool invocation.
            const result = await tracer.startActiveSpan(
              `tool.${block.name}`,
              { attributes: { 'agent.tool.name': block.name, 'agent.tool.id': block.id } },
              async (toolSpan) => {
                try {
                  const out = await execute(block.name, block.input);
                  toolSpan.setAttribute('agent.tool.success', true);
                  return out;
                } catch (err: any) {
                  toolSpan.setAttribute('agent.tool.success', false);
                  toolSpan.recordException(err);
                  throw err;
                } finally {
                  toolSpan.end();
                }
              },
            );
            messages.push({ role: 'user', content: [{ type: 'tool_result', tool_use_id: block.id, content: result }] });
          }
        }
        parentSpan.setAttribute('agent.completed', true);
      } finally {
        parentSpan.end();
      }
    },
  );
}

Hallucination Signals

Hallucinations rarely announce themselves. They look like normal output until you check. Three signals catch most of them:

  • Citation validation. If the agent emits citations, verify each one references a document the retrieval actually returned. A citation to a document not in the retrieval set is fabrication.
  • Numeric verification. Numbers in the output (torque values, prices, dates) must appear in source content. Compare via simple string match or regex extraction.
  • Refusal-rate drift. A model that suddenly stops refusing impossible questions has shifted into hallucination mode. Track refusal rate over time.

Each signal becomes a span attribute or a derived metric. Alert when the hallucination rate exceeds a threshold, when the refusal rate drops below baseline, or when citation validation fails.
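As a concrete sketch of the first signal, the check below validates citations against the retrieval set and records the result as span attributes. The [doc:ID] citation format and the agent.citations.* attribute names are assumptions for illustration, not part of any convention.

import { trace } from '@opentelemetry/api';

// Every [doc:ID] marker in the answer must point at a document that retrieval
// actually returned. Marker format and attribute names are illustrative.
export function recordCitationCheck(answer: string, retrievedDocIds: string[]) {
  const span = trace.getActiveSpan();
  const cited = [...answer.matchAll(/\[doc:([\w-]+)\]/g)].map((m) => m[1]);
  const fabricated = cited.filter((id) => !retrievedDocIds.includes(id));
  span?.setAttributes({
    'agent.citations.total': cited.length,
    'agent.citations.fabricated': fabricated.length,
    'agent.citations.valid': fabricated.length === 0,
  });
  return fabricated; // non-empty means the agent cited documents it never saw
}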

Tools: Langfuse, Helicone, Application Insights

  • Langfuse. Strongest agent-specific UI. Trace tree visualisation, prompt versioning, evaluation scoring, dataset management. Self-hostable.
  • Helicone. Easiest setup; routes through their proxy and gives you usage and cost dashboards. Less depth on agent traces; more depth on cost analysis.
  • Application Insights with GenAI integration. The Azure-native answer. Shares storage with your other telemetry; KQL queries combine agent and infrastructure data. Less polished agent UI than Langfuse.
  • Datadog and New Relic LLM monitoring. If your platform standard is one of these, use it. Less LLM-specific than Langfuse but enterprise-supported.

The pattern that scales well: emit OpenTelemetry traces, then export to Langfuse for engineering use and to Application Insights (or your APM) for platform correlation. Both backends consume the same emitted traces; the cost is one extra exporter configuration.
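A minimal sketch of that dual export, assuming a recent @opentelemetry/sdk-node that accepts a spanProcessors array; the endpoint URLs and auth headers are placeholders to be taken from each backend's documentation.

import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';

// One span processor per backend; both receive the same spans.
const sdk = new NodeSDK({
  serviceName: 'agent-service',
  spanProcessors: [
    new BatchSpanProcessor(
      new OTLPTraceExporter({
        url: process.env.LANGFUSE_OTLP_URL, // placeholder: Langfuse OTLP endpoint
        headers: { Authorization: process.env.LANGFUSE_AUTH ?? '' },
      }),
    ),
    new BatchSpanProcessor(
      new OTLPTraceExporter({
        url: process.env.APM_OTLP_URL, // placeholder: Application Insights / APM endpoint
        headers: { Authorization: process.env.APM_AUTH ?? '' },
      }),
    ),
  ],
});

sdk.start();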

Sampling and Privacy

Capturing every prompt and every tool argument verbatim risks logging customer PII. Three patterns help:

  • Hash references instead of raw values. Store the SHA-256 of the prompt or argument; correlate with a separate, access-controlled store that holds the raw content for audit purposes (see the sketch after this list).
  • Sampling. Capture full content on 10% of traces; aggregate metrics on 100%. Production debugging usually catches issues from the sampled set; metrics use the full set.
  • PII redaction at ingestion. Run a redaction step before exporting traces. Catches obvious cases (names, phone numbers, IBANs) but not every PII shape.
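A sketch of the hash-reference pattern from the first bullet above, using Node's built-in crypto module; the attribute names and the separate raw-content store are illustrative assumptions.

import { createHash } from 'node:crypto';
import type { Span } from '@opentelemetry/api';

// Record a stable reference to the prompt without logging its content.
export function recordPromptReference(span: Span, prompt: string) {
  const digest = createHash('sha256').update(prompt, 'utf8').digest('hex');
  span.setAttributes({
    'agent.prompt.sha256': digest, // correlates with the access-controlled raw store
    'agent.prompt.length': prompt.length,
  });
  return digest;
}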

Dashboards That Earn Their Place

  • Per-feature task success rate. Whether the agent completed without escalation, broken down by feature.
  • Cost per task and per tenant. Trend lines over time; alert on regressions.
  • Step-count distribution. p50 and p95 steps per task. Indicates task scope drift (see the metric sketch after this list).
  • Cache hit rate. Per feature; alert on degradation.
  • Tool call success rate. Per tool; alert on regressions.
  • Hallucination signal rate. Per feature; alert on spikes.
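The step-count, cost, and cache dashboards need metrics recorded at the end of each run. Below is a sketch using the OpenTelemetry metrics API; the meter and instrument names are illustrative, not standard.

import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('app.agent');

// Instruments behind the dashboards above; names are illustrative.
const stepsPerTask = meter.createHistogram('agent.steps_per_task', {
  description: 'Steps taken per agent task; feeds the p50/p95 step-count view',
});
const costPerTask = meter.createHistogram('agent.cost_per_task_usd', {
  description: 'Estimated model cost per task in USD',
});
const cachedTokens = meter.createCounter('agent.cache_read_tokens', {
  description: 'Prompt-cache read tokens; feeds the cache hit rate view',
});

// Called once per completed agent run.
export function recordRunMetrics(taskType: string, tenantId: string, steps: number, costUsd: number, cacheReads: number) {
  const attrs = { 'agent.task_type': taskType, 'agent.tenant_id': tenantId };
  stepsPerTask.record(steps, attrs);
  costPerTask.record(costUsd, attrs);
  cachedTokens.add(cacheReads, attrs);
}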

Alerting Patterns

Effective alerts on agent systems are different from API alerts. Latency-only alerts miss most agent issues. Three alert classes that catch real problems:

  • Cost regression. Cost per task above baseline by 50% for 30 minutes.
  • Quality regression. Citation validation failure rate above 5%, or refusal rate dropping by 30% from baseline.
  • Step-budget exhaustion. More than 10% of tasks hitting MAX_STEPS in the last hour.

Replay and Debugging

When an agent run goes wrong, the engineer needs to replay it. The trace gives you the input, the model responses, the tool results, and the decisions. With that, the failure can be reproduced in a sandbox where the engineer modifies the prompt or tool and watches the difference.

Langfuse and similar tools provide one-click replay. Without that affordance, engineers reconstruct from logs, which is slow and unreliable. Building this capability into the observability stack from day one pays back on every production debugging session.

The Agents That Are Operable

An observable agent is a debuggable agent. The teams that ship reliable agent systems are the ones that treat observability as a Day 1 concern: instrument before measuring quality, dashboard before scaling traffic, alert before something breaks. Agents are not magic systems; they are distributed systems with a probabilistic compute node at the centre. Distributed systems engineering practice applies. The work is not glamorous and the payoff is enormous.