LLM Observability: OpenTelemetry, Langfuse, App Insights
Production LLM applications generate a different observability signal shape than traditional services. Requests involve prompt templates, token usage, tool calls, retrieval steps, and probabilistic outputs. The flat "one span per HTTP request" model leaves most of it invisible. The stack that works in 2026 is OpenTelemetry for portable tracing, Langfuse for LLM-specific semantics, and Application Insights for platform correlation. This post wires them together.
The Three Roles
- OpenTelemetry. The portable transport layer. Captures spans, traces, and metrics in a vendor-neutral format. OTel's GenAI semantic conventions define standard attributes for LLM calls (model, input/output tokens, stop reason).
- Langfuse. The LLM-specific observability surface. Understands traces with multiple LLM and tool spans, attaches eval scores, renders prompts and completions in a reviewable form. Self-hostable or cloud.
- Application Insights. The Azure platform layer. Correlates LLM traces with HTTP requests, VM metrics, dependency calls, and alerts.
The Architecture
Instrument once with OpenTelemetry, export to both Langfuse and Application Insights. Langfuse becomes your day-to-day LLM engineering view. Application Insights becomes the operational view for the rest of the platform. The two share trace IDs, so a production incident can be traced from an HTTP 500 in Application Insights straight to the specific prompt and tool calls in Langfuse.
Setting It Up
```typescript
// instrumentation.ts (Node, OpenTelemetry SDK)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { AzureMonitorTraceExporter } from '@azure/monitor-opentelemetry-exporter';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';

// Langfuse's OTLP endpoint authenticates with basic auth built from the
// project's public/secret key pair.
const pk = process.env.LANGFUSE_PUBLIC_KEY;
const sk = process.env.LANGFUSE_SECRET_KEY;

const langfuseExporter = new OTLPTraceExporter({
  url: process.env.LANGFUSE_OTLP_ENDPOINT,
  headers: { Authorization: `Basic ${Buffer.from(`${pk}:${sk}`).toString('base64')}` },
});

const azureExporter = new AzureMonitorTraceExporter({
  connectionString: process.env.APPINSIGHTS_CONNECTION_STRING,
});

// One SDK, two span processors: every span is exported to both backends.
const sdk = new NodeSDK({
  spanProcessors: [
    new BatchSpanProcessor(langfuseExporter),
    new BatchSpanProcessor(azureExporter),
  ],
});
sdk.start();
```
Instrumenting an LLM Call With GenAI Semantics
```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
const tracer = trace.getTracer('app.llm');

export async function runPrompt(system: string, user: string) {
  return tracer.startActiveSpan('llm.chat', async (span) => {
    try {
      // Request-side GenAI attributes, set before the call.
      span.setAttributes({
        'gen_ai.system': 'anthropic',
        'gen_ai.request.model': 'claude-opus-4-7',
        'gen_ai.request.max_tokens': 1024,
        'gen_ai.prompt.0.role': 'system',
        'gen_ai.prompt.0.content': system,
        'gen_ai.prompt.1.role': 'user',
        'gen_ai.prompt.1.content': user,
      });
      const res = await anthropic.messages.create({ /* ... */ });
      // Response-side attributes: model, stop reason, token usage.
      span.setAttributes({
        'gen_ai.response.model': res.model,
        'gen_ai.response.finish_reasons': [res.stop_reason],
        'gen_ai.usage.input_tokens': res.usage.input_tokens,
        'gen_ai.usage.output_tokens': res.usage.output_tokens,
        'gen_ai.usage.cache_read_input_tokens': res.usage.cache_read_input_tokens ?? 0,
      });
      return res;
    } catch (err) {
      // A failed call still produces a span, marked as an error.
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```
Agent Traces: Nested Spans Matter
An agent trace is a tree: the top span is the user request; children are LLM calls, tool calls, and retrieval steps. The value of structured spans is exactly this hierarchy. Without it, you cannot distinguish the cost of a slow tool from the cost of a slow model call. Use OpenTelemetry context propagation to ensure each tool invocation creates a child span of the LLM call that triggered it.
Attaching Evaluation Scores
Langfuse's strongest feature is scoring. Attach numeric scores (from a CI eval run, a human reviewer, or a production heuristic) directly to a trace. Regressions become visible: an answer-relevancy score that drops after a prompt change is one hop from the diff in Langfuse's UI.
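One low-ceremony source of production scores is a pure helper that builds the score payload; the shape below (trace ID, score name, numeric value) mirrors Langfuse's score model, while the helper name and the length heuristic are illustrative assumptions. In practice you would hand the payload to the Langfuse SDK or ingestion API against the trace ID your spans already carry.

```typescript
// Payload shape for a score attached to a trace.
interface ScorePayload {
  traceId: string;
  name: string;
  value: number;
  comment?: string;
}

// Crude production heuristic (assumed for illustration): penalize empty
// answers, scale by length up to a cap. Real heuristics might check
// citations, refusals, or schema validity instead.
export function answerLengthScore(traceId: string, answer: string): ScorePayload {
  const value = answer.trim().length === 0 ? 0 : Math.min(1, answer.length / 200);
  return { traceId, name: 'answer-length', value };
}
```

Keeping the heuristic pure makes it unit-testable; the send step stays a one-liner at the call site.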
Cost and Token Dashboards
With GenAI semantic attributes on every LLM span, cost dashboards become arithmetic. Langfuse renders them natively; Application Insights users can build a KQL query over dependencies with GenAI attributes to break down spend by user, tenant, or feature.
```kusto
// KQL — tokens and approximate cost per tenant over the last 7 days
dependencies
| where timestamp > ago(7d) and name == "llm.chat"
| extend
    tenant = tostring(customDimensions["tenant.id"]),
    inTok = tolong(customDimensions["gen_ai.usage.input_tokens"]),
    outTok = tolong(customDimensions["gen_ai.usage.output_tokens"])
| summarize
    totalIn = sum(inTok),
    totalOut = sum(outTok),
    calls = count()
  by tenant
| extend estUsd = (totalIn * 3.0 + totalOut * 15.0) / 1e6 // indicative per-mtoken rates
| order by estUsd desc
```
Sampling and PII
Full prompt and completion logging in production can violate GDPR and DORA retention rules if it captures PII. Sample for eval purposes, strip on the span processor, or redirect full-content traces to a separate, access-controlled store with short retention. Never default to "capture everything" on a production system without a data-handling policy behind it.
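A sketch of the "strip on the span processor" option: a pure scrubber over span attributes, which a custom SpanProcessor could apply before spans reach an exporter. Keeping the scrubber pure makes the redaction rule unit-testable; the `[REDACTED]` placeholder and the key pattern are assumptions that match the GenAI attribute names used earlier.

```typescript
type Attrs = Record<string, unknown>;

// Matches the prompt/completion content attributes set in the LLM span
// above, e.g. "gen_ai.prompt.0.content". Token counts, models, and roles
// pass through untouched, so cost dashboards keep working.
const CONTENT_KEY = /^gen_ai\.(prompt|completion)\.\d+\.content$/;

export function redactGenAiContent(attrs: Attrs): Attrs {
  const out: Attrs = {};
  for (const [key, value] of Object.entries(attrs)) {
    out[key] = CONTENT_KEY.test(key) ? '[REDACTED]' : value;
  }
  return out;
}
```

A processor wrapping this function would run for the Application Insights export path while the access-controlled store receives the unredacted spans, giving each backend only the data its retention policy allows.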
The Short Stack
- OpenTelemetry Node or Python SDK, instrumented once at process start.
- GenAI semantic conventions on every LLM and tool span.
- Dual export: Langfuse (for LLM engineering), Application Insights (for platform correlation).
- Scores attached to traces from CI evals and production heuristics.
- Cost dashboards driven by GenAI attributes, with tenant and feature dimensions.
- PII-aware sampling and retention policy.