The Complete Guide to AI Observability
Engineering guide to AI observability in production: logging strategies, metrics collection, tracing AI calls, debugging prompts, and cost tracking.
Why AI Systems Need Different Observability
Here's the thing about AI systems: traditional monitoring doesn't cut it. When your REST API returns a 500 error, you know something broke. When your AI returns confidently wrong information, everything looks fine from a technical standpoint. Green dashboards, healthy latency, successful HTTP responses. But your users are getting nonsense.
We learned this the hard way. One of our clients had a customer support agent that started recommending competitors' products. No errors in the logs. No latency spikes. Just quietly giving terrible advice for three days before someone noticed. That's when we realized: observing AI isn't about checking if it's running. It's about checking if it's actually working.
Traditional monitoring tells you if your system is alive. AI observability tells you if your system is sane.
This guide covers everything we've learned about keeping AI systems observable. Not theory: the actual practices we use in production every day.
The Four Pillars of AI Observability
Let's break this down into what you actually need to track:
| Pillar | What It Covers | Why It Matters |
|---|---|---|
| Logging | Every prompt, response, and intermediate step | Debugging when things go wrong |
| Metrics | Latency, token usage, success rates, costs | Capacity planning and budgeting |
| Tracing | Full request lifecycle across services | Understanding complex AI workflows |
| Quality | Response accuracy, relevance, safety | Catching degradation before users do |
Most teams start with logging, realize they need metrics for cost control, add tracing when debugging gets painful, and finally implement quality monitoring after a bad incident. Save yourself the trouble and build all four from the start.
Logging: Your First Line of Defense
What to Log
Every AI interaction should capture:
const aiCallLog = {
// Identity
requestId: "uuid-v4",
sessionId: "user-session-id",
userId: "optional-user-identifier",
// Input
prompt: {
system: "You are a helpful assistant...",
user: "What's the refund policy?",
context: ["retrieved_doc_1", "retrieved_doc_2"]
},
// Model Configuration
model: "gpt-4-turbo",
temperature: 0.7,
maxTokens: 1000,
// Output
response: {
content: "Our refund policy allows...",
finishReason: "stop",
toolCalls: []
},
// Performance
latencyMs: 2340,
inputTokens: 456,
outputTokens: 234,
totalTokens: 690,
// Cost
estimatedCostUsd: 0.0116, // 456 input + 234 output tokens at gpt-4-turbo rates
// Metadata
timestamp: "2025-10-15T14:30:00Z",
environment: "production",
version: "1.2.3"
};
Structured Logging Implementation
Don't just dump strings to stdout. Structure your logs so you can actually query them:
interface AILogEntry {
level: 'debug' | 'info' | 'warn' | 'error';
event: string;
requestId: string;
data: {
model: string;
promptHash: string; // For grouping similar prompts
inputTokens: number;
outputTokens: number;
latencyMs: number;
success: boolean;
errorType?: string;
};
context?: {
userId?: string;
feature?: string;
experimentId?: string;
};
}
function logAICall(entry: AILogEntry) {
// Send to your logging infrastructure
// We use a combination of structured JSON logs + time-series metrics
console.log(JSON.stringify({
...entry,
timestamp: new Date().toISOString(),
service: 'ai-gateway'
}));
}
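The promptHash field above is what makes grouping similar prompts queryable. A minimal sketch of how it might be computed, assuming prompts are whitespace-normalized before hashing (the 16-character truncation is an arbitrary choice):

```typescript
import { createHash } from 'node:crypto';

// Hash the prompt so identical prompts group together in log queries,
// without storing the raw text in the main log stream.
function promptHash(prompt: string): string {
  const normalized = prompt.trim().replace(/\s+/g, ' ');
  return createHash('sha256').update(normalized).digest('hex').slice(0, 16);
}
```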
Logging Sensitive Data
Here's where it gets tricky. You need to log prompts for debugging, but prompts often contain user data. Our approach:
- Hash sensitive fields - Store a hash of PII, not the actual values
- Separate storage - Full prompts go to a restricted, encrypted store with short retention
- Sampling - Only log full prompts for a percentage of requests in production
- Redaction - Use regex patterns to strip common PII patterns before logging
const sensitivePatterns = [
/\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g, // Email
/\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g, // Phone
/\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b/g, // Credit card
];
function redactPII(text: string): string {
let redacted = text;
sensitivePatterns.forEach(pattern => {
redacted = redacted.replace(pattern, '[REDACTED]');
});
return redacted;
}
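The sampling rule from the list above works best when it's deterministic, so the same request always gets the same decision across retries. A sketch, assuming hash-based bucketing per requestId (the 1% default and the forceLog escape hatch are assumptions):

```typescript
import { createHash } from 'node:crypto';

// Decide whether to persist the full (redacted) prompt for this request.
// Hash-based sampling is deterministic: the same requestId always gets
// the same decision, which keeps retry logs consistent.
function shouldLogFullPrompt(
  requestId: string,
  samplePercent = 1,   // assumed default: keep full prompts for 1% of traffic
  forceLog = false     // errors or user-flagged responses always log
): boolean {
  if (forceLog) return true;
  const digest = createHash('sha256').update(requestId).digest();
  const bucket = digest.readUInt16BE(0) % 100; // stable bucket in 0..99
  return bucket < samplePercent;
}
```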
Metrics: Numbers That Actually Matter
Core Metrics to Track
| Metric | Type | What It Tells You |
|---|---|---|
| ai.request.latency | Histogram | How long calls take (p50, p95, p99) |
| ai.request.tokens.input | Counter | Input token consumption |
| ai.request.tokens.output | Counter | Output token consumption |
| ai.request.cost | Counter | Dollar cost per request |
| ai.request.success_rate | Gauge | Percentage of successful completions |
| ai.request.error_rate | Gauge | Failures by error type |
| ai.model.rate_limit_hits | Counter | How often you're being throttled |
| ai.cache.hit_rate | Gauge | Semantic cache effectiveness |
Setting Up Metrics Collection
Here's how we instrument our AI gateway:
import { Counter, Histogram, Gauge } from 'prom-client';
const aiLatency = new Histogram({
name: 'ai_request_latency_ms',
help: 'AI request latency in milliseconds',
labelNames: ['model', 'feature', 'status'],
buckets: [100, 250, 500, 1000, 2500, 5000, 10000]
});
const aiTokens = new Counter({
name: 'ai_tokens_total',
help: 'Total tokens consumed',
labelNames: ['model', 'type', 'feature'] // type: input/output
});
const aiCost = new Counter({
name: 'ai_cost_usd',
help: 'Estimated cost in USD',
labelNames: ['model', 'feature']
});
const aiErrorRate = new Gauge({
name: 'ai_error_rate',
help: 'AI request error rate',
labelNames: ['model', 'error_type']
});
async function instrumentedAICall(params: AICallParams) {
const startTime = Date.now();
try {
const result = await makeAICall(params);
const latency = Date.now() - startTime;
aiLatency.observe({
model: params.model,
feature: params.feature,
status: 'success'
}, latency);
aiTokens.inc({
model: params.model,
type: 'input',
feature: params.feature
}, result.usage.inputTokens);
aiTokens.inc({
model: params.model,
type: 'output',
feature: params.feature
}, result.usage.outputTokens);
const cost = calculateCost(params.model, result.usage);
aiCost.inc({
model: params.model,
feature: params.feature
}, cost);
return result;
} catch (error) {
aiLatency.observe({
model: params.model,
feature: params.feature,
status: 'error'
}, Date.now() - startTime);
throw error;
}
}
Cost Tracking: The Metric That Gets Executive Attention
Let's be honest: cost is usually what brings observability conversations to the table. Here's how to track it properly:
const MODEL_PRICING = {
'gpt-4-turbo': { input: 0.01, output: 0.03 }, // per 1K tokens
'gpt-4o': { input: 0.005, output: 0.015 },
'gpt-4o-mini': { input: 0.00015, output: 0.0006 },
'claude-3-opus': { input: 0.015, output: 0.075 },
'claude-3-sonnet': { input: 0.003, output: 0.015 },
'claude-3-haiku': { input: 0.00025, output: 0.00125 }
};
function calculateCost(model: string, usage: TokenUsage): number {
const pricing = MODEL_PRICING[model];
if (!pricing) return 0;
return (usage.inputTokens / 1000 * pricing.input) +
(usage.outputTokens / 1000 * pricing.output);
}
// Aggregate costs by feature, team, customer
interface CostAllocation {
feature: string;
team: string;
customerId?: string;
dailyCost: number;
monthlyProjection: number;
}
Build dashboards that show:
- Daily/weekly/monthly spend by model
- Cost per feature or use case
- Cost per customer (for B2B)
- Projected monthly spend based on current trajectory
- Anomaly detection for sudden cost spikes
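The projection item above is simple arithmetic over the daily cost totals. A sketch, assuming linear extrapolation over a 30-day month (a real dashboard might weight recent days more heavily or account for weekday seasonality):

```typescript
// Project month-end spend from the daily costs recorded so far.
function projectMonthlySpend(dailyCostsUsd: number[], daysInMonth = 30): number {
  if (dailyCostsUsd.length === 0) return 0;
  const spentSoFar = dailyCostsUsd.reduce((a, b) => a + b, 0);
  const avgDaily = spentSoFar / dailyCostsUsd.length;
  const daysRemaining = Math.max(0, daysInMonth - dailyCostsUsd.length);
  return spentSoFar + avgDaily * daysRemaining;
}
```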
Tracing: Following the Thread
AI workflows aren't single calls anymore. They're chains, agents, and complex multi-step processes. Tracing lets you follow a request through the entire system.
Implementing Distributed Tracing
import { trace, SpanKind, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('ai-service');
async function tracedAgentExecution(task: string, context: RequestContext) {
return tracer.startActiveSpan('agent.execute', async (span) => {
span.setAttributes({
'ai.task': task,
'ai.session_id': context.sessionId,
'ai.user_id': context.userId
});
try {
// Step 1: Planning
      const plan = await tracer.startActiveSpan('agent.plan', async (planSpan) => {
        try {
          const result = await planTask(task);
          planSpan.setAttributes({
            'ai.model': 'gpt-4-turbo',
            'ai.tokens.input': result.usage.input,
            'ai.tokens.output': result.usage.output,
            'ai.plan.steps': result.steps.length
          });
          return result;
        } finally {
          planSpan.end(); // startActiveSpan does not end spans automatically
        }
      });
      // Step 2: Execute each step
      for (const step of plan.steps) {
        await tracer.startActiveSpan(`agent.step.${step.type}`, async (stepSpan) => {
          try {
            stepSpan.setAttributes({
              'ai.step.type': step.type,
              'ai.step.tool': step.tool
            });
            if (step.type === 'llm_call') {
              await tracedLLMCall(step.params, stepSpan);
            } else if (step.type === 'tool_call') {
              await tracedToolCall(step.tool, step.params, stepSpan);
            }
          } finally {
            stepSpan.end();
          }
        });
      }
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error.message
      });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}
What Good Traces Show You
A well-instrumented AI trace reveals:
[Agent Execution] 4.2s
├── [Planning] 1.1s
│   └── [LLM Call: gpt-4] 1.0s - 234 tokens in, 156 tokens out
├── [Step 1: RAG Retrieval] 0.3s
│   ├── [Embedding Generation] 0.1s
│   └── [Vector Search] 0.2s - 5 documents retrieved
├── [Step 2: LLM Synthesis] 2.1s
│   └── [LLM Call: gpt-4] 2.0s - 1,456 tokens in, 523 tokens out
└── [Step 3: Response Formatting] 0.7s
    └── [LLM Call: gpt-4o-mini] 0.6s - 678 tokens in, 234 tokens out
Now when someone reports a slow response, you can see exactly where the time went.
Prompt Debugging: The Hard Part
This is where AI observability diverges most from traditional monitoring. How do you debug something that works differently every time?
Prompt Versioning
Treat prompts like code. Version them:
interface PromptVersion {
id: string;
name: string;
version: string;
template: string;
variables: string[];
model: string;
temperature: number;
createdAt: Date;
createdBy: string;
parentVersion?: string;
}
const promptRegistry = {
'customer-support-v2.3': {
id: 'cs-001',
name: 'Customer Support Agent',
version: '2.3',
template: `You are a helpful customer support agent for {{company_name}}.
Your role is to assist customers with their inquiries about {{product_area}}.
Guidelines:
- Always verify the customer's identity before discussing account details
- Never promise refunds without checking policy
- Escalate to human agent if customer expresses frustration
Customer query: {{query}}
Context: {{context}}`,
variables: ['company_name', 'product_area', 'query', 'context'],
model: 'gpt-4-turbo',
temperature: 0.3
}
};
A/B Testing Prompts
You can't improve what you don't measure. Run experiments on prompt variations:
interface PromptExperiment {
id: string;
name: string;
variants: {
id: string;
promptVersion: string;
trafficPercentage: number;
}[];
metrics: string[]; // What to measure
startDate: Date;
endDate?: Date;
}
function selectPromptVariant(experimentId: string, userId: string): string {
const experiment = getExperiment(experimentId);
// Deterministic assignment based on user ID
const hash = hashString(userId + experimentId);
const bucket = hash % 100;
let cumulative = 0;
for (const variant of experiment.variants) {
cumulative += variant.trafficPercentage;
if (bucket < cumulative) {
return variant.promptVersion;
}
}
return experiment.variants[0].promptVersion;
}
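The hashString helper above isn't shown; any stable, well-distributed string hash works for deterministic bucketing. One possible implementation, a 32-bit FNV-1a (non-cryptographic, which is fine here since we only need an even spread across buckets):

```typescript
// 32-bit FNV-1a: fast, deterministic, and stable across processes,
// which is exactly what experiment bucket assignment needs.
function hashString(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // force unsigned so the mod-100 bucket lands in 0..99
}
```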
Debugging Failed Responses
When an AI response goes wrong, you need to answer:
- What was the input? - Full prompt including system message, context, and user input
- What context was retrieved? - For RAG systems, what documents influenced the response
- What was the model's reasoning? - If using chain-of-thought, what steps did it take
- How did parameters affect output? - Temperature, top_p, frequency penalty
- Was this a one-off or pattern? - Search for similar inputs that produced similar failures
Build a debugging interface that lets you:
-- Find similar failures
SELECT
request_id,
prompt_hash,
response_content,
error_type,
timestamp
FROM ai_logs
WHERE
feature = 'customer-support'
AND (
response_content LIKE '%competitor%' -- Mentioned competitors
OR quality_score < 0.5 -- Low quality score
OR user_feedback = 'negative' -- User flagged
)
AND timestamp > NOW() - INTERVAL '7 days'
ORDER BY timestamp DESC
LIMIT 100;
Quality Monitoring: Is the AI Actually Good?
This is the hardest part of AI observability. Technical metrics can be green while the AI is producing garbage.
Automated Quality Checks
interface QualityCheck {
name: string;
check: (response: AIResponse, context: RequestContext) => QualityResult;
}
const qualityChecks: QualityCheck[] = [
{
name: 'response_length',
check: (response) => ({
pass: response.content.length > 50 && response.content.length < 5000,
score: normalizeLength(response.content.length),
reason: 'Response length within acceptable range'
})
},
{
name: 'no_hallucinated_urls',
check: (response) => {
const urls = extractUrls(response.content);
const validUrls = urls.filter(url => isKnownValidUrl(url));
return {
pass: urls.length === validUrls.length,
score: urls.length === 0 ? 1 : validUrls.length / urls.length,
reason: `${urls.length - validUrls.length} potentially hallucinated URLs`
};
}
},
{
name: 'factual_grounding',
check: (response, context) => {
// Check if key claims are supported by retrieved context
const claims = extractClaims(response.content);
const groundedClaims = claims.filter(claim =>
isClaimSupportedByContext(claim, context.retrievedDocuments)
);
return {
pass: groundedClaims.length / claims.length > 0.8,
score: groundedClaims.length / claims.length,
reason: `${groundedClaims.length}/${claims.length} claims grounded in context`
};
}
},
{
name: 'safety_check',
check: (response) => {
const safetyResult = runSafetyClassifier(response.content);
return {
pass: safetyResult.safe,
score: safetyResult.confidence,
reason: safetyResult.category || 'Response passed safety check'
};
}
}
];
async function evaluateResponse(
response: AIResponse,
context: RequestContext
): Promise<QualityReport> {
  const results = await Promise.all(
    qualityChecks.map(async (check) => ({
      check: check.name,
      // Await each check so async checks (e.g. a safety classifier call)
      // resolve before spreading their result
      ...(await check.check(response, context))
    }))
  );
return {
overallScore: average(results.map(r => r.score)),
allPassed: results.every(r => r.pass),
details: results
};
}
Human-in-the-Loop Evaluation
Automated checks catch obvious problems. For subtle quality issues, you need human review:
interface HumanEvaluationQueue {
// Sample a percentage of responses for human review
sampleRate: number;
// Always review certain types
alwaysReviewWhen: {
lowConfidence: boolean; // Model uncertainty
userFeedbackNegative: boolean;
automatedChecksFailed: boolean;
highValueCustomer: boolean;
};
// Evaluation criteria for reviewers
criteria: {
accuracy: 'Did the response contain correct information?';
relevance: 'Did the response address the user query?';
completeness: 'Was the response thorough enough?';
tone: 'Was the tone appropriate for the context?';
safety: 'Were there any concerning elements?';
};
}
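That configuration can drive a simple routing decision. A sketch, assuming the signals are computed upstream (the 2% sample rate and 0.6 confidence floor are illustrative defaults, and the injectable random source exists so the sampling path is testable):

```typescript
interface ReviewSignals {
  modelConfidence: number;        // 0..1, e.g. from logprobs or a verifier model
  userFeedbackNegative: boolean;
  automatedChecksFailed: boolean;
  highValueCustomer: boolean;
}

// Route a response to the human review queue when any always-review
// trigger fires; otherwise fall back to random sampling.
function shouldQueueForReview(
  signals: ReviewSignals,
  sampleRate = 0.02,              // assumed: review 2% of normal traffic
  confidenceFloor = 0.6,          // assumed threshold for "low confidence"
  random: () => number = Math.random
): boolean {
  if (signals.userFeedbackNegative) return true;
  if (signals.automatedChecksFailed) return true;
  if (signals.highValueCustomer) return true;
  if (signals.modelConfidence < confidenceFloor) return true;
  return random() < sampleRate;
}
```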
Alerting: Knowing When Things Go Wrong
Alert Thresholds for AI Systems
| Alert | Threshold | Severity | Action |
|---|---|---|---|
| Latency p95 > 10s | 5 min sustained | Warning | Investigate model provider |
| Error rate > 5% | 2 min sustained | Critical | Check API status, failover |
| Cost spike > 3x baseline | 1 hour | Warning | Review traffic, check for loops |
| Quality score drop > 20% | 1 hour | Critical | Pause feature, investigate |
| Rate limit hits > 10/min | 5 min | Warning | Scale back, check for abuse |
| Prompt injection detected | Any | Critical | Block request, review |
Implementing Smart Alerts
interface AIAlert {
name: string;
condition: (metrics: AIMetrics) => boolean;
severity: 'info' | 'warning' | 'critical';
cooldown: number; // Minutes before re-alerting
notification: {
slack?: string;
pagerduty?: string;
email?: string[];
};
}
const alerts: AIAlert[] = [
{
name: 'high_latency',
condition: (m) => m.latencyP95 > 10000,
severity: 'warning',
cooldown: 30,
notification: { slack: '#ai-alerts' }
},
{
name: 'quality_degradation',
condition: (m) => m.qualityScore < 0.7 && m.previousQualityScore > 0.85,
severity: 'critical',
cooldown: 60,
notification: {
slack: '#ai-alerts',
pagerduty: 'ai-oncall'
}
},
{
name: 'cost_anomaly',
condition: (m) => m.hourlyCost > m.expectedHourlyCost * 3,
severity: 'warning',
cooldown: 60,
notification: {
slack: '#ai-alerts',
email: ['ai-team@company.com']
}
}
];
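The cooldown field is what keeps a flapping condition from paging every minute. A minimal evaluation loop, with the AIAlert shape trimmed down and a metrics type inlined so the sketch stands alone (the notify stub is an assumption):

```typescript
type AlertDef = {
  name: string;
  condition: (m: { latencyP95: number }) => boolean;
  cooldown: number; // minutes before re-alerting
};

const lastFiredAt = new Map<string, number>();

// Evaluate alert conditions against the latest metrics snapshot,
// suppressing re-notification inside each alert's cooldown window.
function evaluateAlerts(
  alertDefs: AlertDef[],
  snapshot: { latencyP95: number },
  now: number = Date.now(),
  notify: (alert: AlertDef) => void = (a) => console.log(`ALERT: ${a.name}`)
): string[] {
  const fired: string[] = [];
  for (const alert of alertDefs) {
    if (!alert.condition(snapshot)) continue;
    const last = lastFiredAt.get(alert.name) ?? -Infinity;
    if (now - last < alert.cooldown * 60_000) continue; // still cooling down
    lastFiredAt.set(alert.name, now);
    notify(alert);
    fired.push(alert.name);
  }
  return fired;
}
```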
Building Your Observability Stack
Recommended Architecture
┌──────────────────────────────────────────────────────────┐
│                      AI Application                      │
├──────────────────────────────────────────────────────────┤
│                  Instrumentation Layer                   │
│ (OpenTelemetry SDK, Custom Metrics, Structured Logging)  │
└─────────────────────────────┬────────────────────────────┘
                              │
          ┌───────────────────┼───────────────────┐
          │                   │                   │
          ▼                   ▼                   ▼
  ┌───────────────┐   ┌───────────────┐   ┌───────────────┐
  │    Traces     │   │    Metrics    │   │     Logs      │
  │   (Jaeger/    │   │ (Prometheus/  │   │(Elasticsearch │
  │    Tempo)     │   │   Datadog)    │   │    /Loki)     │
  └───────┬───────┘   └───────┬───────┘   └───────┬───────┘
          │                   │                   │
          └───────────────────┼───────────────────┘
                              │
                              ▼
                      ┌───────────────┐
                      │  Dashboards   │
                      │   (Grafana)   │
                      └───────┬───────┘
                              │
                              ▼
                      ┌───────────────┐
                      │   Alerting    │
                      │  (PagerDuty/  │
                      │Slack/OpsGenie)│
                      └───────────────┘
Tools We Recommend
| Category | Open Source | Commercial |
|---|---|---|
| Tracing | Jaeger, Zipkin | Datadog, New Relic |
| Metrics | Prometheus + Grafana | Datadog, Dynatrace |
| Logging | ELK Stack, Loki | Splunk, Datadog |
| AI-Specific | Phoenix, Langfuse | LangSmith, Weights & Biases, Helicone |
| Alerting | Alertmanager | PagerDuty, OpsGenie |
Real-World Example: Full Observability Setup
Here's how we instrument a production AI feature end-to-end:
import { trace, metrics, SpanStatusCode } from '@opentelemetry/api';
import { logger } from './logging';
import { qualityChecker } from './quality';
import { costTracker } from './costs';
class ObservableAIService {
private tracer = trace.getTracer('ai-service');
private meter = metrics.getMeter('ai-service');
private latencyHistogram = this.meter.createHistogram('ai.latency');
private tokenCounter = this.meter.createCounter('ai.tokens');
private costCounter = this.meter.createCounter('ai.cost');
private qualityGauge = this.meter.createObservableGauge('ai.quality');
async complete(request: AIRequest): Promise<AIResponse> {
const span = this.tracer.startSpan('ai.complete');
const startTime = Date.now();
const requestId = generateRequestId();
span.setAttribute('request_id', requestId);
span.setAttribute('model', request.model);
span.setAttribute('feature', request.feature);
try {
// Log the request
logger.info('ai.request.start', {
requestId,
model: request.model,
feature: request.feature,
promptHash: hashPrompt(request.prompt),
inputTokenEstimate: estimateTokens(request.prompt)
});
// Make the AI call
const response = await this.makeAICall(request);
const latency = Date.now() - startTime;
const cost = costTracker.calculate(request.model, response.usage);
// Record metrics
this.latencyHistogram.record(latency, {
model: request.model,
feature: request.feature,
status: 'success'
});
this.tokenCounter.add(response.usage.inputTokens, {
model: request.model,
type: 'input'
});
this.tokenCounter.add(response.usage.outputTokens, {
model: request.model,
type: 'output'
});
this.costCounter.add(cost, {
model: request.model,
feature: request.feature
});
// Run quality checks
const quality = await qualityChecker.evaluate(response, request);
// Log the response
logger.info('ai.request.complete', {
requestId,
latencyMs: latency,
inputTokens: response.usage.inputTokens,
outputTokens: response.usage.outputTokens,
costUsd: cost,
qualityScore: quality.overallScore,
qualityPassed: quality.allPassed
});
// Store for debugging (with appropriate retention)
await this.storeForDebugging(requestId, request, response, quality);
span.setStatus({ code: SpanStatusCode.OK });
return response;
} catch (error) {
const latency = Date.now() - startTime;
this.latencyHistogram.record(latency, {
model: request.model,
feature: request.feature,
status: 'error'
});
logger.error('ai.request.error', {
requestId,
error: error.message,
errorType: error.constructor.name,
latencyMs: latency
});
span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
span.recordException(error);
throw error;
} finally {
span.end();
}
}
}
Getting Started: Your First Week
Day 1-2: Basic Logging
- Add structured logging to all AI calls
- Include: model, latency, token counts, feature name
- Store logs somewhere queryable
Day 3-4: Core Metrics
- Set up token and cost counters
- Create latency histograms by model and feature
- Build your first dashboard
Day 5: Alerting
- Alert on error rate spikes
- Alert on cost anomalies
- Alert on latency degradation
Week 2: Quality and Tracing
- Implement basic quality checks
- Add distributed tracing for multi-step AI workflows
- Start collecting user feedback
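Collecting user feedback can start as a thumbs-up/down signal joined to the requestId already in your logs. A sketch, with an in-memory store standing in for a real log sink:

```typescript
type Feedback = {
  requestId: string;
  rating: 'positive' | 'negative';
  comment?: string;
  at: string;
};

// In-memory store for illustration; production would write to the same
// queryable log store as the ai.request.* events, keyed by requestId.
const feedbackLog: Feedback[] = [];

function recordFeedback(
  requestId: string,
  rating: 'positive' | 'negative',
  comment?: string
): Feedback {
  const entry: Feedback = { requestId, rating, comment, at: new Date().toISOString() };
  feedbackLog.push(entry);
  return entry;
}

// Negative-feedback rate over the collected window, for the quality dashboard.
function negativeRate(): number {
  if (feedbackLog.length === 0) return 0;
  return feedbackLog.filter(f => f.rating === 'negative').length / feedbackLog.length;
}
```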
Conclusion
AI observability isn't optional anymore. As AI systems handle more critical workflows, you need to know not just if they're running, but if they're actually working correctly.
The good news: most of what you need can be built on top of existing observability infrastructure. OpenTelemetry, Prometheus, structured logging - these tools work for AI too. The difference is knowing what to measure and how to interpret it.
Start simple. Log everything. Track costs. Add quality checks. Build from there.
The best time to add observability was before you launched. The second best time is now.
We've helped teams go from "we have no idea what our AI is doing" to "we caught that issue in 3 minutes" in a matter of weeks. The investment pays for itself the first time you debug a production issue in minutes instead of hours.
If you're building AI systems and want to talk about observability strategies, reach out. We've seen a lot of failure modes and we're happy to share what we've learned.