The Complete Guide to AI Observability
Engineering guide to AI observability in production: logging strategies, metrics collection, tracing AI calls, debugging prompts, and cost tracking.
Why AI Systems Need Different Observability
Here's the thing about AI systems: traditional monitoring doesn't cut it. When your REST API returns a 500 error, you know something broke. When your AI returns confidently wrong information, everything looks fine from a technical standpoint. Green dashboards, healthy latency, successful HTTP responses. But your users are getting nonsense.
We learned this the hard way. One of our clients had a customer support agent that started recommending competitors' products. No errors in the logs. No latency spikes. Just quietly giving terrible advice for three days before someone noticed. That's when we realized: observing AI isn't about checking if it's running. It's about checking if it's actually working.
Traditional monitoring tells you if your system is alive. AI observability tells you if your system is sane.
This guide covers everything we've learned about keeping AI systems observable. Not theory: the actual practices we use in production every day.
The Four Pillars of AI Observability
Let's break this down into what you actually need to track:
| Pillar | What It Covers | Why It Matters |
|---|---|---|
| Logging | Every prompt, response, and intermediate step | Debugging when things go wrong |
| Metrics | Latency, token usage, success rates, costs | Capacity planning and budgeting |
| Tracing | Full request lifecycle across services | Understanding complex AI workflows |
| Quality | Response accuracy, relevance, safety | Catching degradation before users do |
Most teams start with logging, realize they need metrics for cost control, add tracing when debugging gets painful, and finally implement quality monitoring after a bad incident. Save yourself the trouble and build all four from the start.
Logging: Your First Line of Defense
What to Log
Every AI interaction should capture:
const aiCallLog = {
// Identity
requestId: "uuid-v4",
sessionId: "user-session-id",
userId: "optional-user-identifier",
// Input
prompt: {
system: "You are a helpful assistant...",
user: "What's the refund policy?",
context: ["retrieved_doc_1", "retrieved_doc_2"]
},
// Model Configuration
model: "gpt-4-turbo",
temperature: 0.7,
maxTokens: 1000,
// Output
response: {
content: "Our refund policy allows...",
finishReason: "stop",
toolCalls: []
},
// Performance
latencyMs: 2340,
inputTokens: 456,
outputTokens: 234,
totalTokens: 690,
// Cost
estimatedCostUsd: 0.0116, // 456 input + 234 output tokens at gpt-4-turbo rates
// Metadata
timestamp: "2025-10-15T14:30:00Z",
environment: "production",
version: "1.2.3"
};
Structured Logging Implementation
Don't just dump strings to stdout. Structure your logs so you can actually query them:
interface AILogEntry {
level: 'debug' | 'info' | 'warn' | 'error';
event: string;
requestId: string;
data: {
model: string;
promptHash: string; // For grouping similar prompts
inputTokens: number;
outputTokens: number;
latencyMs: number;
success: boolean;
errorType?: string;
};
context?: {
userId?: string;
feature?: string;
experimentId?: string;
};
}
function logAICall(entry: AILogEntry) {
// Send to your logging infrastructure
// We use a combination of structured JSON logs + time-series metrics
console.log(JSON.stringify({
...entry,
timestamp: new Date().toISOString(),
service: 'ai-gateway'
}));
}
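The promptHash field above is what makes grouping similar prompts queryable. A minimal sketch of how it might be computed, assuming prompts are whitespace-normalized before hashing (the 16-character truncation is an arbitrary choice):

```typescript
import { createHash } from 'node:crypto';

// Hash the prompt so identical prompts group together in log queries,
// without storing the raw text in the main log stream.
function promptHash(prompt: string): string {
  const normalized = prompt.trim().replace(/\s+/g, ' ');
  return createHash('sha256').update(normalized).digest('hex').slice(0, 16);
}
```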
Logging Sensitive Data
Here's where it gets tricky. You need to log prompts for debugging, but prompts often contain user data. Our approach:
- Hash sensitive fields - Store a hash of PII, not the actual values
- Separate storage - Full prompts go to a restricted, encrypted store with short retention
- Sampling - Only log full prompts for a percentage of requests in production
- Redaction - Use regex patterns to strip common PII patterns before logging
const sensitivePatterns = [
/\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g, // Email
/\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g, // Phone
/\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b/g, // Credit card
];
function redactPII(text: string): string {
let redacted = text;
sensitivePatterns.forEach(pattern => {
redacted = redacted.replace(pattern, '[REDACTED]');
});
return redacted;
}
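The sampling rule from the list above works best when it's deterministic, so the same request always gets the same decision across retries. A sketch, assuming hash-based bucketing per requestId (the 1% default and the forceLog escape hatch are assumptions):

```typescript
import { createHash } from 'node:crypto';

// Decide whether to persist the full (redacted) prompt for this request.
// Hash-based sampling is deterministic: the same requestId always gets
// the same decision, which keeps retry logs consistent.
function shouldLogFullPrompt(
  requestId: string,
  samplePercent = 1,   // assumed default: keep full prompts for 1% of traffic
  forceLog = false     // errors or user-flagged responses always log
): boolean {
  if (forceLog) return true;
  const digest = createHash('sha256').update(requestId).digest();
  const bucket = digest.readUInt16BE(0) % 100; // stable bucket in 0..99
  return bucket < samplePercent;
}
```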
Metrics: Numbers That Actually Matter
Core Metrics to Track
| Metric | Type | What It Tells You |
|---|---|---|
| ai.request.latency | Histogram | How long calls take (p50, p95, p99) |
| ai.request.tokens.input | Counter | Input token consumption |
| ai.request.tokens.output | Counter | Output token consumption |
| ai.request.cost | Counter | Dollar cost per request |
| ai.request.success_rate | Gauge | Percentage of successful completions |
| ai.request.error_rate | Gauge | Failures by error type |
| ai.model.rate_limit_hits | Counter | How often you're being throttled |
| ai.cache.hit_rate | Gauge | Semantic cache effectiveness |
Setting Up Metrics Collection
Here's how we instrument our AI gateway:
import { Counter, Histogram, Gauge } from 'prom-client';
const aiLatency = new Histogram({
name: 'ai_request_latency_ms',
help: 'AI request latency in milliseconds',
labelNames: ['model', 'feature', 'status'],
buckets: [100, 250, 500, 1000, 2500, 5000, 10000]
});
const aiTokens = new Counter({
name: 'ai_tokens_total',
help: 'Total tokens consumed',
labelNames: ['model', 'type', 'feature'] // type: input/output
});
const aiCost = new Counter({
name: 'ai_cost_usd',
help: 'Estimated cost in USD',
labelNames: ['model', 'feature']
});
const aiErrorRate = new Gauge({
name: 'ai_error_rate',
help: 'AI request error rate',
labelNames: ['model', 'error_type']
});
async function instrumentedAICall(params: AICallParams) {
const startTime = Date.now();
try {
const result = await makeAICall(params);
const latency = Date.now() - startTime;
aiLatency.observe({
model: params.model,
feature: params.feature,
status: 'success'
}, latency);
aiTokens.inc({
model: params.model,
type: 'input',
feature: params.feature
}, result.usage.inputTokens);
aiTokens.inc({
model: params.model,
type: 'output',
feature: params.feature
}, result.usage.outputTokens);
const cost = calculateCost(params.model, result.usage);
aiCost.inc({
model: params.model,
feature: params.feature
}, cost);
return result;
} catch (error) {
aiLatency.observe({
model: params.model,
feature: params.feature,
status: 'error'
}, Date.now() - startTime);
throw error;
}
}
Cost Tracking: The Metric That Gets Executive Attention
Let's be honest: cost is usually what brings observability conversations to the table. Here's how to track it properly:
const MODEL_PRICING = {
'gpt-4-turbo': { input: 0.01, output: 0.03 }, // per 1K tokens
'gpt-4o': { input: 0.005, output: 0.015 },
'gpt-4o-mini': { input: 0.00015, output: 0.0006 },
'claude-3-opus': { input: 0.015, output: 0.075 },
'claude-3-sonnet': { input: 0.003, output: 0.015 },
'claude-3-haiku': { input: 0.00025, output: 0.00125 }
};
function calculateCost(model: string, usage: TokenUsage): number {
const pricing = MODEL_PRICING[model];
if (!pricing) return 0;
return (usage.inputTokens / 1000 * pricing.input) +
(usage.outputTokens / 1000 * pricing.output);
}
// Aggregate costs by feature, team, customer
interface CostAllocation {
feature: string;
team: string;
customerId?: string;
dailyCost: number;
monthlyProjection: number;
}
Build dashboards that show:
- Daily/weekly/monthly spend by model
- Cost per feature or use case
- Cost per customer (for B2B)
- Projected monthly spend based on current trajectory
- Anomaly detection for sudden cost spikes
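The projection item above is simple arithmetic over the daily cost totals. A sketch, assuming linear extrapolation over a 30-day month (a real dashboard might weight recent days more heavily or account for weekday seasonality):

```typescript
// Project month-end spend from the daily costs recorded so far.
function projectMonthlySpend(dailyCostsUsd: number[], daysInMonth = 30): number {
  if (dailyCostsUsd.length === 0) return 0;
  const spentSoFar = dailyCostsUsd.reduce((a, b) => a + b, 0);
  const avgDaily = spentSoFar / dailyCostsUsd.length;
  const daysRemaining = Math.max(0, daysInMonth - dailyCostsUsd.length);
  return spentSoFar + avgDaily * daysRemaining;
}
```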
Tracing: Following the Thread
AI workflows aren't single calls anymore. They're chains, agents, and complex multi-step processes. Tracing lets you follow a request through the entire system.
Implementing Distributed Tracing
import { trace, SpanKind, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('ai-service');
async function tracedAgentExecution(task: string, context: RequestContext) {
return tracer.startActiveSpan('agent.execute', async (span) => {
span.setAttributes({
'ai.task': task,
'ai.session_id': context.sessionId,
'ai.user_id': context.userId
});
try {
// Step 1: Planning
      const plan = await tracer.startActiveSpan('agent.plan', async (planSpan) => {
        try {
          const result = await planTask(task);
          planSpan.setAttributes({
            'ai.model': 'gpt-4-turbo',
            'ai.tokens.input': result.usage.input,
            'ai.tokens.output': result.usage.output,
            'ai.plan.steps': result.steps.length
          });
          return result;
        } finally {
          planSpan.end(); // startActiveSpan does not end spans automatically
        }
      });
      // Step 2: Execute each step
      for (const step of plan.steps) {
        await tracer.startActiveSpan(`agent.step.${step.type}`, async (stepSpan) => {
          try {
            stepSpan.setAttributes({
              'ai.step.type': step.type,
              'ai.step.tool': step.tool
            });
            if (step.type === 'llm_call') {
              await tracedLLMCall(step.params, stepSpan);
            } else if (step.type === 'tool_call') {
              await tracedToolCall(step.tool, step.params, stepSpan);
            }
          } finally {
            stepSpan.end();
          }
        });
      }
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error.message
      });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}
What Good Traces Show You
A well-instrumented AI trace reveals:
[Agent Execution] 4.2s
├── [Planning] 1.1s
│   └── [LLM Call: gpt-4] 1.0s - 234 tokens in, 156 tokens out
├── [Step 1: RAG Retrieval] 0.3s
│   ├── [Embedding Generation] 0.1s
│   └── [Vector Search] 0.2s - 5 documents retrieved
├── [Step 2: LLM Synthesis] 2.1s
│   └── [LLM Call: gpt-4] 2.0s - 1,456 tokens in, 523 tokens out
└── [Step 3: Response Formatting] 0.7s
    └── [LLM Call: gpt-4o-mini] 0.6s - 678 tokens in, 234 tokens out
Now when someone reports a slow response, you can see exactly where the time went.
Prompt Debugging: The Hard Part
This is where AI observability diverges most from traditional monitoring. How do you debug something that works differently every time?
Prompt Versioning
Treat prompts like code. Version them:
interface PromptVersion {
id: string;
name: string;
version: string;
template: string;
variables: string[];
model: string;
temperature: number;
createdAt: Date;
createdBy: string;
parentVersion?: string;
}
const promptRegistry = {
'customer-support-v2.3': {
id: 'cs-001',
name: 'Customer Support Agent',
version: '2.3',
template: `You are a helpful customer support agent for {{company_name}}.
Your role is to assist customers with their inquiries about {{product_area}}.
Guidelines:
- Always verify the customer's identity before discussing account details
- Never promise refunds without checking policy
- Escalate to human agent if customer expresses frustration
Customer query: {{query}}
Context: {{context}}`,
variables: ['company_name', 'product_area', 'query', 'context'],
model: 'gpt-4-turbo',
temperature: 0.3
}
};
A/B Testing Prompts
You can't improve what you don't measure. Run experiments on prompt variations:
interface PromptExperiment {
id: string;
name: string;
variants: {
id: string;
promptVersion: string;
trafficPercentage: number;
}[];
metrics: string[]; // What to measure
startDate: Date;
endDate?: Date;
}
function selectPromptVariant(experimentId: string, userId: string): string {
const experiment = getExperiment(experimentId);
// Deterministic assignment based on user ID
const hash = hashString(userId + experimentId);
const bucket = hash % 100;
let cumulative = 0;
for (const variant of experiment.variants) {
cumulative += variant.trafficPercentage;
if (bucket < cumulative) {
return variant.promptVersion;
}
}
return experiment.variants[0].promptVersion;
}
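The hashString helper above isn't shown; any stable, well-distributed string hash works for deterministic bucketing. One possible implementation, a 32-bit FNV-1a (non-cryptographic, which is fine here since we only need an even spread across buckets):

```typescript
// 32-bit FNV-1a: fast, deterministic, and stable across processes,
// which is exactly what experiment bucket assignment needs.
function hashString(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // force unsigned so the mod-100 bucket lands in 0..99
}
```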
Debugging Failed Responses
When an AI response goes wrong, you need to answer:
- What was the input? - Full prompt including system message, context, and user input
- What context was retrieved? - For RAG systems, what documents influenced the response
- What was the model's reasoning? - If using chain-of-thought, what steps did it take
- How did parameters affect output? - Temperature, top_p, frequency penalty
- Was this a one-off or pattern? - Search for similar inputs that produced similar failures
Build a debugging interface that lets you:
-- Find similar failures
SELECT
request_id,
prompt_hash,
response_content,
error_type,
timestamp
FROM ai_logs
WHERE
feature = 'customer-support'
AND (
response_content LIKE '%competitor%' -- Mentioned competitors
OR quality_score < 0.5 -- Low quality score
OR user_feedback = 'negative' -- User flagged
)
AND timestamp > NOW() - INTERVAL '7 days'
ORDER BY timestamp DESC
LIMIT 100;
Quality Monitoring: Is the AI Actually Good?
This is the hardest part of AI observability. Technical metrics can be green while the AI is producing garbage.
Automated Quality Checks
interface QualityCheck {
name: string;
check: (response: AIResponse, context: RequestContext) => QualityResult;
}
const qualityChecks: QualityCheck[] = [
{
name: 'response_length',
check: (response) => ({
pass: response.content.length > 50 && response.content.length < 5000,
score: normalizeLength(response.content.length),
reason: 'Response length within acceptable range'
})
},
{
name: 'no_hallucinated_urls',
check: (response) => {
const urls = extractUrls(response.content);
const validUrls = urls.filter(url => isKnownValidUrl(url));
return {
pass: urls.length === validUrls.length,
score: urls.length === 0 ? 1 : validUrls.length / urls.length,
reason: `${urls.length - validUrls.length} potentially hallucinated URLs`
};
}
},
{
name: 'factual_grounding',
check: (response, context) => {
// Check if key claims are supported by retrieved context
const claims = extractClaims(response.content);
const groundedClaims = claims.filter(claim =>
isClaimSupportedByContext(claim, context.retrievedDocuments)
);
return {
pass: groundedClaims.length / claims.length > 0.8,
score: groundedClaims.length / claims.length,
reason: `${groundedClaims.length}/${claims.length} claims grounded in context`
};
}
},
{
name: 'safety_check',
check: (response) => {
const safetyResult = runSafetyClassifier(response.content);
return {
pass: safetyResult.safe,
score: safetyResult.confidence,
reason: safetyResult.category || 'Response passed safety check'
};
}
}
];
async function evaluateResponse(
response: AIResponse,
context: RequestContext
): Promise<QualityReport> {
  const results = await Promise.all(
    qualityChecks.map(async (check) => ({
      check: check.name,
      // Await each check so async checks (e.g. a safety classifier call)
      // resolve before spreading their result
      ...(await check.check(response, context))
    }))
  );
return {
overallScore: average(results.map(r => r.score)),
allPassed: results.every(r => r.pass),
details: results
};
}
Human-in-the-Loop Evaluation
Automated checks catch obvious problems. For subtle quality issues, you need human review:
interface HumanEvaluationQueue {
// Sample a percentage of responses for human review
sampleRate: number;
// Always review certain types
alwaysReviewWhen: {
lowConfidence: boolean; // Model uncertainty
userFeedbackNegative: boolean;
automatedChecksFailed: boolean;
highValueCustomer: boolean;
};
// Evaluation criteria for reviewers
criteria: {
accuracy: 'Did the response contain correct information?';
relevance: 'Did the response address the user query?';
completeness: 'Was the response thorough enough?';
tone: 'Was the tone appropriate for the context?';
safety: 'Were there any concerning elements?';
};
}
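That configuration can drive a simple routing decision. A sketch, assuming the signals are computed upstream (the 2% sample rate and 0.6 confidence floor are illustrative defaults, and the injectable random source exists so the sampling path is testable):

```typescript
interface ReviewSignals {
  modelConfidence: number;        // 0..1, e.g. from logprobs or a verifier model
  userFeedbackNegative: boolean;
  automatedChecksFailed: boolean;
  highValueCustomer: boolean;
}

// Route a response to the human review queue when any always-review
// trigger fires; otherwise fall back to random sampling.
function shouldQueueForReview(
  signals: ReviewSignals,
  sampleRate = 0.02,              // assumed: review 2% of normal traffic
  confidenceFloor = 0.6,          // assumed threshold for "low confidence"
  random: () => number = Math.random
): boolean {
  if (signals.userFeedbackNegative) return true;
  if (signals.automatedChecksFailed) return true;
  if (signals.highValueCustomer) return true;
  if (signals.modelConfidence < confidenceFloor) return true;
  return random() < sampleRate;
}
```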
Alerting: Knowing When Things Go Wrong
Alert Thresholds for AI Systems
| Alert | Threshold | Severity | Action |
|---|---|---|---|
| Latency p95 > 10s | 5 min sustained | Warning | Investigate model provider |
| Error rate > 5% | 2 min sustained | Critical | Check API status, failover |
| Cost spike > 3x baseline | 1 hour | Warning | Review traffic, check for loops |
| Quality score drop > 20% | 1 hour | Critical | Pause feature, investigate |
| Rate limit hits > 10/min | 5 min | Warning | Scale back, check for abuse |
| Prompt injection detected | Any | Critical | Block request, review |
Implementing Smart Alerts
interface AIAlert {
name: string;
condition: (metrics: AIMetrics) => boolean;
severity: 'info' | 'warning' | 'critical';
cooldown: number; // Minutes before re-alerting
notification: {
slack?: string;
pagerduty?: string;
email?: string[];
};
}
const alerts: AIAlert[] = [
{
name: 'high_latency',
condition: (m) => m.latencyP95 > 10000,
severity: 'warning',
cooldown: 30,
notification: { slack: '#ai-alerts' }
},
{
name: 'quality_degradation',
condition: (m) => m.qualityScore < 0.7 && m.previousQualityScore > 0.85,
severity: 'critical',
cooldown: 60,
notification: {
slack: '#ai-alerts',
pagerduty: 'ai-oncall'
}
},
{
name: 'cost_anomaly',
condition: (m) => m.hourlyCost > m.expectedHourlyCost * 3,
severity: 'warning',
cooldown: 60,
notification: {
slack: '#ai-alerts',
email: ['ai-team@company.com']
}
}
];
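The cooldown field is what keeps a flapping condition from paging every minute. A minimal evaluation loop, with the AIAlert shape trimmed down and a metrics type inlined so the sketch stands alone (the notify stub is an assumption):

```typescript
type AlertDef = {
  name: string;
  condition: (m: { latencyP95: number }) => boolean;
  cooldown: number; // minutes before re-alerting
};

const lastFiredAt = new Map<string, number>();

// Evaluate alert conditions against the latest metrics snapshot,
// suppressing re-notification inside each alert's cooldown window.
function evaluateAlerts(
  alertDefs: AlertDef[],
  snapshot: { latencyP95: number },
  now: number = Date.now(),
  notify: (alert: AlertDef) => void = (a) => console.log(`ALERT: ${a.name}`)
): string[] {
  const fired: string[] = [];
  for (const alert of alertDefs) {
    if (!alert.condition(snapshot)) continue;
    const last = lastFiredAt.get(alert.name) ?? -Infinity;
    if (now - last < alert.cooldown * 60_000) continue; // still cooling down
    lastFiredAt.set(alert.name, now);
    notify(alert);
    fired.push(alert.name);
  }
  return fired;
}
```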
Building Your Observability Stack
Recommended Architecture
┌──────────────────────────────────────────────────────────┐
│                      AI Application                      │
├──────────────────────────────────────────────────────────┤
│                  Instrumentation Layer                   │
│ (OpenTelemetry SDK, Custom Metrics, Structured Logging)  │
└─────────────────────────────┬────────────────────────────┘
                              │
          ┌───────────────────┼───────────────────┐
          │                   │                   │
          ▼                   ▼                   ▼
  ┌───────────────┐   ┌───────────────┐   ┌───────────────┐
  │    Traces     │   │    Metrics    │   │     Logs      │
  │   (Jaeger/    │   │ (Prometheus/  │   │(Elasticsearch │
  │    Tempo)     │   │   Datadog)    │   │    /Loki)     │
  └───────┬───────┘   └───────┬───────┘   └───────┬───────┘
          │                   │                   │
          └───────────────────┼───────────────────┘
                              │
                              ▼
                      ┌───────────────┐
                      │  Dashboards   │
                      │   (Grafana)   │
                      └───────┬───────┘
                              │
                              ▼
                      ┌───────────────┐
                      │   Alerting    │
                      │  (PagerDuty/  │
                      │Slack/OpsGenie)│
                      └───────────────┘
Tools We Recommend
| Category | Open Source | Commercial |
|---|---|---|
| Tracing | Jaeger, Zipkin | Datadog, New Relic |
| Metrics | Prometheus + Grafana | Datadog, Dynatrace |
| Logging | ELK Stack, Loki | Splunk, Datadog |
| AI-Specific | Phoenix, Langfuse | LangSmith, Weights & Biases, Helicone |
| Alerting | Alertmanager | PagerDuty, OpsGenie |
Real-World Example: Full Observability Setup
Here's how we instrument a production AI feature end-to-end:
import { trace, metrics, SpanStatusCode } from '@opentelemetry/api';
import { logger } from './logging';
import { qualityChecker } from './quality';
import { costTracker } from './costs';
class ObservableAIService {
private tracer = trace.getTracer('ai-service');
private meter = metrics.getMeter('ai-service');
private latencyHistogram = this.meter.createHistogram('ai.latency');
private tokenCounter = this.meter.createCounter('ai.tokens');
private costCounter = this.meter.createCounter('ai.cost');
private qualityGauge = this.meter.createObservableGauge('ai.quality');
async complete(request: AIRequest): Promise<AIResponse> {
const span = this.tracer.startSpan('ai.complete');
const startTime = Date.now();
const requestId = generateRequestId();
span.setAttribute('request_id', requestId);
span.setAttribute('model', request.model);
span.setAttribute('feature', request.feature);
try {
// Log the request
logger.info('ai.request.start', {
requestId,
model: request.model,
feature: request.feature,
promptHash: hashPrompt(request.prompt),
inputTokenEstimate: estimateTokens(request.prompt)
});
// Make the AI call
const response = await this.makeAICall(request);
const latency = Date.now() - startTime;
const cost = costTracker.calculate(request.model, response.usage);
// Record metrics
this.latencyHistogram.record(latency, {
model: request.model,
feature: request.feature,
status: 'success'
});
this.tokenCounter.add(response.usage.inputTokens, {
model: request.model,
type: 'input'
});
this.tokenCounter.add(response.usage.outputTokens, {
model: request.model,
type: 'output'
});
this.costCounter.add(cost, {
model: request.model,
feature: request.feature
});
// Run quality checks
const quality = await qualityChecker.evaluate(response, request);
// Log the response
logger.info('ai.request.complete', {
requestId,
latencyMs: latency,
inputTokens: response.usage.inputTokens,
outputTokens: response.usage.outputTokens,
costUsd: cost,
qualityScore: quality.overallScore,
qualityPassed: quality.allPassed
});
// Store for debugging (with appropriate retention)
await this.storeForDebugging(requestId, request, response, quality);
span.setStatus({ code: SpanStatusCode.OK });
return response;
} catch (error) {
const latency = Date.now() - startTime;
this.latencyHistogram.record(latency, {
model: request.model,
feature: request.feature,
status: 'error'
});
logger.error('ai.request.error', {
requestId,
error: error.message,
errorType: error.constructor.name,
latencyMs: latency
});
span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
span.recordException(error);
throw error;
} finally {
span.end();
}
}
}
Getting Started: Your First Week
Day 1-2: Basic Logging
- Add structured logging to all AI calls
- Include: model, latency, token counts, feature name
- Store logs somewhere queryable
Day 3-4: Core Metrics
- Set up token and cost counters
- Create latency histograms by model and feature
- Build your first dashboard
Day 5: Alerting
- Alert on error rate spikes
- Alert on cost anomalies
- Alert on latency degradation
Week 2: Quality and Tracing
- Implement basic quality checks
- Add distributed tracing for multi-step AI workflows
- Start collecting user feedback
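Collecting user feedback can start as a thumbs-up/down signal joined to the requestId already in your logs. A sketch, with an in-memory store standing in for a real log sink:

```typescript
type Feedback = {
  requestId: string;
  rating: 'positive' | 'negative';
  comment?: string;
  at: string;
};

// In-memory store for illustration; production would write to the same
// queryable log store as the ai.request.* events, keyed by requestId.
const feedbackLog: Feedback[] = [];

function recordFeedback(
  requestId: string,
  rating: 'positive' | 'negative',
  comment?: string
): Feedback {
  const entry: Feedback = { requestId, rating, comment, at: new Date().toISOString() };
  feedbackLog.push(entry);
  return entry;
}

// Negative-feedback rate over the collected window, for the quality dashboard.
function negativeRate(): number {
  if (feedbackLog.length === 0) return 0;
  return feedbackLog.filter(f => f.rating === 'negative').length / feedbackLog.length;
}
```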
Conclusion
AI observability isn't optional anymore. As AI systems handle more critical workflows, you need to know not just if they're running, but if they're actually working correctly.
The good news: most of what you need can be built on top of existing observability infrastructure. OpenTelemetry, Prometheus, structured logging - these tools work for AI too. The difference is knowing what to measure and how to interpret it.
Start simple. Log everything. Track costs. Add quality checks. Build from there.
The best time to add observability was before you launched. The second best time is now.
We've helped teams go from "we have no idea what our AI is doing" to "we caught that issue in 3 minutes" in a matter of weeks. The investment pays for itself the first time you debug a production issue in minutes instead of hours.
If you're building AI systems and want to talk about observability strategies, reach out. We've seen a lot of failure modes and we're happy to share what we've learned.