Technical Guide

Observability That Helps at 3am: Logs, Traces, and What Actually Matters

Production observability beyond dashboards. Structured logging, correlation IDs, PII-safe logs, alert fatigue prevention, cost management, and the observability maturity model.

April 10, 2026 · 14 min read · Oronts Engineering Team

The Three Pillars Are Wrong

Every observability talk starts with "the three pillars: logs, metrics, and traces." That framing is incomplete. It tells you what to collect but not what to do with it. The real question at 3am is not "do I have logs?" It's "can I find the one log line that explains why this customer's order failed, across 7 services, in under 2 minutes?"

We run production systems with multiple services, background workers, message queues, and AI pipelines. This article covers what actually helps when things break. For OpenTelemetry-specific implementation patterns, see our OpenTelemetry guide. For AI-specific observability, see our AI observability guide.

Structured Logging: The Format That Saves You

Unstructured logs are useless at scale. console.log('Processing order ' + orderId) becomes noise in a system producing 10,000 log lines per minute.

Structured logs are queryable, filterable, and alertable:

// Unstructured (useless at scale)
console.log('Processing order 12345 for customer sara@beispiel.de');

// Structured (queryable, PII-free)
logger.info('order_processing_started', {
    order_id: 'ord_12345',
    customer_id: 'cust_abc',  // ID, not email
    channel: 'web',
    tenant_id: 'tenant_acme',
    items_count: 3,
    total_cents: 15900,
});
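The `logger` used in these examples is assumed, not a specific library; a minimal sketch of such a structured logger might emit one JSON object per line so the aggregator can index every field:

```javascript
// Minimal structured-logger sketch (illustrative, not a specific library).
// Each call emits a single JSON line: timestamp, level, event name, fields.
function createLogger(stream = process.stdout) {
    const emit = (level, event, fields = {}) => {
        const entry = {
            timestamp: new Date().toISOString(),
            level,
            event,
            ...fields,
        };
        stream.write(JSON.stringify(entry) + '\n');
        return entry;
    };
    return {
        error: (event, fields) => emit('error', event, fields),
        warn: (event, fields) => emit('warn', event, fields),
        info: (event, fields) => emit('info', event, fields),
        debug: (event, fields) => emit('debug', event, fields),
    };
}
```

Because every entry is a flat JSON object, queries like `event = "order_processing_started" AND tenant_id = "tenant_acme"` work in any log backend without regex parsing.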

Log Level Discipline

| Level | When to Use | Example |
| --- | --- | --- |
| error | Something broke that needs human attention | Database connection failed, payment declined |
| warn | Something unexpected but handled | Fallback model used, cache miss, retry succeeded |
| info | Business events that matter for auditing | Order created, user logged in, import completed |
| debug | Technical details for development | SQL query executed, cache hit, config loaded |

Production should run at info level. debug generates too much volume. error and warn should always trigger review (if not alerts).
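Level discipline is easy to enforce mechanically. A sketch of the usual numeric-severity filter (names here are illustrative, not from a specific library):

```javascript
// Map levels to numeric severities so "minimum level" is a single comparison.
const LEVELS = { debug: 10, info: 20, warn: 30, error: 40 };

// Returns true if an entry at `entryLevel` should be emitted when the
// process is configured with `minLevel` (e.g. 'info' in production).
function shouldLog(entryLevel, minLevel = 'info') {
    return LEVELS[entryLevel] >= LEVELS[minLevel];
}
```

With `minLevel` read from an environment variable, debug logs never leave the process in production but remain one config change away during an incident.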

No PII in Logs

Your log aggregator (Datadog, CloudWatch, Loki, Elasticsearch) indexes everything. If logs contain emails, names, or phone numbers, your log infrastructure becomes GDPR-regulated.

// Bad: PII in logs
logger.info('Email sent to sara.mustermann@beispiel.de about order 12345');

// Good: IDs only
logger.info('notification_sent', {
    recipient_id: 'cust_abc',
    notification_type: 'order_confirmation',
    order_id: 'ord_12345',
    channel: 'email',
});

For the full PII protection architecture, see our data leakage prevention guide.
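A defense-in-depth option is a redaction pass over log fields before they leave the process. This is a sketch under the assumption that PII lives in known field names or email-shaped strings; `redact`, `PII_FIELDS`, and the regex are illustrative:

```javascript
// Field names that should never reach the log aggregator (assumed list).
const PII_FIELDS = new Set(['email', 'name', 'phone', 'address']);
// Rough email matcher; good enough as a safety net, not a validator.
const EMAIL_RE = /[^\s@]+@[^\s@]+\.[^\s@]+/g;

function redact(fields) {
    const clean = {};
    for (const [key, value] of Object.entries(fields)) {
        if (PII_FIELDS.has(key)) continue; // drop known PII fields entirely
        clean[key] = typeof value === 'string'
            ? value.replace(EMAIL_RE, '[redacted]') // mask stray email values
            : value;
    }
    return clean;
}
```

Running every log entry through a pass like this catches the PII that slips in through free-text fields, but it is a backstop: the primary rule remains "log IDs, not values."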

Correlation IDs: Tracing a Request Across Services

A single user request might touch an API gateway, an auth service, a product service, a payment service, a notification worker, and a search indexer. Without a correlation ID, finding all log entries for one request requires guessing timestamps and grep.

// Generate at the edge (API gateway or first service)
const correlationId = crypto.randomUUID();

// Pass through every service via headers
const response = await httpClient.post('/api/orders', body, {
    headers: { 'X-Correlation-Id': correlationId },
});

// Include in every log entry
logger.info('order_created', {
    correlation_id: correlationId,
    order_id: order.id,
    tenant_id: ctx.tenantId,
});

// Include in every message sent to queues
await queue.add('send_confirmation', {
    orderId: order.id,
    correlationId,
});

When debugging, query by correlation ID:

correlation_id = "a1b2c3d4-e5f6-7890-abcd-ef1234567890"

Every log entry from every service for that request appears. The full chain is visible. This is the single most impactful debugging tool in distributed systems.

Alert Fatigue: Fewer Alerts, Better Alerts

The default approach: alert on everything. Error rate > 0? Alert. Latency > 500ms? Alert. Queue depth > 10? Alert. The result: 200 alerts per day, all ignored.

Alert Design Principles

| Principle | Bad Alert | Good Alert |
| --- | --- | --- |
| Actionable | "Error occurred" | "Payment service error rate > 5% for 10 minutes, affecting checkout" |
| Threshold with duration | "Latency > 500ms" (fires on every spike) | "p95 latency > 2s for 5 consecutive minutes" |
| Business impact | "CPU > 80%" | "Order processing delayed: queue depth growing for 15 minutes" |
| Not duplicate | Same alert fires 50 times | Alert fires once, includes count of occurrences |

Alert Tiers

| Tier | Response | Channel | Example |
| --- | --- | --- | --- |
| P1 Critical | Immediate (wake someone up) | PagerDuty/phone | Payment processing down, data loss risk |
| P2 High | Within 1 hour (business hours) | Slack + email | Error rate elevated, degraded performance |
| P3 Medium | Next business day | Slack | Dead letter queue growing, non-critical job failing |
| P4 Low | Weekly review | Dashboard | Disk usage trending up, certificate expiring in 30 days |

The goal: P1 alerts fire less than once per week. If they fire daily, they're either misconfigured or your system has fundamental reliability issues.
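The "threshold with duration" principle from the table above is simple to implement: fire only when every sample in a sliding window breaches the threshold, so single spikes stay quiet. A sketch (function and parameter names are illustrative, not from a specific monitoring tool):

```javascript
// Returns true only when the last `windowSize` samples ALL exceed the
// threshold -- i.e. the breach has been sustained, not a one-off spike.
function sustainedBreach(samples, threshold, windowSize) {
    if (samples.length < windowSize) return false;
    return samples.slice(-windowSize).every((v) => v > threshold);
}
```

With per-minute p95 latency samples and `windowSize = 5`, this encodes "p95 > 2s for 5 consecutive minutes" rather than alerting on every transient spike.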

Cost Management

Observability is expensive. Log ingestion, metric storage, trace retention, and dashboard hosting add up fast.

| Component | Cost Driver | Optimization |
| --- | --- | --- |
| Log ingestion | Volume (GB/day) | Filter debug logs in production, sample verbose logs |
| Log retention | Duration | 30 days hot, 90 days warm, archive cold |
| Metrics | Cardinality (unique label combinations) | Avoid high-cardinality labels (user IDs, request IDs) |
| Traces | Volume + retention | Tail-based sampling (keep errors, drop routine) |
| Dashboards | User seats + queries | Consolidate dashboards, remove unused ones |
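Cardinality control is worth enforcing in code, not convention. A hypothetical label sanitizer that allows only a fixed set of label keys and buckets status codes into classes (all names here are assumptions for illustration):

```javascript
// Only these label keys may reach the metrics backend; everything else
// (user_id, request_id, ...) is dropped to keep cardinality bounded.
const ALLOWED_LABELS = new Set(['endpoint', 'method', 'status_class', 'tenant_id']);

function boundedLabels(raw) {
    const labels = {};
    for (const [key, value] of Object.entries(raw)) {
        if (key === 'status') {
            labels.status_class = `${String(value)[0]}xx`; // 200 -> "2xx", 404 -> "4xx"
        } else if (ALLOWED_LABELS.has(key)) {
            labels[key] = value;
        }
        // unknown keys are silently dropped
    }
    return labels;
}
```

Bucketing status codes into five classes instead of dozens of codes, and refusing per-user or per-request labels, keeps the time-series count proportional to your endpoints and tenants rather than your traffic.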

Reduce Log Volume Without Losing Signal

// Don't log every health check
if (req.path === '/health' || req.path === '/ready') {
    return next(); // Skip logging
}

// Sample verbose operations
if (req.path.startsWith('/api/search') && Math.random() > 0.1) {
    ctx.skipLogging = true; // Log only 10% of search requests
}

// Always log errors, business events, and slow requests
logger.info('request_completed', {
    path: req.path,
    status: res.statusCode,
    duration_ms: duration,
    // Only include detailed fields for slow or error requests
    ...(duration > 1000 || res.statusCode >= 400 ? {
        query_params: sanitize(req.query),
        response_size: res.contentLength,
    } : {}),
});

The Observability Maturity Model

| Level | Capability | What You Can Answer |
| --- | --- | --- |
| L0: None | console.log in production | "Something is broken" (maybe) |
| L1: Logs | Structured logging, centralized | "What happened?" (with grep) |
| L2: Metrics | Key metrics, basic dashboards | "Is the system healthy right now?" |
| L3: Correlation | Correlation IDs, distributed tracing | "What happened to this specific request?" |
| L4: Alerting | Tiered alerts, runbooks | "Something is breaking and we know about it immediately" |
| L5: Proactive | Anomaly detection, SLO-based alerts, cost tracking | "Something is about to break" |

Most teams are at L1-L2. L3 (correlation IDs) is the biggest single improvement. L4 (good alerting) prevents burnout. L5 is aspirational but achievable.

Common Pitfalls

  1. Unstructured logs. console.log with string concatenation is noise at scale. Use structured JSON with typed fields.

  2. PII in logs. Your log aggregator becomes GDPR-regulated. Log IDs, not values.

  3. Alert on everything. 200 alerts per day means zero alerts are read. Tier alerts by severity and business impact.

  4. High-cardinality metric labels. Using user IDs or request IDs as metric labels creates millions of time series. Use bounded labels (status code, endpoint, tenant).

  5. No correlation IDs. Without them, debugging a request across 7 services requires timestamp guessing and prayer.

  6. No log sampling. Logging every health check and search request at full verbosity doubles your log bill. Sample verbose operations, always log errors.

  7. No cost budget for observability. Observability spend should be 5-15% of infrastructure spend. Track it. Optimize it.

  8. Dashboards nobody looks at. If a dashboard hasn't been viewed in 30 days, delete it. Dashboard sprawl adds cost and confusion.

Key Takeaways

  • Structured logging is non-negotiable. JSON with typed fields. Queryable, filterable, alertable. Never string concatenation.

  • Correlation IDs are the most impactful debugging tool. Generate at the edge, propagate through every service and queue. Query by correlation ID to see the full request chain.

  • No PII in logs, ever. Log customer IDs, not customer emails. Log order IDs, not order contents. Your log infrastructure must not become a data protection liability.

  • Fewer alerts, better alerts. P1 alerts should fire less than once per week. Every alert must be actionable. Include business impact in the alert message.

  • Budget your observability. Log volume, metric cardinality, trace sampling, and retention policies all affect cost. Track observability spend as a percentage of infrastructure spend.

We build observability into every system we deploy across AI services, cloud infrastructure, and custom software. If you need help with production observability, talk to our team or request a quote.

Topics covered

observability, structured logging, distributed tracing, production monitoring, alert fatigue, correlation IDs, PII-safe logging, observability cost
