Technical Guide

Observability That Helps at 3am: Logs, Traces, and What Actually Matters

Production observability beyond dashboards. Structured logging, correlation IDs, PII-safe logs, alert fatigue prevention, cost management, and the observability maturity model.

April 10, 2026 · 14 min read · Oronts Engineering Team

The Three Pillars Are Wrong

Every observability talk starts with "the three pillars: logs, metrics, and traces." That framing is incomplete. It tells you what to collect but not what to do with it. The real question at 3am is not "do I have logs?" It's "can I find the one log line that explains why this customer's order failed, across 7 services, in under 2 minutes?"

We run production systems with multiple services, background workers, message queues, and AI pipelines. This article covers what actually helps when things break. For OpenTelemetry-specific implementation patterns, see our OpenTelemetry guide. For AI-specific observability, see our AI observability guide.

Structured Logging: The Format That Saves You

Unstructured logs are useless at scale. console.log('Processing order ' + orderId) becomes noise in a system producing 10,000 log lines per minute.

Structured logs are queryable, filterable, and alertable:

// Unstructured (useless at scale)
console.log('Processing order 12345 for customer sara@beispiel.de');

// Structured (queryable, PII-free)
logger.info('order_processing_started', {
    order_id: 'ord_12345',
    customer_id: 'cust_abc',  // ID, not email
    channel: 'web',
    tenant_id: 'tenant_acme',
    items_count: 3,
    total_cents: 15900,
});
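The `logger` used in these examples is assumed, not a specific library; a minimal sketch of such a structured logger might emit one JSON object per line so the aggregator can index every field:

```javascript
// Minimal structured-logger sketch (illustrative, not a specific library).
// Each call emits a single JSON line: timestamp, level, event name, fields.
function createLogger(stream = process.stdout) {
    const emit = (level, event, fields = {}) => {
        const entry = {
            timestamp: new Date().toISOString(),
            level,
            event,
            ...fields,
        };
        stream.write(JSON.stringify(entry) + '\n');
        return entry;
    };
    return {
        error: (event, fields) => emit('error', event, fields),
        warn: (event, fields) => emit('warn', event, fields),
        info: (event, fields) => emit('info', event, fields),
        debug: (event, fields) => emit('debug', event, fields),
    };
}
```

Because every entry is a flat JSON object, queries like `event = "order_processing_started" AND tenant_id = "tenant_acme"` work in any log backend without regex parsing.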

Log Level Discipline

| Level | When to Use | Example |
| --- | --- | --- |
| error | Something broke that needs human attention | Database connection failed, payment declined |
| warn | Something unexpected but handled | Fallback model used, cache miss, retry succeeded |
| info | Business events that matter for auditing | Order created, user logged in, import completed |
| debug | Technical details for development | SQL query executed, cache hit, config loaded |

Production should run at info level. debug generates too much volume. error and warn should always trigger review (if not alerts).
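Level discipline is easy to enforce mechanically. A sketch of the usual numeric-severity filter (names here are illustrative, not from a specific library):

```javascript
// Map levels to numeric severities so "minimum level" is a single comparison.
const LEVELS = { debug: 10, info: 20, warn: 30, error: 40 };

// Returns true if an entry at `entryLevel` should be emitted when the
// process is configured with `minLevel` (e.g. 'info' in production).
function shouldLog(entryLevel, minLevel = 'info') {
    return LEVELS[entryLevel] >= LEVELS[minLevel];
}
```

With `minLevel` read from an environment variable, debug logs never leave the process in production but remain one config change away during an incident.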

No PII in Logs

Your log aggregator (Datadog, CloudWatch, Loki, Elasticsearch) indexes everything. If logs contain emails, names, or phone numbers, your log infrastructure becomes GDPR-regulated.

// Bad: PII in logs
logger.info('Email sent to sara.mustermann@beispiel.de about order 12345');

// Good: IDs only
logger.info('notification_sent', {
    recipient_id: 'cust_abc',
    notification_type: 'order_confirmation',
    order_id: 'ord_12345',
    channel: 'email',
});

For the full PII protection architecture, see our data leakage prevention guide.
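A defense-in-depth option is a redaction pass over log fields before they leave the process. This is a sketch under the assumption that PII lives in known field names or email-shaped strings; `redact`, `PII_FIELDS`, and the regex are illustrative:

```javascript
// Field names that should never reach the log aggregator (assumed list).
const PII_FIELDS = new Set(['email', 'name', 'phone', 'address']);
// Rough email matcher; good enough as a safety net, not a validator.
const EMAIL_RE = /[^\s@]+@[^\s@]+\.[^\s@]+/g;

function redact(fields) {
    const clean = {};
    for (const [key, value] of Object.entries(fields)) {
        if (PII_FIELDS.has(key)) continue; // drop known PII fields entirely
        clean[key] = typeof value === 'string'
            ? value.replace(EMAIL_RE, '[redacted]') // mask stray email values
            : value;
    }
    return clean;
}
```

Running every log entry through a pass like this catches the PII that slips in through free-text fields, but it is a backstop: the primary rule remains "log IDs, not values."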

Correlation IDs: Tracing a Request Across Services

A single user request might touch an API gateway, an auth service, a product service, a payment service, a notification worker, and a search indexer. Without a correlation ID, finding all log entries for one request requires guessing timestamps and grep.

// Generate at the edge (API gateway or first service)
const correlationId = crypto.randomUUID();

// Pass through every service via headers
const response = await httpClient.post('/api/orders', body, {
    headers: { 'X-Correlation-Id': correlationId },
});

// Include in every log entry
logger.info('order_created', {
    correlation_id: correlationId,
    order_id: order.id,
    tenant_id: ctx.tenantId,
});

// Include in every message sent to queues
await queue.add('send_confirmation', {
    orderId: order.id,
    correlationId,
});

When debugging, query by correlation ID:

correlation_id = "a1b2c3d4-e5f6-7890-abcd-ef1234567890"

Every log entry from every service for that request appears. The full chain is visible. This is the single most impactful debugging tool in distributed systems.

Alert Fatigue: Fewer Alerts, Better Alerts

The default approach: alert on everything. Error rate > 0? Alert. Latency > 500ms? Alert. Queue depth > 10? Alert. The result: 200 alerts per day, all ignored.

Alert Design Principles

| Principle | Bad Alert | Good Alert |
| --- | --- | --- |
| Actionable | "Error occurred" | "Payment service error rate > 5% for 10 minutes, affecting checkout" |
| Threshold with duration | "Latency > 500ms" (fires on every spike) | "p95 latency > 2s for 5 consecutive minutes" |
| Business impact | "CPU > 80%" | "Order processing delayed: queue depth growing for 15 minutes" |
| Not duplicate | Same alert fires 50 times | Alert fires once, includes count of occurrences |

Alert Tiers

| Tier | Response | Channel | Example |
| --- | --- | --- | --- |
| P1 Critical | Immediate (wake someone up) | PagerDuty/phone | Payment processing down, data loss risk |
| P2 High | Within 1 hour (business hours) | Slack + email | Error rate elevated, degraded performance |
| P3 Medium | Next business day | Slack | Dead letter queue growing, non-critical job failing |
| P4 Low | Weekly review | Dashboard | Disk usage trending up, certificate expiring in 30 days |

The goal: P1 alerts fire less than once per week. If they fire daily, they're either misconfigured or your system has fundamental reliability issues.
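The "threshold with duration" principle from the table above is simple to implement: fire only when every sample in a sliding window breaches the threshold, so single spikes stay quiet. A sketch (function and parameter names are illustrative, not from a specific monitoring tool):

```javascript
// Returns true only when the last `windowSize` samples ALL exceed the
// threshold -- i.e. the breach has been sustained, not a one-off spike.
function sustainedBreach(samples, threshold, windowSize) {
    if (samples.length < windowSize) return false;
    return samples.slice(-windowSize).every((v) => v > threshold);
}
```

With per-minute p95 latency samples and `windowSize = 5`, this encodes "p95 > 2s for 5 consecutive minutes" rather than alerting on every transient spike.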

Cost Management

Observability is expensive. Log ingestion, metric storage, trace retention, and dashboard hosting add up fast.

| Component | Cost Driver | Optimization |
| --- | --- | --- |
| Log ingestion | Volume (GB/day) | Filter debug logs in production, sample verbose logs |
| Log retention | Duration | 30 days hot, 90 days warm, archive cold |
| Metrics | Cardinality (unique label combinations) | Avoid high-cardinality labels (user IDs, request IDs) |
| Traces | Volume + retention | Tail-based sampling (keep errors, drop routine) |
| Dashboards | User seats + queries | Consolidate dashboards, remove unused ones |
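Cardinality control is worth enforcing in code, not convention. A hypothetical label sanitizer that allows only a fixed set of label keys and buckets status codes into classes (all names here are assumptions for illustration):

```javascript
// Only these label keys may reach the metrics backend; everything else
// (user_id, request_id, ...) is dropped to keep cardinality bounded.
const ALLOWED_LABELS = new Set(['endpoint', 'method', 'status_class', 'tenant_id']);

function boundedLabels(raw) {
    const labels = {};
    for (const [key, value] of Object.entries(raw)) {
        if (key === 'status') {
            labels.status_class = `${String(value)[0]}xx`; // 200 -> "2xx", 404 -> "4xx"
        } else if (ALLOWED_LABELS.has(key)) {
            labels[key] = value;
        }
        // unknown keys are silently dropped
    }
    return labels;
}
```

Bucketing status codes into five classes instead of dozens of codes, and refusing per-user or per-request labels, keeps the time-series count proportional to your endpoints and tenants rather than your traffic.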

Reduce Log Volume Without Losing Signal

// Don't log every health check
if (req.path === '/health' || req.path === '/ready') {
    return next(); // Skip logging
}

// Sample verbose operations
if (req.path.startsWith('/api/search') && Math.random() > 0.1) {
    ctx.skipLogging = true; // Log only 10% of search requests
}

// Always log errors, business events, and slow requests
logger.info('request_completed', {
    path: req.path,
    status: res.statusCode,
    duration_ms: duration,
    // Only include detailed fields for slow or error requests
    ...(duration > 1000 || res.statusCode >= 400 ? {
        query_params: sanitize(req.query),
        response_size: res.contentLength,
    } : {}),
});

The Observability Maturity Model

| Level | Capability | What You Can Answer |
| --- | --- | --- |
| L0: None | console.log in production | "Something is broken" (maybe) |
| L1: Logs | Structured logging, centralized | "What happened?" (with grep) |
| L2: Metrics | Key metrics, basic dashboards | "Is the system healthy right now?" |
| L3: Correlation | Correlation IDs, distributed tracing | "What happened to this specific request?" |
| L4: Alerting | Tiered alerts, runbooks | "Something is breaking and we know about it immediately" |
| L5: Proactive | Anomaly detection, SLO-based alerts, cost tracking | "Something is about to break" |

Most teams are at L1-L2. L3 (correlation IDs) is the biggest single improvement. L4 (good alerting) prevents burnout. L5 is aspirational but achievable.

Common Pitfalls

  1. Unstructured logs. console.log with string concatenation is noise at scale. Use structured JSON with typed fields.

  2. PII in logs. Your log aggregator becomes GDPR-regulated. Log IDs, not values.

  3. Alert on everything. 200 alerts per day means zero alerts are read. Tier alerts by severity and business impact.

  4. High-cardinality metric labels. Using user IDs or request IDs as metric labels creates millions of time series. Use bounded labels (status code, endpoint, tenant).

  5. No correlation IDs. Without them, debugging a request across 7 services requires timestamp guessing and prayer.

  6. No log sampling. Logging every health check and search request at full verbosity doubles your log bill. Sample verbose operations, always log errors.

  7. No cost budget for observability. Observability spend should be 5-15% of infrastructure spend. Track it. Optimize it.

  8. Dashboards nobody looks at. If a dashboard hasn't been viewed in 30 days, delete it. Dashboard sprawl adds cost and confusion.

Key Takeaways

  • Structured logging is non-negotiable. JSON with typed fields. Queryable, filterable, alertable. Never string concatenation.

  • Correlation IDs are the most impactful debugging tool. Generate at the edge, propagate through every service and queue. Query by correlation ID to see the full request chain.

  • No PII in logs, ever. Log customer IDs, not customer emails. Log order IDs, not order contents. Your log infrastructure must not become a data protection liability.

  • Fewer alerts, better alerts. P1 alerts should fire less than once per week. Every alert must be actionable. Include business impact in the alert message.

  • Budget your observability. Log volume, metric cardinality, trace sampling, and retention policies all affect cost. Track observability spend as a percentage of infrastructure spend.

We build observability into every system we deploy across AI services, cloud infrastructure, and custom software. If you need help with production observability, talk to our team or request a quote.

Topics covered

observability, structured logging, distributed tracing, production monitoring, alert fatigue, correlation IDs, PII-safe logging, observability cost
