Designing Systems for Failure (Because They Will Fail)
Failure response patterns for production systems. Circuit breakers, retry strategies, graceful degradation, dead letter handling, timeout budgets, and chaos engineering for small teams.
The Failure Taxonomy
Systems fail in four ways. Each requires a different response.
| Type | Description | Example | Correct Response |
|---|---|---|---|
| Transient | Brief glitch, resolves itself | Network timeout, connection reset | Retry with backoff |
| Permanent | Broken until someone fixes it | Invalid config, schema mismatch | Fail fast, alert, don't retry |
| Partial | Some functionality works | One of three suppliers is down | Degrade gracefully, serve what works |
| Cascading | One failure triggers others | Database overload causes all services to timeout | Circuit breaker, shed load |
The most dangerous mistake: treating every failure the same way. Retrying a permanent failure wastes resources. Not retrying a transient failure degrades user experience. Not detecting a cascading failure brings down the entire system.
For how we handle failures specifically in event-driven systems, see our event-driven architecture guide. For AI system failures, see our AI failure modes guide.
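The taxonomy only pays off if a failure is classified before a response is chosen. A minimal sketch of that first step, assuming illustrative error codes (the codes and the `FailureClass` type are hypothetical, not from any specific library):

```typescript
type FailureClass = 'transient' | 'permanent' | 'partial' | 'cascading';

// Map an error code to a failure class. The codes are illustrative;
// a real system would inspect error types, HTTP statuses, or library errors.
function classifyFailure(code: string): FailureClass {
  switch (code) {
    case 'ETIMEDOUT':
    case 'ECONNRESET':
      return 'transient'; // brief glitch: retry with backoff
    case 'VALIDATION_ERROR':
    case 'SCHEMA_MISMATCH':
      return 'permanent'; // broken until fixed: fail fast and alert
    case 'DEPENDENCY_DOWN':
      return 'partial'; // one dependency out: degrade gracefully
    default:
      return 'permanent'; // unknown errors: fail fast rather than retry blindly
  }
}
```

Defaulting unknown errors to `permanent` is a deliberate choice: retrying an unclassified failure risks wasting resources, while failing fast surfaces the problem.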
Circuit Breakers
A circuit breaker prevents a failing service from bringing down everything that depends on it. When a downstream service fails repeatedly, the circuit breaker "opens" and short-circuits requests immediately instead of waiting for timeouts.
```typescript
class CircuitOpenError extends Error {}

class CircuitBreaker {
  private failures = 0;
  private lastFailure = 0;
  private state: 'closed' | 'open' | 'half-open' = 'closed';

  constructor(
    private threshold: number = 5, // failures before opening
    private resetTimeout: number = 30000, // ms before trying again
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.lastFailure > this.resetTimeout) {
        this.state = 'half-open'; // Try one request
      } else {
        throw new CircuitOpenError('Circuit is open');
      }
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    this.failures = 0;
    this.state = 'closed';
  }

  private onFailure() {
    this.failures++;
    this.lastFailure = Date.now();
    if (this.failures >= this.threshold) {
      this.state = 'open';
    }
  }
}

// Usage
const supplierBreaker = new CircuitBreaker(5, 30000);

async function checkAvailability(productId: string) {
  return supplierBreaker.execute(async () => {
    return await supplierApi.checkAvailability(productId);
  });
}
```
The Common Mistake
Most teams implement the circuit breaker but don't handle the open state. When the circuit is open, what does the user see? A 500 error is the wrong answer. The right answer depends on the context:
| Context | When Circuit Opens | Correct Behavior |
|---|---|---|
| Product search | One supplier down | Show results from other suppliers |
| Price check | Pricing service down | Show cached price with "as of X" label |
| Checkout | Payment gateway down | Queue the order, process later |
| Recommendation | ML service down | Show popular items instead |
| Image service | CDN down | Show placeholder image |
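One way to make the open state explicit is a small fallback wrapper around the breaker-guarded call. A sketch, assuming the breaker throws a `CircuitOpenError` as in the class above (`withFallback` is a hypothetical helper, not a library API):

```typescript
// Mirrors the error thrown by the circuit breaker above.
class CircuitOpenError extends Error {}

// Run the primary call; if the circuit is open, serve the degraded result
// instead of surfacing a 500. Other errors propagate normally.
async function withFallback<T>(
  primary: () => Promise<T>,
  fallback: () => Promise<T> | T,
): Promise<T> {
  try {
    return await primary();
  } catch (error) {
    if (error instanceof CircuitOpenError) {
      return await fallback(); // circuit open: serve the fallback
    }
    throw error;
  }
}
```

At the call site, each context in the table supplies its own fallback: cached prices, popular items, placeholder images, or a queued order.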
Retry Strategies
Not all retries are equal. The strategy depends on the failure type.
```typescript
interface RetryConfig {
  maxAttempts: number;
  strategy: 'immediate' | 'fixed' | 'exponential' | 'none';
  baseDelay: number; // ms
  maxDelay: number; // ms
  jitter: boolean; // randomize to prevent thundering herd
}

const RETRY_CONFIGS: Record<string, RetryConfig> = {
  // Transient: retry with exponential backoff
  network_timeout: {
    maxAttempts: 3,
    strategy: 'exponential',
    baseDelay: 1000, // 1s, 2s, 4s
    maxDelay: 10000,
    jitter: true,
  },
  // Rate limited: retry with fixed delay
  rate_limited: {
    maxAttempts: 5,
    strategy: 'fixed',
    baseDelay: 5000, // wait 5s between attempts
    maxDelay: 5000,
    jitter: false,
  },
  // Optimistic lock conflict: retry immediately
  lock_conflict: {
    maxAttempts: 3,
    strategy: 'immediate',
    baseDelay: 0,
    maxDelay: 0,
    jitter: false,
  },
  // Permanent failure: don't retry
  validation_error: {
    maxAttempts: 1,
    strategy: 'none',
    baseDelay: 0,
    maxDelay: 0,
    jitter: false,
  },
};

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithStrategy<T>(
  fn: () => Promise<T>,
  config: RetryConfig,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < config.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt === config.maxAttempts - 1) break;
      if (config.strategy === 'none') break;
      let delay = config.baseDelay;
      if (config.strategy === 'exponential') {
        delay = Math.min(config.baseDelay * Math.pow(2, attempt), config.maxDelay);
      }
      if (config.jitter) {
        delay += Math.random() * delay * 0.5; // 0-50% jitter
      }
      await sleep(delay);
    }
  }
  throw lastError;
}
```
Jitter is critical. Without it, all clients retry at the same time after a failure (thundering herd). Jitter spreads retries across a time window, reducing the spike on the recovering service.
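The effect is easy to see by computing the delays directly. A sketch using the same exponential schedule and 0-50% jitter as the code above (`retryDelay` is an illustrative helper, not part of any library):

```typescript
// Compute the delay for a given retry attempt, with optional jitter.
function retryDelay(baseDelay: number, attempt: number, jitter: boolean): number {
  const delay = baseDelay * Math.pow(2, attempt);
  // Without jitter, every client computes the identical delay and retries
  // in lockstep. With jitter, delays spread over [delay, 1.5 * delay).
  return jitter ? delay + Math.random() * delay * 0.5 : delay;
}
```

Ten clients without jitter all retry at exactly 2000ms after their first failure; with jitter they land anywhere between 2000ms and 3000ms, flattening the spike on the recovering service.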
Graceful Degradation
When a dependency fails, serve what you can instead of failing entirely.
```typescript
async function getProductPage(productId: string): Promise<ProductPageData> {
  // Core data: must succeed
  const product = await productService.getById(productId);
  if (!product) throw new NotFoundError();

  // Non-critical data: degrade gracefully
  const [reviews, recommendations, availability] = await Promise.allSettled([
    reviewService.getForProduct(productId),
    recommendationService.getSimilar(productId),
    inventoryService.checkStock(productId),
  ]);

  return {
    product,
    reviews: reviews.status === 'fulfilled' ? reviews.value : [],
    recommendations: recommendations.status === 'fulfilled' ? recommendations.value : [],
    availability: availability.status === 'fulfilled'
      ? availability.value
      : { status: 'unknown', message: 'Check availability in store' },
  };
}
```
Promise.allSettled is the key. Unlike Promise.all, it doesn't fail if one promise rejects. Each result is independently settled. The product page renders with whatever data is available.
Stale Data Is Better Than No Data
```typescript
async function getProductPrice(productId: string): Promise<PriceInfo> {
  try {
    const livePrice = await pricingService.getPrice(productId);
    await cache.set(`price:${productId}`, livePrice, { ttl: 300 });
    return livePrice;
  } catch (error) {
    // Pricing service down: serve cached price
    const cached = await cache.get(`price:${productId}`);
    if (cached) {
      return { ...cached, stale: true, staleSince: cached.cachedAt };
    }
    // No cache either: return catalog price
    const product = await productService.getById(productId);
    return { price: product.listPrice, stale: true, approximate: true };
  }
}
```
A stale price from cache is better than a 500 error. An approximate catalog price is better than no price at all. Always have a fallback, even if it's less accurate.
Timeout Budgets
Every operation has a time budget. If the budget is exceeded, fail fast instead of making the user wait forever.
```typescript
class TimeoutError extends Error {}

const TIMEOUT_BUDGETS = {
  api_request: 5000, // 5s total for any API request
  database_query: 2000, // 2s for any database query
  external_api: 3000, // 3s for external service calls
  llm_generation: 30000, // 30s for AI generation (streaming)
  search_query: 1000, // 1s for search
  cache_operation: 100, // 100ms for cache read/write
};

async function withTimeout<T>(promise: Promise<T>, budget: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout>;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(`${label} exceeded ${budget}ms budget`)), budget);
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    clearTimeout(timer!); // don't leave a dangling timer when the promise wins
  }
}

// Usage
const results = await withTimeout(
  searchService.query(userQuery),
  TIMEOUT_BUDGETS.search_query,
  'product_search',
);
```
Cascading Timeout Protection
When service A calls service B which calls service C, each service should subtract its own processing time from the remaining budget:
```
API Gateway (5s budget)
└── Auth check (100ms used, 4.9s remaining)
    └── Product service (200ms used, 4.7s remaining)
        └── Pricing service (timeout: 4.7s, not the original 5s)
```
Without cascading budgets, the pricing service uses a full 3s timeout even though the API gateway only has 4.7s left. If pricing takes 3s, the gateway times out before the response arrives, wasting all the work.
Testing Failure
Chaos Engineering for Small Teams
You don't need Netflix's Chaos Monkey to test failure handling. Start with simple fault injection:
```typescript
import type { Request, Response, NextFunction } from 'express';

// Middleware: inject failures in staging
function chaosMiddleware(req: Request, res: Response, next: NextFunction) {
  if (process.env.NODE_ENV !== 'staging') return next();
  const chaos = req.headers['x-chaos'];
  if (chaos === 'latency') {
    setTimeout(next, 3000); // Add 3s latency
  } else if (chaos === 'error') {
    res.status(500).json({ error: 'Chaos: injected failure' });
  } else if (chaos === 'timeout') {
    // Don't respond at all (simulate a hung service)
  } else {
    next();
  }
}
```
Test scenarios that matter:
| Scenario | How to Test | What to Verify |
|---|---|---|
| Database down | Stop database in staging | Circuit breaker opens, cached data served |
| Slow dependency | Inject 5s latency | Timeout fires, degraded response returned |
| Queue full | Fill queue with test messages | Backpressure applied, no data loss |
| Memory pressure | Limit container memory | OOM handling, graceful restart |
| Certificate expiry | Use short-lived cert in staging | Alert fires before expiry |
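The middleware above can also be exercised without a running server by driving it with fake request and response objects. A sketch, where the simplified shapes stand in for Express's real `Request` and `Response` and `injectError` reproduces only the error branch:

```typescript
// Simplified stand-ins for the pieces of Express the middleware touches.
type FakeReq = { headers: Record<string, string | undefined> };
type FakeRes = {
  statusCode?: number;
  body?: unknown;
  status(code: number): FakeRes;
  json(body: unknown): void;
};

function makeRes(): FakeRes {
  return {
    status(code) {
      this.statusCode = code;
      return this;
    },
    json(body) {
      this.body = body;
    },
  };
}

// The 'error' branch of the chaos middleware, driven directly.
function injectError(req: FakeReq, res: FakeRes, next: () => void) {
  if (req.headers['x-chaos'] === 'error') {
    res.status(500).json({ error: 'Chaos: injected failure' });
  } else {
    next();
  }
}
```

A unit test then asserts that `x-chaos: error` produces a 500 and that requests without the header pass through to `next`, which verifies the fault injection itself before it is trusted in staging.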
Common Pitfalls
- **Same retry strategy for all failures.** Timeout needs backoff. Invalid input needs zero retries. Rate limit needs fixed delay.
- **Circuit breaker without fallback.** Opening the circuit and returning 500 is not fault tolerance. Serve cached data, degraded results, or a queued response.
- **No jitter on retries.** All clients retry at the same time, overwhelming the recovering service. Add random jitter.
- **Infinite timeouts.** A request that waits forever blocks a connection and a thread. Every operation needs a timeout budget.
- **Testing only the happy path.** If you've never tested what happens when the database is down, you don't know if your fallbacks work.
- **Cascading failures from shared dependencies.** If services A, B, and C all depend on the same database, and the database is slow, all three services become slow. Circuit breakers on shared dependencies prevent the cascade.
Key Takeaways
- **Classify failures before responding.** Transient (retry), permanent (fail fast), partial (degrade), cascading (circuit break). Each type needs a different strategy.
- **Circuit breakers need fallbacks.** Opening the circuit is not the fix. Serving cached data, alternative results, or queued responses is the fix.
- **Jitter prevents thundering herd.** Randomize retry delays. Without jitter, synchronized retries make recovery harder.
- **Stale data is better than no data.** A cached price from 5 minutes ago is better than a 500 error. Always have a fallback path.
- **Timeout budgets cascade.** Each service in the chain subtracts its processing time from the remaining budget. Don't let inner services use more time than the outer service has left.
- **Test failure in staging.** Simple fault injection (latency, errors, hung connections) verifies that your resilience patterns actually work.
We design resilient systems as part of our custom software and cloud practice. If you need help with reliability engineering, talk to our team or request a quote.
Related Guides
- AI Failure Modes: A Production Engineering Guide. Technical guide to AI failures in production: hallucinations, context limits, prompt injection, model drift, and building resilient AI apps.
- Event-Driven Architecture in Practice: What Actually Goes Wrong. Real event-driven architecture patterns from production: event storms, bidirectional sync loops, dead letters, idempotency stores, and choosing between Kafka, RabbitMQ, BullMQ, and Symfony Messenger.
- Enterprise Guide to Agentic AI Systems. Technical guide to agentic AI systems in enterprise environments: the architecture, capabilities, and applications of autonomous AI agents.