Technical Guide

Designing Systems for Failure (Because They Will Fail)

Failure response patterns for production systems. Circuit breakers, retry strategies, graceful degradation, dead letter handling, timeout budgets, and chaos engineering for small teams.

March 31, 2026 · 14 min read · Oronts Engineering Team

The Failure Taxonomy

Systems fail in four ways. Each requires a different response.

| Type | Description | Example | Correct Response |
| --- | --- | --- | --- |
| Transient | Brief glitch, resolves itself | Network timeout, connection reset | Retry with backoff |
| Permanent | Broken until someone fixes it | Invalid config, schema mismatch | Fail fast, alert, don't retry |
| Partial | Some functionality works | One of three suppliers is down | Degrade gracefully, serve what works |
| Cascading | One failure triggers others | Database overload causes all services to time out | Circuit breaker, shed load |

The most dangerous mistake: treating every failure the same way. Retrying a permanent failure wastes resources. Not retrying a transient failure degrades user experience. Not detecting a cascading failure brings down the entire system.

For how we handle failures specifically in event-driven systems, see our event-driven architecture guide. For AI system failures, see our AI failure modes guide.
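One way to make the taxonomy actionable is to map raw errors to a response class at a single choke point. A minimal sketch follows; the error shape, names, and mapping below are illustrative assumptions, not a spec:

```typescript
type FailureResponse = 'retry_with_backoff' | 'retry_fixed_delay' | 'fail_fast' | 'shed_load';

// Illustrative classifier: real systems key off richer signals (error codes,
// HTTP status, health checks). This mapping is an assumption for the sketch.
function responseFor(error: { code?: string; status?: number }): FailureResponse {
    // Transient: network-level glitches usually resolve on their own
    if (error.code === 'ETIMEDOUT' || error.code === 'ECONNRESET') return 'retry_with_backoff';
    // Rate limiting: back off at a fixed interval rather than exponentially
    if (error.status === 429) return 'retry_fixed_delay';
    // Permanent: the request itself is wrong; retrying cannot help
    if (error.status !== undefined && error.status >= 400 && error.status < 500) return 'fail_fast';
    // Overload downstream hints at a cascade: stop sending traffic
    if (error.status === 503) return 'shed_load';
    return 'retry_with_backoff'; // default to retryable; permanent failures surface fast
}
```

Routing every caught error through one function like this keeps the four strategies from being reimplemented inconsistently across call sites.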

Circuit Breakers

A circuit breaker prevents a failing service from bringing down everything that depends on it. When a downstream service fails repeatedly, the circuit breaker "opens" and short-circuits requests immediately instead of waiting for timeouts.

class CircuitOpenError extends Error {}

class CircuitBreaker {
    private failures = 0;
    private lastFailure = 0;
    private state: 'closed' | 'open' | 'half-open' = 'closed';

    constructor(
        private threshold: number = 5,      // failures before opening
        private resetTimeout: number = 30000, // ms before trying again
    ) {}

    async execute<T>(fn: () => Promise<T>): Promise<T> {
        if (this.state === 'open') {
            if (Date.now() - this.lastFailure > this.resetTimeout) {
                this.state = 'half-open'; // Try one request
            } else {
                throw new CircuitOpenError('Circuit is open');
            }
        }

        try {
            const result = await fn();
            this.onSuccess();
            return result;
        } catch (error) {
            this.onFailure();
            throw error;
        }
    }

    private onSuccess() {
        this.failures = 0;
        this.state = 'closed';
    }

    private onFailure() {
        this.failures++;
        this.lastFailure = Date.now();
        if (this.failures >= this.threshold) {
            this.state = 'open';
        }
    }
}

// Usage
const supplierBreaker = new CircuitBreaker(5, 30000);

async function checkAvailability(productId: string) {
    return supplierBreaker.execute(async () => {
        return await supplierApi.checkAvailability(productId);
    });
}

The Common Mistake

Most teams implement the circuit breaker but don't handle the open state. When the circuit is open, what does the user see? A 500 error is the wrong answer. The right answer depends on the context:

| Context | When Circuit Opens | Correct Behavior |
| --- | --- | --- |
| Product search | One supplier down | Show results from other suppliers |
| Price check | Pricing service down | Show cached price with "as of X" label |
| Checkout | Payment gateway down | Queue the order, process later |
| Recommendation | ML service down | Show popular items instead |
| Image service | CDN down | Show placeholder image |
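Wiring a fallback to the open state for the price-check row might look like the sketch below. The cache shape and `asOf` label are illustrative; `CircuitOpenError` is the error the breaker throws when it short-circuits:

```typescript
class CircuitOpenError extends Error {}

// Illustrative in-memory cache; substitute your real store
const priceCache = new Map<string, { amount: number; cachedAt: string }>();

async function getPriceWithFallback(
    productId: string,
    fetchLive: (id: string) => Promise<{ amount: number; cachedAt: string }>,
): Promise<{ amount: number; stale: boolean; asOf?: string }> {
    try {
        const live = await fetchLive(productId);
        priceCache.set(productId, live);
        return { amount: live.amount, stale: false };
    } catch (err) {
        if (!(err instanceof CircuitOpenError)) throw err;
        const cached = priceCache.get(productId);
        if (cached) {
            // Circuit open: show the cached price with an "as of" label
            return { amount: cached.amount, stale: true, asOf: cached.cachedAt };
        }
        throw err; // no fallback available; let the caller degrade further
    }
}
```

The important part is that the open circuit is handled at the call site, not surfaced to the user as a 500.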

Retry Strategies

Not all retries are equal. The strategy depends on the failure type.

interface RetryConfig {
    maxAttempts: number;
    strategy: 'immediate' | 'fixed' | 'exponential' | 'none';
    baseDelay: number;      // ms
    maxDelay: number;        // ms
    jitter: boolean;         // randomize to prevent thundering herd
}

const RETRY_CONFIGS: Record<string, RetryConfig> = {
    // Transient: retry with exponential backoff
    network_timeout: {
        maxAttempts: 3,
        strategy: 'exponential',
        baseDelay: 1000,     // 1s, 2s, 4s
        maxDelay: 10000,
        jitter: true,
    },
    // Rate limited: retry with fixed delay
    rate_limited: {
        maxAttempts: 5,
        strategy: 'fixed',
        baseDelay: 5000,     // wait 5s between attempts
        maxDelay: 5000,
        jitter: false,
    },
    // Optimistic lock conflict: retry immediately
    lock_conflict: {
        maxAttempts: 3,
        strategy: 'immediate',
        baseDelay: 0,
        maxDelay: 0,
        jitter: false,
    },
    // Permanent failure: don't retry
    validation_error: {
        maxAttempts: 1,
        strategy: 'none',
        baseDelay: 0,
        maxDelay: 0,
        jitter: false,
    },
};

function sleep(ms: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, ms));
}

async function retryWithStrategy<T>(
    fn: () => Promise<T>,
    config: RetryConfig,
): Promise<T> {
    let lastError: Error | undefined;

    for (let attempt = 0; attempt < config.maxAttempts; attempt++) {
        try {
            return await fn();
        } catch (error) {
            lastError = error as Error;
            if (attempt === config.maxAttempts - 1) break;
            if (config.strategy === 'none') break;

            let delay = config.baseDelay;
            if (config.strategy === 'exponential') {
                delay = Math.min(config.baseDelay * Math.pow(2, attempt), config.maxDelay);
            }
            if (config.jitter) {
                delay += Math.random() * delay * 0.5; // add 0-50% random jitter
            }

            await sleep(delay);
        }
    }
    throw lastError;
}

Jitter is critical. Without it, all clients retry at the same time after a failure (thundering herd). Jitter spreads retries across a time window, reducing the spike on the recovering service.
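The 0-50% additive jitter above is one option. Another well-known variant is "full jitter", where the entire delay is drawn uniformly between zero and the exponential ceiling, spreading retries across the widest possible window. A sketch:

```typescript
// Full jitter: sleep a uniform random duration in [0, min(cap, base * 2^attempt)]
function fullJitterDelay(baseMs: number, capMs: number, attempt: number): number {
    const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
    return Math.random() * ceiling;
}
```

Full jitter trades predictable per-client latency for the smoothest aggregate load on the recovering service.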

Graceful Degradation

When a dependency fails, serve what you can instead of failing entirely.

async function getProductPage(productId: string): Promise<ProductPageData> {
    // Core data: must succeed
    const product = await productService.getById(productId);
    if (!product) throw new NotFoundError();

    // Non-critical data: degrade gracefully
    const [reviews, recommendations, availability] = await Promise.allSettled([
        reviewService.getForProduct(productId),
        recommendationService.getSimilar(productId),
        inventoryService.checkStock(productId),
    ]);

    return {
        product,
        reviews: reviews.status === 'fulfilled' ? reviews.value : [],
        recommendations: recommendations.status === 'fulfilled' ? recommendations.value : [],
        availability: availability.status === 'fulfilled'
            ? availability.value
            : { status: 'unknown', message: 'Check availability in store' },
    };
}

Promise.allSettled is the key. Unlike Promise.all, it doesn't fail if one promise rejects. Each result is independently settled. The product page renders with whatever data is available.

Stale Data Is Better Than No Data

async function getProductPrice(productId: string): Promise<PriceInfo> {
    try {
        const livePrice = await pricingService.getPrice(productId);
        // Record when the price was cached so stale reads can report their age
        await cache.set(`price:${productId}`, { ...livePrice, cachedAt: Date.now() }, { ttl: 300 });
        return livePrice;
    } catch (error) {
        // Pricing service down: serve the cached price
        const cached = await cache.get(`price:${productId}`);
        if (cached) {
            return { ...cached, stale: true, staleSince: cached.cachedAt };
        }
        // No cache either: fall back to the catalog price
        const product = await productService.getById(productId);
        return { price: product.listPrice, stale: true, approximate: true };
    }
}

A stale price from cache is better than a 500 error. An approximate catalog price is better than no price at all. Always have a fallback, even if it's less accurate.

Timeout Budgets

Every operation has a time budget. If the budget is exceeded, fail fast instead of making the user wait forever.

const TIMEOUT_BUDGETS = {
    api_request: 5000,          // 5s total for any API request
    database_query: 2000,       // 2s for any database query
    external_api: 3000,         // 3s for external service calls
    llm_generation: 30000,      // 30s for AI generation (streaming)
    search_query: 1000,         // 1s for search
    cache_operation: 100,       // 100ms for cache read/write
};

class TimeoutError extends Error {}

async function withTimeout<T>(promise: Promise<T>, budget: number, label: string): Promise<T> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new TimeoutError(`${label} exceeded ${budget}ms budget`)), budget);
    });
    // Clear the timer so a fast success doesn't leave a pending timeout behind
    return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage
const results = await withTimeout(
    searchService.query(userQuery),
    TIMEOUT_BUDGETS.search_query,
    'product_search',
);

Cascading Timeout Protection

When service A calls service B which calls service C, each service should subtract its own processing time from the remaining budget:

API Gateway (5s budget)
  └── Auth check (100ms used, 4.9s remaining)
       └── Product service (200ms used, 4.7s remaining)
            └── Pricing service (timeout: 4.7s, not the original 5s)

Without cascading budgets, each hop applies its own fixed timeout regardless of what its caller has left. If earlier hops have consumed most of the 5s, the pricing service can still run for its full timeout, and the gateway times out before the response arrives, wasting all the work.
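One common way to cascade a budget is to set an absolute deadline once at the edge and have every hop derive its timeout from what remains. The header name and floor value below are assumptions for the sketch, not a standard:

```typescript
// Compute the timeout to give a downstream call from an absolute deadline.
// deadlineMs is an epoch timestamp set once at the edge (e.g. by the gateway).
function remainingBudget(deadlineMs: number, floorMs = 50): number {
    const remaining = deadlineMs - Date.now();
    if (remaining <= floorMs) {
        // Not enough time left to do useful work: fail fast instead of starting
        throw new Error('Budget exhausted');
    }
    return remaining;
}

// Usage sketch: the gateway sets the deadline, inner services read it from a
// header (name illustrative) and never exceed what is left.
// const deadline = Date.now() + 5000;                        // gateway: 5s total
// headers['x-request-deadline'] = String(deadline);
// const timeout = Math.min(3000, remainingBudget(deadline)); // inner service
```

Taking the minimum of the service's own budget and the remaining deadline means an inner hop can never outlive its caller.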

Testing Failure

Chaos Engineering for Small Teams

You don't need Netflix's Chaos Monkey to test failure handling. Start with simple fault injection:

import { Request, Response, NextFunction } from 'express';

// Middleware: inject failures in staging only
function chaosMiddleware(req: Request, res: Response, next: NextFunction) {
    if (process.env.NODE_ENV !== 'staging') return next();

    const chaos = req.headers['x-chaos'];
    if (chaos === 'latency') {
        setTimeout(next, 3000); // add 3s of latency before continuing
    } else if (chaos === 'error') {
        res.status(500).json({ error: 'Chaos: injected failure' });
    } else if (chaos === 'timeout') {
        // Don't respond at all (simulate a hung service)
    } else {
        next();
    }
}

Test scenarios that matter:

| Scenario | How to Test | What to Verify |
| --- | --- | --- |
| Database down | Stop the database in staging | Circuit breaker opens, cached data served |
| Slow dependency | Inject 5s latency | Timeout fires, degraded response returned |
| Queue full | Fill the queue with test messages | Backpressure applied, no data loss |
| Memory pressure | Limit container memory | OOM handling, graceful restart |
| Certificate expiry | Use a short-lived cert in staging | Alert fires before expiry |
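The "Slow dependency" row can be rehearsed at unit level before touching staging. A self-contained sketch follows; `slowDependency`, the budget values, and the fallback string are all illustrative, and the timeout helper is inlined so the snippet runs standalone:

```typescript
class TimeoutError extends Error {}

function withTimeout<T>(promise: Promise<T>, budgetMs: number): Promise<T> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new TimeoutError(`exceeded ${budgetMs}ms`)), budgetMs);
    });
    return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Injected fault: a dependency that takes far longer than the budget allows
const slowDependency = () =>
    new Promise<string>((resolve) => setTimeout(() => resolve('ok'), 300));

// The degraded path we expect under latency injection
async function searchWithFallback(): Promise<string> {
    try {
        return await withTimeout(slowDependency(), 50);
    } catch (err) {
        if (err instanceof TimeoutError) return 'degraded: popular items';
        throw err;
    }
}
```

A test that asserts on the degraded response, not just the absence of an error, is what verifies the fallback actually works.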

Common Pitfalls

  1. Same retry strategy for all failures. Timeout needs backoff. Invalid input needs zero retries. Rate limit needs fixed delay.

  2. Circuit breaker without fallback. Opening the circuit and returning 500 is not fault tolerance. Serve cached data, degraded results, or a queued response.

  3. No jitter on retries. All clients retry at the same time, overwhelming the recovering service. Add random jitter.

  4. Infinite timeouts. A request that waits forever blocks a connection and a thread. Every operation needs a timeout budget.

  5. Testing only the happy path. If you've never tested what happens when the database is down, you don't know if your fallbacks work.

  6. Cascading failures from shared dependencies. If services A, B, and C all depend on the same database, and the database is slow, all three services become slow. Circuit breakers on shared dependencies prevent cascade.

Key Takeaways

  • Classify failures before responding. Transient (retry), permanent (fail fast), partial (degrade), cascading (circuit break). Each type needs a different strategy.

  • Circuit breakers need fallbacks. Opening the circuit is not the fix. Serving cached data, alternative results, or queued responses is the fix.

  • Jitter prevents thundering herd. Randomize retry delays. Without jitter, synchronized retries make recovery harder.

  • Stale data is better than no data. A cached price from 5 minutes ago is better than a 500 error. Always have a fallback path.

  • Timeout budgets cascade. Each service in the chain subtracts its processing time from the remaining budget. Don't let inner services use more time than the outer service has left.

  • Test failure in staging. Simple fault injection (latency, errors, hung connections) verifies that your resilience patterns actually work.

We design resilient systems as part of our custom software and cloud practice. If you need help with reliability engineering, talk to our team or request a quote.

Topics covered

resilience patterns, circuit breaker, retry strategy, graceful degradation, system reliability, timeout budget, chaos engineering, dead letter queue
