Designing Systems for Failure (Because They Will Fail)
Failure response patterns for production systems. Circuit breakers, retry strategies, graceful degradation, dead letter handling, timeout budgets, and chaos engineering for small teams.
The Failure Taxonomy
Systems fail in four ways. Each requires a different response.
| Type | Description | Example | Correct Response |
|---|---|---|---|
| Transient | Brief glitch, resolves itself | Network timeout, connection reset | Retry with backoff |
| Permanent | Broken until someone fixes it | Invalid config, schema mismatch | Fail fast, alert, don't retry |
| Partial | Some functionality works | One of three suppliers is down | Degrade gracefully, serve what works |
| Cascading | One failure triggers others | Database overload causes all services to timeout | Circuit breaker, shed load |
The most dangerous mistake: treating every failure the same way. Retrying a permanent failure wastes resources. Not retrying a transient failure degrades user experience. Not detecting a cascading failure brings down the entire system.
For how we handle failures specifically in event-driven systems, see our event-driven architecture guide. For AI system failures, see our AI failure modes guide.
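The taxonomy only pays off if a failure is classified before a response is chosen. A minimal sketch of that first step, assuming illustrative error codes (the codes and the `FailureClass` type are hypothetical, not from any specific library):

```typescript
type FailureClass = 'transient' | 'permanent' | 'partial' | 'cascading';

// Map an error code to a failure class. The codes are illustrative;
// a real system would inspect error types, HTTP statuses, or library errors.
function classifyFailure(code: string): FailureClass {
  switch (code) {
    case 'ETIMEDOUT':
    case 'ECONNRESET':
      return 'transient'; // brief glitch: retry with backoff
    case 'VALIDATION_ERROR':
    case 'SCHEMA_MISMATCH':
      return 'permanent'; // broken until fixed: fail fast and alert
    case 'DEPENDENCY_DOWN':
      return 'partial'; // one dependency out: degrade gracefully
    default:
      return 'permanent'; // unknown errors: fail fast rather than retry blindly
  }
}
```

Defaulting unknown errors to `permanent` is a deliberate choice: retrying an unclassified failure risks wasting resources, while failing fast surfaces the problem.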
Circuit Breakers
A circuit breaker prevents a failing service from bringing down everything that depends on it. When a downstream service fails repeatedly, the circuit breaker "opens" and short-circuits requests immediately instead of waiting for timeouts.
```typescript
class CircuitOpenError extends Error {}

class CircuitBreaker {
  private failures = 0;
  private lastFailure = 0;
  private state: 'closed' | 'open' | 'half-open' = 'closed';

  constructor(
    private threshold: number = 5, // failures before opening
    private resetTimeout: number = 30000, // ms before trying again
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.lastFailure > this.resetTimeout) {
        this.state = 'half-open'; // Try one request
      } else {
        throw new CircuitOpenError('Circuit is open');
      }
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    this.failures = 0;
    this.state = 'closed';
  }

  private onFailure() {
    this.failures++;
    this.lastFailure = Date.now();
    if (this.failures >= this.threshold) {
      this.state = 'open';
    }
  }
}

// Usage
const supplierBreaker = new CircuitBreaker(5, 30000);

async function checkAvailability(productId: string) {
  return supplierBreaker.execute(async () => {
    return await supplierApi.checkAvailability(productId);
  });
}
```
The Common Mistake
Most teams implement the circuit breaker but don't handle the open state. When the circuit is open, what does the user see? A 500 error is the wrong answer. The right answer depends on the context:
| Context | When Circuit Opens | Correct Behavior |
|---|---|---|
| Product search | One supplier down | Show results from other suppliers |
| Price check | Pricing service down | Show cached price with "as of X" label |
| Checkout | Payment gateway down | Queue the order, process later |
| Recommendation | ML service down | Show popular items instead |
| Image service | CDN down | Show placeholder image |
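One way to make the open state explicit is a small fallback wrapper around the breaker-guarded call. A sketch, assuming the breaker throws a `CircuitOpenError` as in the class above (`withFallback` is a hypothetical helper, not a library API):

```typescript
// Mirrors the error thrown by the circuit breaker above.
class CircuitOpenError extends Error {}

// Run the primary call; if the circuit is open, serve the degraded result
// instead of surfacing a 500. Other errors propagate normally.
async function withFallback<T>(
  primary: () => Promise<T>,
  fallback: () => Promise<T> | T,
): Promise<T> {
  try {
    return await primary();
  } catch (error) {
    if (error instanceof CircuitOpenError) {
      return await fallback(); // circuit open: serve the fallback
    }
    throw error;
  }
}
```

At the call site, each context in the table supplies its own fallback: cached prices, popular items, placeholder images, or a queued order.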
Retry Strategies
Not all retries are equal. The strategy depends on the failure type.
```typescript
interface RetryConfig {
  maxAttempts: number;
  strategy: 'immediate' | 'fixed' | 'exponential' | 'none';
  baseDelay: number; // ms
  maxDelay: number; // ms
  jitter: boolean; // randomize to prevent thundering herd
}

const RETRY_CONFIGS: Record<string, RetryConfig> = {
  // Transient: retry with exponential backoff
  network_timeout: {
    maxAttempts: 3,
    strategy: 'exponential',
    baseDelay: 1000, // 1s, 2s, 4s
    maxDelay: 10000,
    jitter: true,
  },
  // Rate limited: retry with fixed delay
  rate_limited: {
    maxAttempts: 5,
    strategy: 'fixed',
    baseDelay: 5000, // wait 5s between attempts
    maxDelay: 5000,
    jitter: false,
  },
  // Optimistic lock conflict: retry immediately
  lock_conflict: {
    maxAttempts: 3,
    strategy: 'immediate',
    baseDelay: 0,
    maxDelay: 0,
    jitter: false,
  },
  // Permanent failure: don't retry
  validation_error: {
    maxAttempts: 1,
    strategy: 'none',
    baseDelay: 0,
    maxDelay: 0,
    jitter: false,
  },
};

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithStrategy<T>(
  fn: () => Promise<T>,
  config: RetryConfig,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < config.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt === config.maxAttempts - 1) break;
      if (config.strategy === 'none') break;
      let delay = config.baseDelay;
      if (config.strategy === 'exponential') {
        delay = Math.min(config.baseDelay * Math.pow(2, attempt), config.maxDelay);
      }
      if (config.jitter) {
        delay += Math.random() * delay * 0.5; // 0-50% jitter
      }
      await sleep(delay);
    }
  }
  throw lastError;
}
```
Jitter is critical. Without it, all clients retry at the same time after a failure (thundering herd). Jitter spreads retries across a time window, reducing the spike on the recovering service.
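The effect is easy to see by computing the delays directly. A sketch using the same exponential schedule and 0-50% jitter as the code above (`retryDelay` is an illustrative helper, not part of any library):

```typescript
// Compute the delay for a given retry attempt, with optional jitter.
function retryDelay(baseDelay: number, attempt: number, jitter: boolean): number {
  const delay = baseDelay * Math.pow(2, attempt);
  // Without jitter, every client computes the identical delay and retries
  // in lockstep. With jitter, delays spread over [delay, 1.5 * delay).
  return jitter ? delay + Math.random() * delay * 0.5 : delay;
}
```

Ten clients without jitter all retry at exactly 2000ms after their first failure; with jitter they land anywhere between 2000ms and 3000ms, flattening the spike on the recovering service.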
Graceful Degradation
When a dependency fails, serve what you can instead of failing entirely.
```typescript
async function getProductPage(productId: string): Promise<ProductPageData> {
  // Core data: must succeed
  const product = await productService.getById(productId);
  if (!product) throw new NotFoundError();

  // Non-critical data: degrade gracefully
  const [reviews, recommendations, availability] = await Promise.allSettled([
    reviewService.getForProduct(productId),
    recommendationService.getSimilar(productId),
    inventoryService.checkStock(productId),
  ]);

  return {
    product,
    reviews: reviews.status === 'fulfilled' ? reviews.value : [],
    recommendations: recommendations.status === 'fulfilled' ? recommendations.value : [],
    availability: availability.status === 'fulfilled'
      ? availability.value
      : { status: 'unknown', message: 'Check availability in store' },
  };
}
```
Promise.allSettled is the key. Unlike Promise.all, it doesn't fail if one promise rejects. Each result is independently settled. The product page renders with whatever data is available.
Stale Data Is Better Than No Data
```typescript
async function getProductPrice(productId: string): Promise<PriceInfo> {
  try {
    const livePrice = await pricingService.getPrice(productId);
    await cache.set(`price:${productId}`, livePrice, { ttl: 300 });
    return livePrice;
  } catch (error) {
    // Pricing service down: serve cached price
    const cached = await cache.get(`price:${productId}`);
    if (cached) {
      return { ...cached, stale: true, staleSince: cached.cachedAt };
    }
    // No cache either: return catalog price
    const product = await productService.getById(productId);
    return { price: product.listPrice, stale: true, approximate: true };
  }
}
```
A stale price from cache is better than a 500 error. An approximate catalog price is better than no price at all. Always have a fallback, even if it's less accurate.
Timeout Budgets
Every operation has a time budget. If the budget is exceeded, fail fast instead of making the user wait forever.
```typescript
class TimeoutError extends Error {}

const TIMEOUT_BUDGETS = {
  api_request: 5000, // 5s total for any API request
  database_query: 2000, // 2s for any database query
  external_api: 3000, // 3s for external service calls
  llm_generation: 30000, // 30s for AI generation (streaming)
  search_query: 1000, // 1s for search
  cache_operation: 100, // 100ms for cache read/write
};

async function withTimeout<T>(promise: Promise<T>, budget: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout>;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(`${label} exceeded ${budget}ms budget`)), budget);
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    clearTimeout(timer!); // don't leave a dangling timer when the promise wins
  }
}

// Usage
const results = await withTimeout(
  searchService.query(userQuery),
  TIMEOUT_BUDGETS.search_query,
  'product_search',
);
```
Cascading Timeout Protection
When service A calls service B which calls service C, each service should subtract its own processing time from the remaining budget:
```
API Gateway (5s budget)
└── Auth check (100ms used, 4.9s remaining)
    └── Product service (200ms used, 4.7s remaining)
        └── Pricing service (timeout: 4.7s, not the original 5s)
```
Without cascading budgets, the pricing service uses a full 3s timeout even though the API gateway only has 4.7s left. If pricing takes 3s, the gateway times out before the response arrives, wasting all the work.
Testing Failure
Chaos Engineering for Small Teams
You don't need Netflix's Chaos Monkey to test failure handling. Start with simple fault injection:
```typescript
import type { Request, Response, NextFunction } from 'express';

// Middleware: inject failures in staging
function chaosMiddleware(req: Request, res: Response, next: NextFunction) {
  if (process.env.NODE_ENV !== 'staging') return next();
  const chaos = req.headers['x-chaos'];
  if (chaos === 'latency') {
    setTimeout(next, 3000); // Add 3s latency
  } else if (chaos === 'error') {
    res.status(500).json({ error: 'Chaos: injected failure' });
  } else if (chaos === 'timeout') {
    // Don't respond at all (simulate a hung service)
  } else {
    next();
  }
}
```
Test scenarios that matter:
| Scenario | How to Test | What to Verify |
|---|---|---|
| Database down | Stop database in staging | Circuit breaker opens, cached data served |
| Slow dependency | Inject 5s latency | Timeout fires, degraded response returned |
| Queue full | Fill queue with test messages | Backpressure applied, no data loss |
| Memory pressure | Limit container memory | OOM handling, graceful restart |
| Certificate expiry | Use short-lived cert in staging | Alert fires before expiry |
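The middleware above can also be exercised without a running server by driving it with fake request and response objects. A sketch, where the simplified shapes stand in for Express's real `Request` and `Response` and `injectError` reproduces only the error branch:

```typescript
// Simplified stand-ins for the pieces of Express the middleware touches.
type FakeReq = { headers: Record<string, string | undefined> };
type FakeRes = {
  statusCode?: number;
  body?: unknown;
  status(code: number): FakeRes;
  json(body: unknown): void;
};

function makeRes(): FakeRes {
  return {
    status(code) {
      this.statusCode = code;
      return this;
    },
    json(body) {
      this.body = body;
    },
  };
}

// The 'error' branch of the chaos middleware, driven directly.
function injectError(req: FakeReq, res: FakeRes, next: () => void) {
  if (req.headers['x-chaos'] === 'error') {
    res.status(500).json({ error: 'Chaos: injected failure' });
  } else {
    next();
  }
}
```

A unit test then asserts that `x-chaos: error` produces a 500 and that requests without the header pass through to `next`, which verifies the fault injection itself before it is trusted in staging.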
Common Pitfalls
- **Same retry strategy for all failures.** Timeout needs backoff. Invalid input needs zero retries. Rate limit needs fixed delay.
- **Circuit breaker without fallback.** Opening the circuit and returning 500 is not fault tolerance. Serve cached data, degraded results, or a queued response.
- **No jitter on retries.** All clients retry at the same time, overwhelming the recovering service. Add random jitter.
- **Infinite timeouts.** A request that waits forever blocks a connection and a thread. Every operation needs a timeout budget.
- **Testing only the happy path.** If you've never tested what happens when the database is down, you don't know if your fallbacks work.
- **Cascading failures from shared dependencies.** If services A, B, and C all depend on the same database, and the database is slow, all three services become slow. Circuit breakers on shared dependencies prevent the cascade.
Key Takeaways
- **Classify failures before responding.** Transient (retry), permanent (fail fast), partial (degrade), cascading (circuit break). Each type needs a different strategy.
- **Circuit breakers need fallbacks.** Opening the circuit is not the fix. Serving cached data, alternative results, or queued responses is the fix.
- **Jitter prevents thundering herd.** Randomize retry delays. Without jitter, synchronized retries make recovery harder.
- **Stale data is better than no data.** A cached price from 5 minutes ago is better than a 500 error. Always have a fallback path.
- **Timeout budgets cascade.** Each service in the chain subtracts its processing time from the remaining budget. Don't let inner services use more time than the outer service has left.
- **Test failure in staging.** Simple fault injection (latency, errors, hung connections) verifies that your resilience patterns actually work.
We design resilient systems as part of our custom software and cloud practice. If you need help with reliability engineering, talk to our team or request a quote.
Related Guides
- AI Failure Modes: A Production Engineering Guide. Technical guide to AI failures in production: hallucinations, context limits, prompt injection, model drift, and building resilient AI apps.
- Event-Driven Architecture in Practice: What Actually Goes Wrong. Real event-driven architecture patterns from production: event storms, bidirectional sync loops, dead letters, idempotency stores, and choosing between Kafka, RabbitMQ, BullMQ, and Symfony Messenger.
- Enterprise Guide to Agentic AI Systems. Technical guide to agentic AI systems in enterprise environments: the architecture, capabilities, and applications of autonomous AI agents.