Event-Driven Architecture in Practice: What Actually Goes Wrong
Real event-driven architecture patterns from production. Event storms, bidirectional sync loops, dead letters, idempotency stores, and choosing between Kafka, RabbitMQ, BullMQ, and Symfony Messenger.
The Event Storm Nobody Expects
Event-driven architecture looks clean on a whiteboard. Service A emits an event. Service B consumes it. Loose coupling. Independent scaling. Beautiful diagrams.
Then you deploy to production. One product save triggers 8 event subscribers. Each subscriber dispatches 3 async messages. Each message handler loads the product, modifies a field, and saves. Each save triggers 8 subscribers again. Within seconds, your message queue has thousands of messages for a single product update, your workers are saturated, and your database is under lock contention.
We've operated four different event-driven systems in production. Each used a different message broker. Each had different failure patterns. This article covers what actually goes wrong and how to fix it. For broader context on how we approach system design, see our architecture methodology guide.
The Four Failure Patterns
Every event-driven system eventually encounters these:
| Pattern | What Happens | Severity |
|---|---|---|
| Event storm | One action cascades into thousands of messages | Critical (can saturate infrastructure) |
| Bidirectional sync loop | System A syncs to B, B syncs back to A, infinite loop | Critical (infinite message growth) |
| Duplicate processing | Same message processed twice, creates duplicate data | High (data integrity) |
| Dead letter accumulation | Failed messages pile up with no resolution strategy | Medium (operational debt) |
1. Event Storms
The most common and most dangerous pattern. It happens when event handlers trigger new events, and those events trigger more handlers.
Product save
  -> EventSubscriber A dispatches MessageA
  -> EventSubscriber B dispatches MessageB
  -> EventSubscriber C dispatches MessageC
  -> ...8 subscribers total

WorkerA processes MessageA
  -> Modifies product, calls save()
  -> Product save triggers 8 subscribers again
  -> Each dispatches more messages

WorkerB processes MessageB (same pattern)
  -> 8 more messages

Result: 1 save -> 8 messages -> 8 saves -> 64 messages -> ...
The solution: EventSubscriberSupervisor. A process-scoped control that disables event subscribers during worker saves.
// Worker handler: disable subscribers before saving
async function handleGenerateThumbnail(message: ThumbnailMessage) {
  const product = await productService.findById(message.productId);

  subscriberSupervisor.disableAll();
  try {
    product.thumbnail = await generateThumbnail(product);
    await product.save(); // No subscribers fire, no new messages dispatched
  } finally {
    subscriberSupervisor.enableAll();
  }
}
The order matters. The save must happen inside the disabled scope. If you re-enable subscribers before saving, the save triggers the cascade.
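The article doesn't show the supervisor's internals, but a minimal sketch is a process-scoped toggle. Using a depth counter (rather than a boolean) lets nested disable/enable calls compose correctly; the names here mirror the usage above but the implementation is illustrative:

```typescript
// Hypothetical process-scoped supervisor. A depth counter instead of a
// boolean flag means nested disableAll/enableAll pairs behave correctly.
class EventSubscriberSupervisor {
  private disabledDepth = 0;

  disableAll(): void {
    this.disabledDepth++;
  }

  enableAll(): void {
    this.disabledDepth = Math.max(0, this.disabledDepth - 1);
  }

  // Every event subscriber checks this before dispatching messages
  isEnabled(): boolean {
    return this.disabledDepth === 0;
  }
}

const subscriberSupervisor = new EventSubscriberSupervisor();
```

Subscribers then guard their dispatch with `if (!subscriberSupervisor.isEnabled()) return;`, which is what makes the save inside the disabled scope silent.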
For a deeper look at how we apply this pattern specifically in Pimcore, see our Pimcore workflow design guide.
2. Bidirectional Sync Loops
When two systems sync data bidirectionally, each update from A triggers a sync to B, which triggers a sync back to A. Without loop prevention, this runs forever.
System A (Commerce)                  System B (Operations)
        |                                  |
        | Customer updated                 |
        | -------------------------------> |
        |                                  | Sync received, save customer
        |                                  | Customer updated event fires
        | Sync received                    |
        | <------------------------------- |
        | Save customer                    |
        | Customer updated event           |
        | -------------------------------> |
        |          ...forever...           |
The solution: source tracking. Every message carries an x-source header that identifies the originating system. When a system receives a message from itself (via the other system), it ignores it.
// When publishing a sync message
async function publishCustomerSync(customer: Customer, source: string) {
  await messageQueue.publish('customer.updated', {
    customerId: customer.id,
    data: customer,
    source: source, // "commerce" or "operations"
  });
}

// When consuming a sync message
async function handleCustomerSync(message: CustomerSyncMessage) {
  // Ignore messages that originated from this system
  if (message.source === THIS_SYSTEM_ID) {
    logger.debug('Ignoring self-originated sync', { customerId: message.customerId });
    return; // ACK the message, don't process
  }

  const customer = await customerService.update(message.customerId, message.data);

  // Propagate the ORIGINAL source, not this system's ID. That way the
  // originating system recognizes its own update when it comes back
  // around and ignores it, which is what breaks the loop.
  await publishCustomerSync(customer, message.source);
}
An alternative approach: message deduplication by content hash. Hash the message payload and check if the same hash was processed recently. If the data hasn't changed, the sync is a no-op.
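A minimal sketch of that alternative, using Node's built-in crypto. The in-memory map and the one-hour TTL are illustrative; production would typically use Redis with key expiry:

```typescript
import { createHash } from 'node:crypto';

// Recently seen payload hashes. In production this would be Redis
// with a TTL rather than a process-local Map.
const recentHashes = new Map<string, number>();
const TTL_MS = 60 * 60 * 1000; // 1 hour

// Returns true if this exact payload for this entity was already
// processed within the TTL window -- i.e. the sync is a no-op.
function isDuplicatePayload(entityId: string, payload: unknown): boolean {
  const hash = createHash('sha256')
    .update(JSON.stringify(payload))
    .digest('hex');
  const key = `${entityId}:${hash}`;

  const seenAt = recentHashes.get(key);
  if (seenAt !== undefined && Date.now() - seenAt < TTL_MS) {
    return true; // same data seen recently, skip the sync
  }
  recentHashes.set(key, Date.now());
  return false;
}
```

Note that JSON.stringify is only a stable hash input if field order is consistent between systems; a canonical serialization is safer when payloads cross language boundaries.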
3. Duplicate Processing
Network failures, worker restarts, and at-least-once delivery guarantees mean the same message can be delivered more than once. Without idempotency, you get duplicate orders, duplicate emails, duplicate database records.
The solution: idempotency stores with business-key deduplication.
// Idempotency store with business key
class IdempotencyStore {
  async checkAndAcquire(key: string, scope: string): Promise<boolean> {
    try {
      await this.db.insert('idempotency_keys', {
        key,
        scope,
        status: 'PROCESSING',
        created_at: new Date(),
        expires_at: new Date(Date.now() + 24 * 60 * 60 * 1000), // 24h TTL
      });
      return true; // Acquired, proceed with processing
    } catch (error) {
      if (isDuplicateKeyError(error)) {
        return false; // Already processed, skip
      }
      throw error;
    }
  }

  async complete(key: string, scope: string): Promise<void> {
    await this.db.update('idempotency_keys',
      { status: 'COMPLETED', completed_at: new Date() },
      { key, scope }
    );
  }

  async fail(key: string, scope: string): Promise<void> {
    // Release the key so the queue's retry can acquire it again
    await this.db.delete('idempotency_keys', { key, scope });
  }
}

// Usage in a worker
async function handleNotification(message: NotificationMessage) {
  const dedupeKey = `${message.recipientEmail}:${message.category}:${message.entityRef}:${todayBucket()}`;

  const acquired = await idempotencyStore.checkAndAcquire(dedupeKey, 'notification');
  if (!acquired) {
    logger.info('Duplicate notification skipped', { dedupeKey });
    return; // ACK the message
  }

  try {
    await emailService.send(message);
    await idempotencyStore.complete(dedupeKey, 'notification');
  } catch (error) {
    await idempotencyStore.fail(dedupeKey, 'notification');
    throw error; // NACK, let the queue retry
  }
}
The dedupe key must be business-meaningful. For notifications: recipient + category + entity + day bucket (prevents sending the same notification twice in one day but allows it the next day). For orders: tenant + product + date. For imports: source record ID + import batch ID.
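The key-construction helpers referenced above might look like this (a sketch; `todayBucket` and the `:` separator are illustrative choices, and the separator must not appear inside the key's components):

```typescript
// Day bucket: the same key all day, a fresh key tomorrow.
function todayBucket(now: Date = new Date()): string {
  return now.toISOString().slice(0, 10); // e.g. "2024-05-01"
}

// Notifications: recipient + category + entity + day bucket
function notificationKey(recipient: string, category: string, entityRef: string): string {
  return `${recipient}:${category}:${entityRef}:${todayBucket()}`;
}

// Imports: source record ID + import batch ID
function importKey(sourceRecordId: string, importBatchId: string): string {
  return `${sourceRecordId}:${importBatchId}`;
}
```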
4. Dead Letter Handling
Messages that fail repeatedly end up in a dead letter queue. Most teams set up the DLQ and then ignore it. Dead letters accumulate. Months later, someone discovers thousands of unprocessed messages.
// Dead letter handler with classification
async function processDeadLetter(message: DeadLetterMessage) {
  const failureType = classifyFailure(message.error);

  switch (failureType) {
    case 'TRANSIENT':
      // Network timeout, service temporarily down
      // Requeue with exponential backoff
      await requeue(message, { delay: calculateBackoff(message.retryCount) });
      break;

    case 'PERMANENT':
      // Invalid data, schema mismatch, business rule violation
      // Log, alert, archive. Do not retry.
      await archiveDeadLetter(message);
      await alertOps('Permanent failure in dead letter', message);
      break;

    case 'POISON':
      // Message itself causes crashes (malformed payload)
      // Archive immediately, never retry
      await archiveDeadLetter(message);
      await alertOps('Poison message detected', message);
      break;
  }
}

function classifyFailure(error: string): 'TRANSIENT' | 'PERMANENT' | 'POISON' {
  if (error.includes('ECONNREFUSED') || error.includes('timeout')) return 'TRANSIENT';
  if (error.includes('SyntaxError') || error.includes('Cannot parse')) return 'POISON';
  return 'PERMANENT';
}
The key insight: not all dead letters are the same. Transient failures should be retried. Permanent failures should be archived and investigated. Poison messages should be quarantined immediately.
For how we handle failure patterns in AI systems specifically, see our AI failure modes guide.
Choosing a Message Broker
We've used four different brokers in production. Each has a distinct sweet spot.
| Broker | Best For | Not Great For | Deployment Complexity |
|---|---|---|---|
| Kafka | High-throughput event streaming, event sourcing, replay | Simple job queues, low-volume systems | High (ZooKeeper/KRaft, partitions, consumer groups) |
| RabbitMQ | Routing, dead letters, priority queues, complex topologies | Very high throughput (millions/sec), event replay | Medium (clustering, HA policies) |
| BullMQ (Redis) | Job queues, delayed jobs, rate limiting, simple setups | Complex routing, message replay, multi-consumer patterns | Low (just Redis) |
| Symfony Messenger | PHP/Symfony apps, Pimcore, simple async dispatch | Non-PHP ecosystems, complex broker features | Low (uses Doctrine, AMQP, or Redis as transport) |
When to Use Kafka
- You need event replay (consumers can re-read from any offset)
- You have multiple consumer groups processing the same events differently
- Throughput exceeds 100K messages per second
- You're doing event sourcing or building a change data capture pipeline
- You need guaranteed ordering within a partition
When to Use RabbitMQ
- You need complex routing (topic exchanges, headers-based routing)
- Dead letter queues with automatic retry policies are important
- Message priority matters (urgent messages processed first)
- You need request-reply patterns (RPC over messages)
- Your system has multiple services in different languages
When to Use BullMQ
- You're in a Node.js/TypeScript ecosystem (Vendure, NestJS)
- Job scheduling with delays and cron patterns
- Rate limiting per queue or per job type
- You already have Redis for caching and sessions
- Your system is small to medium scale (under 10K messages/sec)
When to Use Symfony Messenger
- You're in a PHP/Symfony/Pimcore ecosystem
- You need simple async dispatch for background tasks
- Worker infrastructure follows Symfony conventions (supervisord)
- Transport flexibility (switch between Doctrine, AMQP, Redis without code changes)
For our Pimcore projects, we use Symfony Messenger with RabbitMQ transport. For Vendure projects, we use BullMQ. For high-throughput data ingestion platforms, we use Kafka. The broker choice follows the ecosystem, not the other way around.
Message Ordering
Most teams assume message ordering matters. Usually it doesn't.
| Scenario | Ordering Needed? | Why |
|---|---|---|
| Send notification email | No | Emails are inherently unordered |
| Generate thumbnail | No | Latest version wins |
| Update search index | No | Latest state wins (eventual consistency) |
| Process payment steps | Yes | Charge must happen before capture |
| Order state transitions | Yes | Can't ship before payment confirmed |
| Event sourcing replay | Yes | Events must replay in causal order |
When ordering matters, use FIFO queues (SQS FIFO, Kafka partitions keyed by entity ID, RabbitMQ with single consumer). When it doesn't, parallel processing is faster and simpler.
// Kafka: ordering within a partition (keyed by product ID)
await producer.send({
  topic: 'product-updates',
  messages: [{
    key: productId, // All updates for same product go to same partition
    value: JSON.stringify(event),
  }],
});

// BullMQ: no ordering needed for thumbnails
await thumbnailQueue.add('generate', { productId }, {
  removeOnComplete: true,
  attempts: 3,
  backoff: { type: 'exponential', delay: 5000 },
});
Retry Strategies
Not all retries are equal. The strategy depends on the failure type.
| Strategy | When to Use | Example |
|---|---|---|
| Immediate retry | Transient glitch (race condition) | Optimistic lock failure |
| Exponential backoff | Service temporarily down | External API timeout |
| Fixed delay | Rate limiting | API returns 429 |
| No retry | Permanent failure | Invalid payload, business rule violation |
// BullMQ retry configuration
const queue = new Queue('notifications', {
  defaultJobOptions: {
    attempts: 5, // 1 initial attempt + 4 retries
    backoff: {
      type: 'exponential',
      delay: 60000, // retries after 1min, 2min, 4min, 8min
    },
    removeOnComplete: { count: 1000 },
    removeOnFail: false, // Keep for dead letter analysis
  },
});
After all retries are exhausted, the message moves to a dead letter queue. Never retry infinitely. Set a maximum attempt count and handle the failure explicitly.
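The exhaustion decision and the backoff schedule are both simple enough to express directly. A broker-agnostic sketch (the names and the 16-minute cap are illustrative):

```typescript
interface FailedJob {
  attemptsMade: number; // attempts so far, including the one that just failed
  maxAttempts: number;  // configured ceiling, e.g. 5
}

type FailureAction = 'RETRY' | 'DEAD_LETTER';

// Explicit handling after each failure: retry until the ceiling,
// then move the message to the dead letter queue. Never loop forever.
function onFailure(job: FailedJob): FailureAction {
  return job.attemptsMade >= job.maxAttempts ? 'DEAD_LETTER' : 'RETRY';
}

// Exponential backoff: baseMs * 2^(attempt - 1), capped so a long
// outage doesn't push retries out indefinitely.
function backoffDelayMs(attempt: number, baseMs = 60_000, capMs = 16 * 60_000): number {
  return Math.min(baseMs * 2 ** (attempt - 1), capMs);
}
```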
Monitoring Event-Driven Systems
The hardest part of event-driven systems is debugging. When something fails, the error might be 5 services and 12 messages away from the original cause.
Correlation IDs
Every message carries a correlation ID that links it to the original action:
// Original API request generates correlation ID
const correlationId = generateUUID();

// Every message in the chain carries it
await queue.add('process-order', {
  orderId: order.id,
  correlationId,
});

// Workers pass it to downstream messages
async function handleProcessOrder(job) {
  const { orderId, correlationId } = job.data;

  // ... process order ...

  // Downstream messages carry the same correlation ID
  await notificationQueue.add('send-confirmation', {
    orderId,
    correlationId, // Same ID, traceable back to the original request
  });
}
When debugging, query all logs and events by correlation ID to see the full chain. For more on distributed tracing patterns, see our AI observability guide.
Queue Health Metrics
| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| Queue depth | < 100 | 100-1000 | > 1000 |
| Processing time (p95) | < 5s | 5-30s | > 30s |
| Failure rate | < 1% | 1-5% | > 5% |
| Dead letter count | 0 | 1-10 | > 10 |
| Consumer lag (Kafka) | < 1000 | 1K-10K | > 10K |
Alert on queue depth trends, not absolute values. A queue depth of 500 that's been stable for hours is fine. A queue depth of 50 that was 0 an hour ago and is growing is a problem.
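One way to express "trend, not absolute" as an alert condition is to compare depth samples over a window and alert on the growth rate (a sketch; the default threshold is an illustrative assumption to tune per queue):

```typescript
interface DepthSample {
  timestamp: number; // ms since epoch
  depth: number;
}

// Alert when the queue is growing faster than workers drain it,
// regardless of the absolute depth.
function shouldAlert(samples: DepthSample[], growthPerMinuteThreshold = 0.5): boolean {
  if (samples.length < 2) return false;

  const first = samples[0];
  const last = samples[samples.length - 1];
  const minutes = (last.timestamp - first.timestamp) / 60_000;
  if (minutes <= 0) return false;

  const growthPerMinute = (last.depth - first.depth) / minutes;
  return growthPerMinute > growthPerMinuteThreshold;
}
```

With this rule, a queue sitting at 500 for an hour stays quiet, while a queue that went from 0 to 50 over the same hour fires.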
Common Pitfalls
- No subscriber control during worker saves. The single biggest source of event storms. Worker saves must not trigger the same event subscribers that dispatch work to workers.
- No source tracking on bidirectional sync. Without x-source headers, bidirectional sync between two systems loops forever.
- Using message ID for idempotency. Message IDs change on redelivery in some brokers. Use business keys (entity ID + action + time bucket) instead.
- Infinite retries. Set a maximum retry count. After exhaustion, move to the dead letter queue. Never retry forever.
- Ignoring dead letter queues. Dead letters are operational debt. Classify them (transient, permanent, poison) and handle each type differently.
- Assuming message ordering. Most workloads don't need ordering. Parallel processing is faster. Only use FIFO when state transitions or causal ordering require it.
- Same retry strategy for all failures. A timeout needs exponential backoff. An invalid payload needs zero retries. A rate limit needs fixed delay.
- No correlation IDs. Without them, debugging a failure chain across 5 services is impossible.
Key Takeaways
- Event storms are the most dangerous failure mode. One uncontrolled save cascade can saturate your entire infrastructure. The EventSubscriberSupervisor is not optional.
- Bidirectional sync needs source tracking. Every message carries the originating system ID. When you receive a message from yourself (via the other system), ignore it.
- Idempotency uses business keys, not message IDs. Recipient + category + entity + day bucket for notifications. Tenant + product + date for orders. The key must be business-meaningful.
- Dead letters need classification. Transient failures get retried. Permanent failures get archived. Poison messages get quarantined. Never ignore the dead letter queue.
- The broker follows the ecosystem. Kafka for high-throughput streaming, RabbitMQ for complex routing, BullMQ for Node.js job queues, Symfony Messenger for PHP. Don't pick a broker and then build around it.
- Most workloads don't need ordering. Parallel processing is simpler and faster. Use FIFO only when state transitions demand it.
We apply these patterns across our custom software projects, data engineering pipelines, and cloud deployments. If you're building an event-driven system or debugging one that's already in trouble, talk to our team or request a quote.