Technical Guide

Event-Driven Architecture in Practice: What Actually Goes Wrong

Real event-driven architecture patterns from production. Event storms, bidirectional sync loops, dead letters, idempotency stores, and choosing between Kafka, RabbitMQ, BullMQ, and Symfony Messenger.

February 23, 2026 · 18 min read · Oronts Engineering Team

The Event Storm Nobody Expects

Event-driven architecture looks clean on a whiteboard. Service A emits an event. Service B consumes it. Loose coupling. Independent scaling. Beautiful diagrams.

Then you deploy to production. One product save triggers 8 event subscribers. Each subscriber dispatches 3 async messages. Each message handler loads the product, modifies a field, and saves. Each save triggers 8 subscribers again. Within seconds, your message queue has thousands of messages for a single product update, your workers are saturated, and your database is under lock contention.

We've operated four different event-driven systems in production. Each used a different message broker. Each had different failure patterns. This article covers what actually goes wrong and how to fix it. For broader context, see our system architecture guide, which covers our methodology.

The Four Failure Patterns

Every event-driven system eventually encounters these:

| Pattern | What Happens | Severity |
| --- | --- | --- |
| Event storm | One action cascades into thousands of messages | Critical (can saturate infrastructure) |
| Bidirectional sync loop | System A syncs to B, B syncs back to A, infinite loop | Critical (infinite message growth) |
| Duplicate processing | Same message processed twice, creates duplicate data | High (data integrity) |
| Dead letter accumulation | Failed messages pile up with no resolution strategy | Medium (operational debt) |

1. Event Storms

The most common and most dangerous pattern. It happens when event handlers trigger new events, and those events trigger more handlers.

Product save
  -> EventSubscriber A dispatches MessageA
  -> EventSubscriber B dispatches MessageB
  -> EventSubscriber C dispatches MessageC
  -> ...8 subscribers total

WorkerA processes MessageA
  -> Modifies product, calls save()
  -> Product save triggers 8 subscribers again
  -> Each dispatches more messages

WorkerB processes MessageB (same pattern)
  -> 8 more messages

Result: 1 save -> 8 messages -> 8 saves -> 64 messages -> ...

The solution: EventSubscriberSupervisor. A process-scoped control that disables event subscribers during worker saves.

// Worker handler: disable subscribers before saving
async function handleGenerateThumbnail(message: ThumbnailMessage) {
    const product = await productService.findById(message.productId);

    subscriberSupervisor.disableAll();
    try {
        product.thumbnail = await generateThumbnail(product);
        await product.save(); // No subscribers fire, no new messages dispatched
    } finally {
        subscriberSupervisor.enableAll();
    }
}

The order matters. The save must happen inside the disabled scope. If you re-enable subscribers before saving, the save triggers the cascade.

For a deeper look at how we apply this pattern specifically in Pimcore, see our Pimcore workflow design guide.

2. Bidirectional Sync Loops

When two systems sync data bidirectionally, each update from A triggers a sync to B, which triggers a sync back to A. Without loop prevention, this runs forever.

System A (Commerce)         System B (Operations)
     β”‚                           β”‚
     β”‚  Customer updated         β”‚
     β”‚ ────────────────────────▢ β”‚
     β”‚                           β”‚  Sync received, save customer
     β”‚                           β”‚  Customer updated event fires
     β”‚  Sync received            β”‚
     β”‚ ◀──────────────────────── β”‚
     β”‚  Save customer            β”‚
     β”‚  Customer updated event   β”‚
     β”‚ ────────────────────────▢ β”‚
     β”‚           ...forever...   β”‚

The solution: source tracking. Every message carries a source identifier (an x-source header, or a source field in the payload) that names the originating system. When a system receives a message that originated from itself (relayed via the other system), it ignores it.

// When publishing a sync message
async function publishCustomerSync(customer: Customer, source: string) {
    await messageQueue.publish('customer.updated', {
        customerId: customer.id,
        data: customer,
        source: source,  // "commerce" or "operations"
    });
}

// When consuming a sync message
async function handleCustomerSync(message: CustomerSyncMessage) {
    // Ignore messages that originated from this system
    if (message.source === THIS_SYSTEM_ID) {
        logger.debug('Ignoring self-originated sync', { customerId: message.customerId });
        return; // ACK the message, don't process
    }

    const customer = await customerService.update(message.customerId, message.data);

    // Propagate the ORIGINAL source, not this system's ID. The originator
    // then recognizes its own ID on the message coming back and ignores it,
    // which is what breaks the loop.
    await publishCustomerSync(customer, message.source);
}

An alternative approach: message deduplication by content hash. Hash the message payload and check if the same hash was processed recently. If the data hasn't changed, the sync is a no-op.
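A sketch of that alternative, using Node's built-in crypto module. In production the `seen` map would be Redis keys with a TTL; an in-memory Map keeps the sketch self-contained, and the canonicalization here only sorts top-level keys:

```typescript
import { createHash } from 'crypto';

const seen = new Map<string, number>();
const WINDOW_MS = 60_000; // dedupe window

// Sort top-level keys so logically identical payloads hash identically
function canonicalize(payload: Record<string, unknown>): string {
    const sorted: Record<string, unknown> = {};
    for (const key of Object.keys(payload).sort()) sorted[key] = payload[key];
    return JSON.stringify(sorted);
}

function isDuplicate(payload: Record<string, unknown>, now = Date.now()): boolean {
    const hash = createHash('sha256').update(canonicalize(payload)).digest('hex');
    const lastSeen = seen.get(hash);
    if (lastSeen !== undefined && now - lastSeen < WINDOW_MS) {
        return true; // identical payload within the window: the sync is a no-op
    }
    seen.set(hash, now);
    return false;
}
```

This catches loops even without source headers, because a relayed-back update carries the same data and therefore the same hash.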

3. Duplicate Processing

Network failures, worker restarts, and at-least-once delivery guarantees mean the same message can be delivered more than once. Without idempotency, you get duplicate orders, duplicate emails, duplicate database records.

The solution: idempotency stores with business-key deduplication.

// Idempotency store with business key
class IdempotencyStore {
    async checkAndAcquire(key: string, scope: string): Promise<boolean> {
        try {
            await this.db.insert('idempotency_keys', {
                key,
                scope,
                status: 'PROCESSING',
                created_at: new Date(),
                expires_at: new Date(Date.now() + 24 * 60 * 60 * 1000), // 24h TTL
            });
            return true; // Acquired, proceed with processing
        } catch (error) {
            if (isDuplicateKeyError(error)) {
                return false; // Already processed, skip
            }
            throw error;
        }
    }

    async complete(key: string, scope: string): Promise<void> {
        await this.db.update('idempotency_keys',
            { status: 'COMPLETED', completed_at: new Date() },
            { key, scope }
        );
    }

    async fail(key: string, scope: string): Promise<void> {
        // Release the key so the queue's retry can acquire it again
        await this.db.delete('idempotency_keys', { key, scope });
    }
}

// Usage in a worker
async function handleNotification(message: NotificationMessage) {
    const dedupeKey = `${message.recipientEmail}:${message.category}:${message.entityRef}:${todayBucket()}`;

    const acquired = await idempotencyStore.checkAndAcquire(dedupeKey, 'notification');
    if (!acquired) {
        logger.info('Duplicate notification skipped', { dedupeKey });
        return; // ACK the message
    }

    try {
        await emailService.send(message);
        await idempotencyStore.complete(dedupeKey, 'notification');
    } catch (error) {
        await idempotencyStore.fail(dedupeKey, 'notification');
        throw error; // NACK, let the queue retry
    }
}

The dedupe key must be business-meaningful. For notifications: recipient + category + entity + day bucket (prevents sending the same notification twice in one day but allows it the next day). For orders: tenant + product + date. For imports: source record ID + import batch ID.
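A sketch of those key builders. `todayBucket` is the day-bucket helper the notification worker above relies on (shown here as an assumed UTC implementation):

```typescript
// UTC day bucket: same notification can be sent again the next day
function todayBucket(date = new Date()): string {
    return date.toISOString().slice(0, 10); // "YYYY-MM-DD"
}

// Notifications: recipient + category + entity + day bucket
function notificationKey(recipient: string, category: string, entityRef: string): string {
    return `${recipient}:${category}:${entityRef}:${todayBucket()}`;
}

// Orders: tenant + product + date
function orderKey(tenantId: string, productId: string, date: Date): string {
    return `${tenantId}:${productId}:${date.toISOString().slice(0, 10)}`;
}
```

All three parts of each key come from the business domain, so the same key is reproduced on redelivery regardless of what message ID the broker assigns.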

4. Dead Letter Handling

Messages that fail repeatedly end up in a dead letter queue. Most teams set up the DLQ and then ignore it. Dead letters accumulate. Months later, someone discovers thousands of unprocessed messages.

// Dead letter handler with classification
async function processDeadLetter(message: DeadLetterMessage) {
    const failureType = classifyFailure(message.error);

    switch (failureType) {
        case 'TRANSIENT':
            // Network timeout, service temporarily down
            // Requeue with exponential backoff
            await requeue(message, { delay: calculateBackoff(message.retryCount) });
            break;

        case 'PERMANENT':
            // Invalid data, schema mismatch, business rule violation
            // Log, alert, archive. Do not retry.
            await archiveDeadLetter(message);
            await alertOps('Permanent failure in dead letter', message);
            break;

        case 'POISON':
            // Message itself causes crashes (malformed payload)
            // Archive immediately, never retry
            await archiveDeadLetter(message);
            await alertOps('Poison message detected', message);
            break;
    }
}

function classifyFailure(error: string): 'TRANSIENT' | 'PERMANENT' | 'POISON' {
    if (error.includes('ECONNREFUSED') || error.includes('timeout')) return 'TRANSIENT';
    if (error.includes('SyntaxError') || error.includes('Cannot parse')) return 'POISON';
    return 'PERMANENT';
}

The key insight: not all dead letters are the same. Transient failures should be retried. Permanent failures should be archived and investigated. Poison messages should be quarantined immediately.
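The `calculateBackoff` helper referenced in the transient branch above can be sketched as capped exponential backoff with full jitter (the base delay and cap are illustrative values, not prescriptions):

```typescript
const BASE_DELAY_MS = 1_000;
const MAX_DELAY_MS = 15 * 60 * 1_000; // cap at 15 minutes

// Exponential backoff with full jitter: delay doubles per retry, is capped,
// then a uniform random factor spreads retries to avoid thundering herds.
function calculateBackoff(retryCount: number, jitter: () => number = Math.random): number {
    const exponential = BASE_DELAY_MS * 2 ** retryCount;
    const capped = Math.min(exponential, MAX_DELAY_MS);
    return Math.floor(jitter() * capped); // uniform in [0, capped]
}
```

Injecting the `jitter` function keeps the helper deterministic under test while using `Math.random` in production.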

For how we handle failure patterns in AI systems specifically, see our AI failure modes guide.

Choosing a Message Broker

We've used four different brokers in production. Each has a distinct sweet spot.

| Broker | Best For | Not Great For | Deployment Complexity |
| --- | --- | --- | --- |
| Kafka | High-throughput event streaming, event sourcing, replay | Simple job queues, low-volume systems | High (ZooKeeper/KRaft, partitions, consumer groups) |
| RabbitMQ | Routing, dead letters, priority queues, complex topologies | Very high throughput (millions/sec), event replay | Medium (clustering, HA policies) |
| BullMQ (Redis) | Job queues, delayed jobs, rate limiting, simple setups | Complex routing, message replay, multi-consumer patterns | Low (just Redis) |
| Symfony Messenger | PHP/Symfony apps, Pimcore, simple async dispatch | Non-PHP ecosystems, complex broker features | Low (uses Doctrine, AMQP, or Redis as transport) |

When to Use Kafka

  • You need event replay (consumers can re-read from any offset)
  • You have multiple consumer groups processing the same events differently
  • Throughput exceeds 100K messages per second
  • You're doing event sourcing or building a change data capture pipeline
  • You need guaranteed ordering within a partition

When to Use RabbitMQ

  • You need complex routing (topic exchanges, headers-based routing)
  • Dead letter queues with automatic retry policies are important
  • Message priority matters (urgent messages processed first)
  • You need request-reply patterns (RPC over messages)
  • Your system has multiple services in different languages

When to Use BullMQ

  • You're in a Node.js/TypeScript ecosystem (Vendure, NestJS)
  • Job scheduling with delays and cron patterns
  • Rate limiting per queue or per job type
  • You already have Redis for caching and sessions
  • Your system is small to medium scale (under 10K messages/sec)

When to Use Symfony Messenger

  • You're in a PHP/Symfony/Pimcore ecosystem
  • You need simple async dispatch for background tasks
  • Worker infrastructure follows Symfony conventions (supervisord)
  • Transport flexibility (switch between Doctrine, AMQP, Redis without code changes)

For our Pimcore projects, we use Symfony Messenger with RabbitMQ transport. For Vendure projects, we use BullMQ. For high-throughput data ingestion platforms, we use Kafka. The broker choice follows the ecosystem, not the other way around.

Message Ordering

Most teams assume message ordering matters. Usually it doesn't.

| Scenario | Ordering Needed? | Why |
| --- | --- | --- |
| Send notification email | No | Emails are inherently unordered |
| Generate thumbnail | No | Latest version wins |
| Update search index | No | Latest state wins (eventual consistency) |
| Process payment steps | Yes | Charge must happen before capture |
| Order state transitions | Yes | Can't ship before payment confirmed |
| Event sourcing replay | Yes | Events must replay in causal order |

When ordering matters, use FIFO queues (SQS FIFO, Kafka partitions keyed by entity ID, RabbitMQ with single consumer). When it doesn't, parallel processing is faster and simpler.

// Kafka: ordering within a partition (keyed by product ID)
await producer.send({
    topic: 'product-updates',
    messages: [{
        key: productId,  // All updates for same product go to same partition
        value: JSON.stringify(event),
    }],
});

// BullMQ: no ordering needed for thumbnails
await thumbnailQueue.add('generate', { productId }, {
    removeOnComplete: true,
    attempts: 3,
    backoff: { type: 'exponential', delay: 5000 },
});

Retry Strategies

Not all retries are equal. The strategy depends on the failure type.

| Strategy | When to Use | Example |
| --- | --- | --- |
| Immediate retry | Transient glitch (race condition) | Optimistic lock failure |
| Exponential backoff | Service temporarily down | External API timeout |
| Fixed delay | Rate limiting | API returns 429 |
| No retry | Permanent failure | Invalid payload, business rule violation |

// BullMQ retry configuration
const queue = new Queue('notifications', {
    defaultJobOptions: {
        attempts: 5,
        backoff: {
            type: 'exponential',
            delay: 60000, // 1min, 2min, 4min, 8min, 16min
        },
        removeOnComplete: { count: 1000 },
        removeOnFail: false, // Keep for dead letter analysis
    },
});

After all retries are exhausted, the message moves to a dead letter queue. Never retry infinitely. Set a maximum attempt count and handle the failure explicitly.

Monitoring Event-Driven Systems

The hardest part of event-driven systems is debugging. When something fails, the error might be 5 services and 12 messages away from the original cause.

Correlation IDs

Every message carries a correlation ID that links it to the original action:

// Original API request generates correlation ID
const correlationId = generateUUID();

// Every message in the chain carries it
await queue.add('process-order', {
    orderId: order.id,
    correlationId,
});

// Workers pass it to downstream messages
async function handleProcessOrder(job) {
    const { orderId, correlationId } = job.data;

    // ... process order ...

    // Downstream messages carry the same correlation ID
    await notificationQueue.add('send-confirmation', {
        orderId,
        correlationId, // Same ID, traceable back to the original request
    });
}

When debugging, query all logs and events by correlation ID to see the full chain. For more on distributed tracing patterns, see our AI observability guide.

Queue Health Metrics

| Metric | Healthy | Warning | Critical |
| --- | --- | --- | --- |
| Queue depth | < 100 | 100-1000 | > 1000 |
| Processing time (p95) | < 5s | 5-30s | > 30s |
| Failure rate | < 1% | 1-5% | > 5% |
| Dead letter count | 0 | 1-10 | > 10 |
| Consumer lag (Kafka) | < 1000 | 1K-10K | > 10K |

Alert on queue depth trends, not absolute values. A queue depth of 500 that's been stable for hours is fine. A queue depth of 50 that was 0 an hour ago and is growing is a problem.
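That trend check can be sketched as a growth-rate test over periodic depth samples (the `DepthSample` shape and the default threshold are illustrative assumptions):

```typescript
interface DepthSample {
    at: number;    // epoch millis when the sample was taken
    depth: number; // queue depth at that time
}

// Alert on sustained growth rate, not on an absolute depth threshold
function isQueueGrowing(samples: DepthSample[], minGrowthPerMin = 0.5): boolean {
    if (samples.length < 2) return false;
    const first = samples[0];
    const last = samples[samples.length - 1];
    const minutes = (last.at - first.at) / 60_000;
    if (minutes <= 0) return false;
    const growthPerMin = (last.depth - first.depth) / minutes;
    return growthPerMin >= minGrowthPerMin;
}
```

A stable depth of 500 produces near-zero growth and stays quiet, while a queue climbing from 0 toward 50 over an hour trips the alert.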

Common Pitfalls

  1. No subscriber control during worker saves. The single biggest source of event storms. Worker saves must not trigger the same event subscribers that dispatch work to workers.

  2. No source tracking on bidirectional sync. Without x-source headers, bidirectional sync between two systems loops forever.

  3. Using message ID for idempotency. Message IDs change on redelivery in some brokers. Use business keys (entity ID + action + time bucket) instead.

  4. Infinite retries. Set a maximum retry count. After exhaustion, move to dead letter queue. Never retry forever.

  5. Ignoring dead letter queues. Dead letters are operational debt. Classify them (transient, permanent, poison) and handle each type differently.

  6. Assuming message ordering. Most workloads don't need ordering. Parallel processing is faster. Only use FIFO when state transitions or causal ordering require it.

  7. Same retry strategy for all failures. A timeout needs exponential backoff. An invalid payload needs zero retries. A rate limit needs fixed delay.

  8. No correlation IDs. Without them, debugging a failure chain across 5 services is impossible.

Key Takeaways

  • Event storms are the most dangerous failure mode. One uncontrolled save cascade can saturate your entire infrastructure. The EventSubscriberSupervisor is not optional.

  • Bidirectional sync needs source tracking. Every message carries the originating system ID. When you receive a message from yourself (via the other system), ignore it.

  • Idempotency uses business keys, not message IDs. Recipient + category + entity + day bucket for notifications. Tenant + product + date for orders. The key must be business-meaningful.

  • Dead letters need classification. Transient failures get retried. Permanent failures get archived. Poison messages get quarantined. Never ignore the dead letter queue.

  • The broker follows the ecosystem. Kafka for high-throughput streaming, RabbitMQ for complex routing, BullMQ for Node.js job queues, Symfony Messenger for PHP. Don't pick a broker and then build around it.

  • Most workloads don't need ordering. Parallel processing is simpler and faster. Use FIFO only when state transitions demand it.

We apply these patterns across our custom software projects, data engineering pipelines, and cloud deployments. If you're building an event-driven system or debugging one that's already in trouble, talk to our team or request a quote.

Topics covered

event-driven architecture, message queues production, event sourcing, CQRS production, RabbitMQ vs Kafka, BullMQ, event storms, idempotency, dead letter queue, bidirectional sync
