Technical Guide

Concurrency and Data Integrity: The Patterns That Saved Our Production

Production concurrency patterns for enterprise systems. Field ownership, optimistic locking, cooperative leases, idempotency stores, version management, and transaction governance layers.

January 24, 2026 · 16 min read · Oronts Engineering Team

The Race Condition You Don't See Until Production

Race conditions are invisible in development. Your local machine runs one process. Your test suite runs sequentially. Everything works. Then you deploy to production with 4 web pods and 3 worker pods, and two processes modify the same record at the same time. One overwrites the other's changes. Data is silently corrupted. Nobody notices until a customer complains.

We've fixed race conditions across multiple enterprise systems: CMS platforms with 20 editors and background workers, commerce platforms with concurrent order processing, and AI systems with parallel agent workflows. The patterns in this article are what survived.

For broader context, see our system architecture guide and event-driven architecture guide. For CMS-specific concurrency, our Pimcore workflow guide covers those patterns in depth.

Field Ownership: Who Can Write What

The root cause of most enterprise concurrency bugs: multiple writers modifying the same record through the same save path without coordination.

A CMS has editors writing product descriptions and workers generating thumbnails. Both call save(). Both persist the full object. If the worker saves after loading but before the editor saves, the editor's save overwrites the worker's thumbnail. If the editor saves first, the worker's save overwrites the editor's description.
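The lost update is easy to reproduce with a toy in-memory store (a sketch of the interleaving with hypothetical names, not any particular CMS API):

```typescript
// Toy record store: each "save" persists the full object, like a naive ORM.
type ProductRow = { description: string; thumbnail: string };

const db: ProductRow = { description: '', thumbnail: '' };

// Both actors load a full snapshot first (read-modify-write).
const editorCopy = { ...db };   // editor loads
const workerCopy = { ...db };   // worker loads

workerCopy.thumbnail = 'thumb_v2.jpg';
Object.assign(db, workerCopy);  // worker saves the full object

editorCopy.description = 'New copy from marketing';
Object.assign(db, editorCopy);  // editor saves a stale full object

// db.thumbnail is '' again: the worker's write was silently lost.
```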

The fix: assign every field to an owner.

field_ownership:
    Product:
        editor_owned:
            - name
            - description
            - images
        system_owned:
            - thumbnail
            - searchIndex
            - checksum
            - lastSyncTimestamp
        shared:
            - categories
            - price
            - availability
| Domain       | Owner                | Mutation Path                              | Conflict Strategy                |
|--------------|----------------------|--------------------------------------------|----------------------------------|
| Editor-owned | Admin users          | Standard save                              | No conflict (only editors write) |
| System-owned | Workers/integrations | Transaction layer with locks               | Retry on conflict                |
| Shared       | Both                 | Transaction layer with conflict resolution | Configurable: retry, skip, merge |

Editor-owned fields go through the standard save path. System-owned fields go through a transaction layer with locks and version checks. Shared fields use explicit conflict resolution strategies.

Without field ownership, you're relying on luck. With it, the system enforces who can write what and resolves conflicts deterministically.
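One way to make ownership machine-enforceable is a registry that routes each write to the save path its domain requires. This is a minimal sketch with hypothetical names (`routeWrite`, the field set), not a specific framework API:

```typescript
type Domain = 'editor_owned' | 'system_owned' | 'shared';

// Mirrors the YAML registry above (subset of fields for brevity).
const ownership: Record<string, Domain> = {
    name: 'editor_owned',
    description: 'editor_owned',
    thumbnail: 'system_owned',
    checksum: 'system_owned',
    price: 'shared',
};

// Rejects writes from the wrong actor and picks the mutation path.
function routeWrite(field: string, actor: 'editor' | 'system'): string {
    const domain = ownership[field];
    if (!domain) throw new Error(`Unregistered field: ${field}`);
    if (domain === 'editor_owned' && actor !== 'editor') {
        throw new Error(`${actor} may not write editor-owned field ${field}`);
    }
    if (domain === 'system_owned' && actor !== 'system') {
        throw new Error(`${actor} may not write system-owned field ${field}`);
    }
    // Editor-owned -> plain save; system-owned and shared -> transaction layer.
    return domain === 'editor_owned' ? 'standard_save' : 'transaction_layer';
}
```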

Optimistic Locking with Version Checks

Optimistic locking assumes conflicts are rare. Instead of locking before modification, it checks whether the record changed between load and save.

async function updateWithOptimisticLock(
    productId: string,
    updateFn: (product: Product) => void,
    maxRetries: number = 3,
): Promise<Product> {
    for (let attempt = 0; attempt < maxRetries; attempt++) {
        const product = await productRepo.findById(productId, { force: true });
        const versionBefore = product.versionCount;

        updateFn(product);

        // Check version hasn't changed before saving
        const currentVersion = await productRepo.getVersionCount(productId);
        if (currentVersion !== versionBefore) {
            if (attempt === maxRetries - 1) {
                throw new ConcurrencyError(
                    `Product ${productId} was modified concurrently (version ${versionBefore} -> ${currentVersion})`
                );
            }
            continue; // Retry with fresh data
        }

        await product.save();
        return product;
    }
    // Unreachable when maxRetries > 0, but satisfies the declared return type
    throw new ConcurrencyError(`Retries exhausted for product ${productId}`);
}

The version check is not atomic in this example. For true atomicity, use database-level support:

-- PostgreSQL: atomic version check + update
UPDATE products
SET name = $1, version_count = version_count + 1
WHERE id = $2 AND version_count = $3;

-- If 0 rows affected: concurrent modification detected
// TypeORM: @VersionColumn for automatic optimistic locking
@Entity()
class Product {
    @VersionColumn()
    version!: number;
    // TypeORM automatically checks version on save
    // Throws OptimisticLockVersionMismatchError on conflict
}

Optimistic locking works well when conflicts are rare (< 5% of writes). For high-contention scenarios (multiple workers processing the same record), use cooperative locks instead.
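The conditional-UPDATE semantics above can be simulated in memory to see why the atomic form is safe. This is a sketch; the store and helper names are hypothetical:

```typescript
type Row = { name: string; versionCount: number };

const table = new Map<string, Row>([['p1', { name: 'Old', versionCount: 7 }]]);

// Compare-and-swap: succeeds only if the stored version still matches,
// mirroring `WHERE id = $2 AND version_count = $3`.
function casUpdate(id: string, newName: string, expectedVersion: number): boolean {
    const row = table.get(id);
    if (!row || row.versionCount !== expectedVersion) return false; // 0 rows affected
    row.name = newName;
    row.versionCount += 1; // version_count = version_count + 1
    return true;
}
```

A writer holding a stale version number fails the check and must reload before retrying.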

Cooperative Lease-Based Locks

When multiple workers compete for the same resource, a cooperative lock serializes access. Unlike distributed mutexes, cooperative locks use lease-based semantics: the lock expires after a TTL, preventing deadlocks from crashed workers.

// Redis: atomic SET NX EX with token-based ownership
class RedisLockProvider {
    async acquire(key: string, ttlSeconds: number = 30): Promise<Lock | null> {
        const token = crypto.randomBytes(16).toString('hex');
        const acquired = await this.redis.set(key, token, 'NX', 'EX', ttlSeconds);
        return acquired ? new Lock(key, token, ttlSeconds) : null;
    }

    async release(lock: Lock): Promise<void> {
        // Atomic check-and-delete via Lua script
        const script = `
            if redis.call('get', KEYS[1]) == ARGV[1] then
                return redis.call('del', KEYS[1])
            else
                return 0
            end
        `;
        await this.redis.eval(script, 1, lock.key, lock.token);
    }

    async extend(lock: Lock, ttlSeconds: number): Promise<boolean> {
        const script = `
            if redis.call('get', KEYS[1]) == ARGV[1] then
                return redis.call('expire', KEYS[1], ARGV[2])
            else
                return 0
            end
        `;
        return !!(await this.redis.eval(script, 1, lock.key, lock.token, ttlSeconds));
    }
}
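For unit tests, the same acquire/release/extend contract can be satisfied by an in-memory provider. A sketch, assuming the lock shape matches the Redis version above:

```typescript
type MemLock = { key: string; token: string; expiresAt: number };

class InMemoryLockProvider {
    private locks = new Map<string, { token: string; expiresAt: number }>();

    acquire(key: string, ttlSeconds = 30): MemLock | null {
        const held = this.locks.get(key);
        if (held && held.expiresAt > Date.now()) return null; // SET NX failed
        const token = Math.random().toString(36).slice(2);
        const expiresAt = Date.now() + ttlSeconds * 1000;
        this.locks.set(key, { token, expiresAt });
        return { key, token, expiresAt };
    }

    release(lock: MemLock): void {
        // Token check mirrors the Lua check-and-delete: only the owner releases.
        const held = this.locks.get(lock.key);
        if (held && held.token === lock.token) this.locks.delete(lock.key);
    }

    extend(lock: MemLock, ttlSeconds: number): boolean {
        const held = this.locks.get(lock.key);
        if (!held || held.token !== lock.token) return false;
        held.expiresAt = Date.now() + ttlSeconds * 1000;
        return true;
    }
}
```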

Lock Heartbeat

Lock TTL alone is not enough. If an operation runs longer than expected, the lock expires and another worker acquires it. Now two workers run concurrently.

async function executeWithLock(key: string, operation: (signal: AbortSignal) => Promise<void>) {
    const lock = await lockProvider.acquire(key, 30);
    if (!lock) throw new LockError(`Could not acquire lock: ${key}`);

    // Heartbeat: extend the lock while the operation runs. Note that throwing
    // inside the interval callback would only produce an unhandled rejection,
    // not abort the operation, so a stolen lock is signaled via AbortController.
    const controller = new AbortController();
    const heartbeat = setInterval(async () => {
        const extended = await lockProvider.extend(lock, 30);
        if (!extended) {
            clearInterval(heartbeat);
            controller.abort(); // Lock was stolen; operation checks the signal and stops
        }
    }, 15000); // Extend every 15s (50% of TTL)

    try {
        await operation(controller.signal);
    } finally {
        clearInterval(heartbeat);
        await lockProvider.release(lock);
    }
}

Lock Scope Hierarchy

Different operations need different lock granularity:

| Scope       | Key Pattern                        | Use Case                         |
|-------------|------------------------------------|----------------------------------|
| Element     | lock:product:123                   | Full object save                 |
| Field group | lock:product:123:generatedAssets   | Partial update (thumbnails only) |
| Operation   | lock:product:123:thumbnail:en      | Single specific operation        |
Same product + same field group  -> wait/retry (serial)
Same product + different groups  -> parallel (safe)
Different products               -> always parallel

Narrower scopes allow more parallelism. A thumbnail worker and a search indexer can process the same product simultaneously if they lock different field groups.
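A small key builder keeps the hierarchy consistent across callers. A hypothetical helper, matching the key patterns in the table above:

```typescript
// Builds lock keys at element, field-group, or operation scope:
// the more trailing parts, the narrower the lock and the more parallelism.
function lockKey(type: string, id: string | number, ...parts: string[]): string {
    return ['lock', type, String(id), ...parts].join(':');
}
```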

Idempotency Stores

Network retries, message redelivery, and workflow replays cause the same operation to execute multiple times. Without idempotency, you get duplicate records, double charges, or repeated emails.

interface IdempotencyEntry {
    key: string;           // Business-meaningful key
    scope: string;         // Operation category
    status: string;        // PENDING | COMPLETED | FAILED
    requestHash: string;   // SHA-256 of normalized input
    resultId?: string;     // ID of created resource
    expiresAt: Date;       // TTL for cleanup
    createdAt: Date;
}

class IdempotencyStore {
    async checkAndAcquire(key: string, scope: string, requestHash: string): Promise<IdempotencyResult> {
        try {
            await this.db.insert('idempotency_keys', {
                key, scope, requestHash,
                status: 'PENDING',
                expires_at: new Date(Date.now() + 24 * 60 * 60 * 1000),
            });
            return { acquired: true };
        } catch (error) {
            if (isDuplicateKeyError(error)) {
                const existing = await this.db.findOne({ key, scope });
                if (existing.status === 'COMPLETED') {
                    return { acquired: false, cached: true, resultId: existing.resultId };
                }
                if (existing.status === 'PENDING' && isStale(existing)) {
                    // Stale PENDING: previous attempt crashed. Use a conditional
                    // update (matching the old createdAt) so that when several
                    // workers spot the stale entry, only one takes it over.
                    const taken = await this.db.update(
                        { key, scope, createdAt: existing.createdAt },
                        { status: 'PENDING', createdAt: new Date() },
                    );
                    return { acquired: taken > 0 };
                }
                return { acquired: false, inProgress: true };
            }
            throw error;
        }
    }

    async complete(key: string, scope: string, resultId: string): Promise<void> {
        await this.db.update({ key, scope }, { status: 'COMPLETED', resultId });
    }
}

Key Design

The idempotency key must be business-meaningful:

| Operation             | Key Components                              | Example                                       |
|-----------------------|---------------------------------------------|-----------------------------------------------|
| Add to wishlist       | customerId + productVariantId + wishlistId  | wish:cust_123:var_456:wl_789                  |
| Submit review         | customerId + productId                      | review:cust_123:prod_456                      |
| Send notification     | recipient + category + entityRef + dayBucket | notify:sara@beispiel.de:stock:var_456:2026-04-20 |
| Redeem loyalty points | customerId + orderId + points               | redeem:cust_123:ord_789:500                   |
| ERP import record     | sourceRecordId + importBatchId              | import:erp_456:batch_20260420                 |

Two distinct idempotency models:

  • API idempotency: For user-initiated mutations. Client provides key or it's generated from input hash. Cached response replayed on duplicate.
  • Job idempotency: For background processing. Dedupe key in job payload. Uses DB constraints, completion markers, or business-key checks.

Never mix them. They have different lifecycles and cleanup strategies.
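A builder can force the who/what/when components into every key. Hypothetical helpers (`idemKey`, `dayBucket`), with key shapes following the table above:

```typescript
// Composes a business-meaningful idempotency key; components are required,
// so a caller cannot accidentally fall back to a bare random UUID.
function idemKey(scope: string, ...components: (string | number)[]): string {
    if (components.length === 0) {
        throw new Error(`Idempotency key for "${scope}" needs business components`);
    }
    return [scope, ...components.map(String)].join(':');
}

// Day bucket for "at most once per day" semantics, e.g. stock notifications.
function dayBucket(date: Date): string {
    return date.toISOString().slice(0, 10); // YYYY-MM-DD
}
```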

The Version Explosion Problem

In systems where background workers call save(), every save creates a version. With 6 workers processing every product change, a single editor save generates 6+ unnecessary versions. Over months, products accumulate thousands of versions that consume storage, slow down the version history UI, and make actual editorial changes impossible to find.

The solution: scoped version guards that suppress version creation during system operations while preserving versions for editor saves.

// Reference-counted version guard
class ScopedVersionGuard {
    private static refCount = 0;

    suppress(): void {
        ScopedVersionGuard.refCount++;
        if (ScopedVersionGuard.refCount === 1) {
            VersionManager.disable();
        }
    }

    restore(): void {
        ScopedVersionGuard.refCount--;
        if (ScopedVersionGuard.refCount === 0) {
            VersionManager.enable();
        }
    }
}

// Usage: nested operations work correctly
const outerGuard = new ScopedVersionGuard();
outerGuard.suppress();
try {
    product.setThumbnail(asset);
    product.save(); // No version created

    const innerGuard = new ScopedVersionGuard();
    innerGuard.suppress();
    try {
        product.setChecksum(hash);
        product.save(); // Still no version
    } finally {
        innerGuard.restore(); // refCount goes from 2 to 1, still suppressed
    }
} finally {
    outerGuard.restore(); // refCount goes from 1 to 0, versions re-enabled
}

The result: editor saves create versions (audit trail preserved). Worker saves create operation log entries (observability without version bloat).

For how we implement this in Pimcore specifically, see our Pimcore workflow guide which covers PimTx's version guard in detail.

Common Pitfalls

  1. No field ownership. Without it, every writer competes for every field through the same save path. Define who owns what before writing the first line of concurrent code.

  2. Optimistic locking without retry. Detecting the conflict is not enough. The operation must retry with fresh data. Set a max retry count and handle exhaustion gracefully.

  3. Global lock disable. A global disable() flag breaks when multiple operations run concurrently. Use reference-counted, scoped guards.

  4. Idempotency keys without business meaning. A random UUID as idempotency key prevents nothing. The key must encode the business operation: who, what, when.

  5. No heartbeat on long-running locks. If the operation outlasts the TTL, the lock expires and another worker enters. Extend the lock at 50% of TTL.

  6. Ignoring stale PENDING entries. If a worker crashes while holding a PENDING idempotency key, the operation is permanently blocked. Detect and recover stale entries.

  7. Locking at the wrong granularity. Element-level locks serialize everything. Field-group locks allow parallelism where it's safe. Choose the narrowest safe scope.

  8. No version management strategy. Every save() creating a version is the default. In systems with workers, this creates thousands of useless versions. Suppress versions for system operations.

Key Takeaways

  • Field ownership prevents the most common race condition. Define which fields belong to editors, which to workers, and which are shared. The ownership registry determines locking strategy and conflict resolution.

  • Optimistic locking for low-contention writes. Check the version count before saving. Retry on conflict. Use database-level support (TypeORM @VersionColumn, PostgreSQL conditional update) for atomicity.

  • Cooperative leases for high-contention resources. Redis SET NX EX with token-based ownership. Heartbeat at 50% TTL. Lua scripts for atomic operations. Never use distributed mutexes without TTL.

  • Idempotency keys must be business-meaningful. Encode the operation semantics (who + what + when) into the key. Separate API idempotency from job idempotency.

  • Version guards preserve audit trails without explosion. Reference-counted suppression for system operations. Nesting works correctly. Editor saves still create versions.

We apply these patterns across our custom software projects and data engineering pipelines. If you're dealing with concurrency issues in production, talk to our team or request a quote.

Topics covered

concurrency patterns · race conditions · optimistic locking · data integrity · distributed locks · field ownership · idempotency · version management
