Technical Guide

Product Data Systems That Actually Work: From ERP to Channel

How to design product data pipelines. ERP to PIM to search to commerce to export. Classification systems, variant management, asset pipelines, and multi-channel distribution.

March 21, 2026 | 14 min read | Oronts Engineering Team

The Product Data Problem Nobody Talks About

Product data looks simple in a spreadsheet. Name, description, price, image. Then reality hits. You have 50,000 products with 200 attributes each, in 12 languages, from 3 source systems, distributed to 5 output channels, and the data quality is inconsistent across all of them.

The real problem is not storing product data. Any database does that. The real problem is the pipeline: how data flows from source systems through enrichment to output channels, with validation at every stage and different formats for every destination.

We've designed product data systems for B2B manufacturers with complex classification hierarchies, multi-locale content, ERP integration, and multi-channel output (web, search, marketplace, export). This article covers the architecture patterns. For PIM-specific implementation, see our PIM implementation guide. For Pimcore workflow patterns, see our Pimcore workflow guide.

The Pipeline: Source to Channel

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   SOURCES    │     │   MASTER     │     │   CHANNELS   │
│              │     │              │     │              │
│  ERP/SAP     │────▶│  PIM System  │────▶│  Website     │
│  Suppliers   │     │  (Pimcore,   │     │  Search Index│
│  Spreadsheets│     │   Akeneo)    │     │  Marketplace │
│  Manual entry│     │              │     │  Print/PDF   │
│              │     │  Enrich      │     │  Partner API │
│              │     │  Validate    │     │  Data Feed   │
│              │     │  Approve     │     │              │
└──────────────┘     └──────────────┘     └──────────────┘

Source Layer

Products enter the system from multiple sources. Each source has different data quality, different formats, and different update frequencies.

| Source | Data Quality | Format | Frequency |
| --- | --- | --- | --- |
| ERP (SAP, Oracle) | Structured, reliable | API / flat file | Daily batch or real-time |
| Supplier feeds | Variable, often messy | CSV, XML, JSON | Weekly or on-demand |
| Manual entry | High quality, low volume | PIM admin UI | Continuous |
| Spreadsheet imports | Error-prone | XLSX, CSV | Ad-hoc |

The import layer must handle: field mapping (the supplier calls it "item_name", SAP calls it "MAKTX"), data transformation (price in EUR to cents), deduplication (same product from two sources), and conflict resolution (which source wins for which field).

This is exactly the problem our Vendure Data Hub Plugin solves with 9 extractors, 61 transform operators, and configurable field mapping.
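To make the four import concerns concrete, here is a minimal sketch. It is not the Data Hub Plugin's API: the column names other than `MATNR` and `MAKTX`, the `FIELD_MAP` and `FIELD_OWNER` tables, and all function names are assumptions made for illustration.

```typescript
type Source = 'erp' | 'supplier';
type CanonicalField = 'sku' | 'name' | 'price';

interface MasterRecord {
    sku: string;
    name?: string;
    priceCents?: number;
}

interface ImportedRecord extends MasterRecord {
    source: Source;
}

// Hypothetical field maps: source column name -> canonical field.
const FIELD_MAP: Record<Source, Record<string, CanonicalField>> = {
    erp: { MATNR: 'sku', MAKTX: 'name', PRICE: 'price' },
    supplier: { item_no: 'sku', item_name: 'name', price_eur: 'price' },
};

// Hypothetical ownership rules: which source wins for which field.
const FIELD_OWNER: { name: Source; priceCents: Source } = {
    name: 'supplier',
    priceCents: 'erp',
};

// Transformation: decimal EUR string ("129.00") to integer cents (12900).
function eurToCents(value: string): number {
    return Math.round(Number(value) * 100);
}

// Field mapping: rename source columns to canonical fields.
function mapRow(source: Source, row: Record<string, string>): ImportedRecord {
    const record: ImportedRecord = { source, sku: '' };
    for (const column of Object.keys(row)) {
        const field = FIELD_MAP[source][column];
        const value = row[column];
        if (field === 'sku') record.sku = value.trim();
        else if (field === 'name') record.name = value.trim();
        else if (field === 'price') record.priceCents = eurToCents(value);
        // Columns without a mapping are ignored in this sketch.
    }
    return record;
}

// Deduplication and conflict resolution: one record per SKU.
// The owning source always wins; other sources only fill empty fields.
function mergeBySku(records: ImportedRecord[]): Record<string, MasterRecord> {
    const merged: Record<string, MasterRecord> = {};
    for (const rec of records) {
        const current = merged[rec.sku] || { sku: rec.sku };
        if (rec.name !== undefined && (rec.source === FIELD_OWNER.name || current.name === undefined)) {
            current.name = rec.name;
        }
        if (rec.priceCents !== undefined && (rec.source === FIELD_OWNER.priceCents || current.priceCents === undefined)) {
            current.priceCents = rec.priceCents;
        }
        merged[rec.sku] = current;
    }
    return merged;
}
```

Because ownership is decided per field rather than per record, the merge result does not depend on the order in which the sources are imported.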

Master Layer (PIM)

The PIM is the single source of truth for product data. Every field has one authoritative value. Every change is versioned. Every product goes through an editorial workflow before publication.

Key PIM responsibilities:

  • Data enrichment: Add descriptions, images, translations that don't come from the ERP
  • Classification: Assign products to hierarchical categories with typed attributes
  • Validation: Ensure required fields are filled before publication
  • Workflow: Editorial review and approval before data reaches output channels
  • Versioning: Track every change, support draft editing without affecting live data

Channel Layer

Each output channel needs product data in a different format:

| Channel | Format | Content | Update Frequency |
| --- | --- | --- | --- |
| Website | JSON API | Full product with images, descriptions, variants | Real-time (event-driven) |
| Search index | Denormalized document | Searchable fields, facets, prices | Near real-time |
| Marketplace | Feed XML/CSV | Platform-specific fields, categories | Scheduled (hourly/daily) |
| Print/PDF | Structured data | Selected fields, high-res images | On-demand |
| Partner API | REST/GraphQL | Contracted fields only | Real-time |
| Data feed | CSV/XML | Google Merchant, Meta Catalog | Scheduled |

The same product data, transformed differently for each channel. The PIM stores the master data. The distribution layer transforms and delivers.

Classification Systems

Products have attributes. A shoe has size, color, material, and sole type. A faucet has flow rate, connection type, finish, and certification. A server has CPU, RAM, storage, and rack units.

Flat Attributes vs Classification Store

Flat attributes (columns on the product table) work for simple catalogs with uniform products. Every product has the same fields.

Classification stores (dynamic key-value with groups) work for diverse catalogs where different product types have different attributes.

Classification Store:
├── Group: Dimensions
│   ├── Key: width (float, unit: mm)
│   ├── Key: height (float, unit: mm)
│   └── Key: depth (float, unit: mm)
├── Group: Technical
│   ├── Key: flow_rate (float, unit: l/min)
│   ├── Key: pressure (float, unit: bar)
│   └── Key: connection_type (select: 3/8", 1/2", 3/4")
└── Group: Certifications
    ├── Key: ce_mark (boolean)
    ├── Key: tuv (boolean)
    └── Key: energy_label (select: A-G)

Classification stores scale to thousands of attributes without schema changes. New attributes are configuration, not migration. But they're harder to query (key-value lookups instead of column access) and harder to validate (schema is dynamic).

In Pimcore, the Classification Store provides this capability out of the box with localized values, group-based organization, and admin UI integration.
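Because the schema is dynamic, validation has to be driven by the key definitions themselves. The sketch below shows one way to do that in plain TypeScript. It does not use Pimcore's API; `KeyDefinition`, `CLASSIFICATION_STORE`, and `validateAttributes` are hypothetical names, and the groups and keys mirror the example tree above.

```typescript
type KeyDefinition =
    | { type: 'float'; unit: string }
    | { type: 'boolean' }
    | { type: 'select'; options: string[] };

// Group -> key -> definition. Adding a key is a config change, not a migration.
const CLASSIFICATION_STORE: Record<string, Record<string, KeyDefinition>> = {
    Dimensions: {
        width: { type: 'float', unit: 'mm' },
        height: { type: 'float', unit: 'mm' },
        depth: { type: 'float', unit: 'mm' },
    },
    Technical: {
        flow_rate: { type: 'float', unit: 'l/min' },
        pressure: { type: 'float', unit: 'bar' },
        connection_type: { type: 'select', options: ['3/8"', '1/2"', '3/4"'] },
    },
    Certifications: {
        ce_mark: { type: 'boolean' },
        tuv: { type: 'boolean' },
        energy_label: { type: 'select', options: ['A', 'B', 'C', 'D', 'E', 'F', 'G'] },
    },
};

// Values as stored on a product: group -> key -> raw value.
type AttributeValues = Record<string, Record<string, unknown>>;

function isValidValue(def: KeyDefinition, value: unknown): boolean {
    if (def.type === 'float') return typeof value === 'number' && isFinite(value);
    if (def.type === 'boolean') return typeof value === 'boolean';
    // Remaining case: select. The value must be one of the configured options.
    return typeof value === 'string' && def.options.indexOf(value) !== -1;
}

// Returns a list of problems; an empty list means all values match their definitions.
function validateAttributes(values: AttributeValues): string[] {
    const problems: string[] = [];
    for (const group of Object.keys(values)) {
        const definitions = CLASSIFICATION_STORE[group] || {};
        const groupValues = values[group] || {};
        for (const key of Object.keys(groupValues)) {
            const def = definitions[key];
            if (!def) problems.push(`${group}.${key}: unknown key`);
            else if (!isValidValue(def, groupValues[key])) problems.push(`${group}.${key}: expected ${def.type}`);
        }
    }
    return problems;
}
```

The trade-off described above is visible here: every read and every check goes through a lookup in the definition table instead of relying on a typed column.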

Variant Management

Products with variants (size, color, configuration) are the source of most data complexity.

Variant Architecture

Product (parent)
├── name: "Running Shoe Pro"
├── description: "..."
├── brand: "..."
├── images: [hero.jpg, detail.jpg]
│
├── Variant: Size 40, Black
│   ├── sku: "RSP-40-BLK"
│   ├── price: 12900  (cents)
│   ├── stock: 15
│   └── ean: "4012345678901"
│
├── Variant: Size 40, White
│   ├── sku: "RSP-40-WHT"
│   ├── price: 12900
│   ├── stock: 8
│   └── ean: "4012345678902"
│
└── Variant: Size 42, Black
    ├── sku: "RSP-42-BLK"
    ├── price: 12900
    ├── stock: 0  (out of stock)
    └── ean: "4012345678903"

Inherited vs variant-specific data:

  • Inherited (from parent): name, description, brand, category, shared images
  • Variant-specific: SKU, price, stock, EAN, size, color, variant-specific images

The inheritance model reduces duplication: change the product description once and it updates for all variants. Variant-specific overrides must still be possible (different price per size, different image per color).
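A minimal sketch of this inheritance model: a variant is resolved at read time by layering its optional overrides on top of the parent. The `overrides` field and the `resolveVariant` function are assumptions for illustration, not a particular PIM's data model.

```typescript
interface ParentProduct {
    name: string;
    description: string;
    brand: string;
    images: string[];
}

interface Variant {
    sku: string;
    price: number; // cents
    stock: number;
    ean: string;
    // Optional overrides; when a field is absent, the parent value is used.
    overrides?: Partial<ParentProduct>;
}

type ResolvedVariant = ParentProduct & Omit<Variant, 'overrides'>;

// Precedence, lowest to highest: parent fields, overrides, variant-specific fields.
function resolveVariant(parent: ParentProduct, variant: Variant): ResolvedVariant {
    const { overrides = {}, ...own } = variant;
    return { ...parent, ...overrides, ...own };
}
```

Since nothing is copied onto the variant, editing the parent description changes every resolved variant that has not overridden it.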

The Combinatorial Explosion

A product with 5 sizes and 8 colors has 40 possible variants. Add 3 materials and you have 120. Add 2 widths and you have 240. Most of these combinations don't exist as real products.

Solutions:

  • Explicit variants only: Create only the combinations that exist. No auto-generation.
  • Availability matrix: Define which combinations are valid. Auto-generate only valid ones.
  • Virtual variants: Calculate at query time from attribute sets. Don't store individual records.
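The availability-matrix approach can be sketched as a Cartesian product filtered by a rule. The axis values and the rule below are invented for the example; only the counts (5 sizes, 8 colors, 3 materials, 2 widths) come from the text above.

```typescript
type Combination = Record<string, string>;

// Cartesian product of all option axes: every theoretically possible variant.
function allCombinations(axes: Record<string, string[]>): Combination[] {
    let result: Combination[] = [{}];
    for (const axis of Object.keys(axes)) {
        const next: Combination[] = [];
        for (const partial of result) {
            for (const option of axes[axis]) {
                next.push({ ...partial, [axis]: option });
            }
        }
        result = next;
    }
    return result;
}

// Availability matrix expressed as a predicate: generate only what it accepts.
function validCombinations(
    axes: Record<string, string[]>,
    isAvailable: (combination: Combination) => boolean,
): Combination[] {
    return allCombinations(axes).filter(isAvailable);
}

// Hypothetical axes matching the counts in the text: 5 x 8 x 3 x 2 = 240.
const SHOE_AXES: Record<string, string[]> = {
    size: ['38', '40', '42', '44', '46'],
    color: ['black', 'white', 'brown', 'red', 'blue', 'green', 'grey', 'navy'],
    material: ['mesh', 'leather', 'knit'],
    width: ['regular', 'wide'],
};

// Hypothetical rule: leather only in black or brown, wide fit only from size 42.
function shoeIsAvailable(c: Combination): boolean {
    const leatherOk = c.material !== 'leather' || c.color === 'black' || c.color === 'brown';
    const widthOk = c.width !== 'wide' || Number(c.size) >= 42;
    return leatherOk && widthOk;
}
```

In a real catalog the predicate would usually be backed by data (a table of allowed pairs) rather than code, so that product managers can maintain it.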

Asset Pipeline

Product images, technical drawings, PDFs, and 3D models need their own pipeline.

Upload/Import
  │
  ├── Format validation (type, size, resolution)
  ├── Metadata extraction (EXIF, dimensions)
  ├── Thumbnail generation (multiple sizes)
  ├── CDN distribution
  └── Association to product/variant

Asset Challenges

| Challenge | Solution |
| --- | --- |
| Multiple image sizes needed | Generate thumbnails on upload or on-demand |
| Images from suppliers are low quality | Minimum resolution requirements, rejection workflow |
| Asset-product association | Naming convention (SKU-based) or manual assignment |
| Storage costs at scale | Cloud storage (S3, Azure Blob) with CDN |
| Localized assets | Different images per locale (lifestyle vs technical) |
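As an illustration of SKU-based association, the sketch below assumes a file naming convention of `<SKU>_<two-digit position>.<extension>`, for example `RSP-40-BLK_01.jpg`. The convention, the regular expression, and the function names are assumptions; files that do not match fall back to manual assignment.

```typescript
interface ParsedAsset {
    sku: string;
    position: number;
    extension: string;
}

// Assumed convention: <SKU>_<two-digit position>.<extension>
const ASSET_NAME = /^([A-Z0-9-]+)_(\d{2})\.(jpe?g|png|webp|pdf)$/i;

function parseAssetName(fileName: string): ParsedAsset | null {
    const match = ASSET_NAME.exec(fileName);
    if (!match) return null;
    return {
        sku: match[1].toUpperCase(),
        position: parseInt(match[2], 10),
        extension: match[3].toLowerCase(),
    };
}

interface AssociationResult {
    matched: Record<string, string[]>; // SKU -> file names ordered by position
    unmatched: string[];               // files left for manual assignment
}

function associateAssets(fileNames: string[], knownSkus: string[]): AssociationResult {
    const grouped: Record<string, { file: string; position: number }[]> = {};
    const unmatched: string[] = [];
    for (const file of fileNames) {
        const parsed = parseAssetName(file);
        if (!parsed || knownSkus.indexOf(parsed.sku) === -1) {
            unmatched.push(file);
            continue;
        }
        (grouped[parsed.sku] = grouped[parsed.sku] || []).push({ file, position: parsed.position });
    }
    const matched: Record<string, string[]> = {};
    for (const sku of Object.keys(grouped)) {
        matched[sku] = grouped[sku]
            .sort((a, b) => a.position - b.position)
            .map(entry => entry.file);
    }
    return { matched, unmatched };
}
```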

For Pimcore specifically, we documented a performance bug where asset dimension lookups trigger remote storage I/O on every page render. The fix is described in our Pimcore upgrade guide.

Multi-Channel Distribution

Event-Driven Distribution

When a product is published in the PIM, events trigger distribution to each channel:

// PIM publishes product -> events trigger channel updates
eventBus.on('product.published', async (product) => {
    await Promise.allSettled([
        searchIndexer.index(product),           // Update search
        feedGenerator.queue(product),            // Queue for marketplace feeds
        cacheInvalidator.invalidate(product.id), // Invalidate website cache
        partnerApi.notify(product.id),           // Notify partner systems
    ]);
});
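One caveat with the handler above: `Promise.allSettled` never rejects, so a failed channel update goes unnoticed unless the results are inspected. A minimal sketch of that inspection follows. `collectFailures` and `ChannelFailure` are hypothetical names, and the `Settled` type is a local structural copy of the settled-result shape so the example stands alone.

```typescript
// Same shape as the entries returned by Promise.allSettled.
type Settled =
    | { status: 'fulfilled'; value: unknown }
    | { status: 'rejected'; reason: unknown };

interface ChannelFailure {
    channel: string;
    productId: string;
    reason: string;
}

// Pairs each result with its channel name (same order as the promises)
// and returns only the failures, ready for logging or a retry queue.
function collectFailures(productId: string, channels: string[], results: Settled[]): ChannelFailure[] {
    const failures: ChannelFailure[] = [];
    results.forEach((result, index) => {
        if (result.status === 'rejected') {
            const reason = result.reason instanceof Error ? result.reason.message : String(result.reason);
            failures.push({ channel: channels[index], productId, reason });
        }
    });
    return failures;
}
```

Inside the handler, the array returned by `await Promise.allSettled([...])` can be passed as `results`, with `channels` listing the channel names in the same order as the promises.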

Each channel transformer converts the master data to channel-specific format:

// Website: full data with SEO fields
function toWebProduct(product: PimProduct): WebProduct {
    return {
        slug: product.slug,
        name: product.name,
        description: product.description,
        seoTitle: product.seoTitle || product.name,
        seoDescription: product.seoDescription || truncate(product.description, 160),
        images: product.images.map(img => ({
            url: cdn.getUrl(img, 'large'),
            alt: img.alt || product.name,
        })),
        variants: product.variants.filter(v => v.active),
        // ... full product data
    };
}

// Marketplace feed: platform-specific format
function toGoogleMerchantItem(product: PimProduct, variant: PimVariant): MerchantItem {
    return {
        id: variant.sku,
        title: `${product.name} - ${variant.size} ${variant.color}`,
        description: stripHtml(product.description),
        link: `https://shop.example.com/p/${product.slug}`,
        image_link: cdn.getUrl(product.images[0], 'large'),
        price: `${(variant.price / 100).toFixed(2)} EUR`,
        availability: variant.stock > 0 ? 'in_stock' : 'out_of_stock',
        brand: product.brand,
        gtin: variant.ean,
        condition: 'new',
    };
}
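The search index follows the same pattern but needs a denormalized document: variant data is flattened into facet arrays and a price range on the product. The sketch below uses its own simplified input types (`ProductInput`, `VariantInput`) and an assumed `SearchDocument` shape; a real index schema will differ.

```typescript
interface VariantInput {
    sku: string;
    size: string;
    color: string;
    price: number; // cents
    stock: number;
    active: boolean;
}

interface ProductInput {
    id: string;
    name: string;
    brand: string;
    categories: string[];
    variants: VariantInput[];
}

interface SearchDocument {
    id: string;
    name: string;
    brand: string;
    categories: string[];
    skus: string[];          // searchable by any variant SKU
    sizes: string[];         // facet
    colors: string[];        // facet
    minPrice: number | null; // null when no active variant exists
    maxPrice: number | null;
    inStock: boolean;
}

function unique(values: string[]): string[] {
    return values.filter((value, index) => values.indexOf(value) === index);
}

// One document per product; inactive variants are left out of the index.
function toSearchDocument(product: ProductInput): SearchDocument {
    const active = product.variants.filter(v => v.active);
    const prices = active.map(v => v.price);
    return {
        id: product.id,
        name: product.name,
        brand: product.brand,
        categories: product.categories,
        skus: active.map(v => v.sku),
        sizes: unique(active.map(v => v.size)),
        colors: unique(active.map(v => v.color)),
        minPrice: prices.length > 0 ? Math.min(...prices) : null,
        maxPrice: prices.length > 0 ? Math.max(...prices) : null,
        inStock: active.some(v => v.stock > 0),
    };
}
```

Indexing one document per product keeps result lists free of near-duplicate variants. Indexing one document per variant is an equally valid design when users search by exact size or color.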

Data Quality

Automated Validation

interface ValidationRule {
    field: string;
    check: (value: any, product: Product) => boolean;
    message: string;
    severity: 'error' | 'warning';
}

const VALIDATION_RULES: ValidationRule[] = [
    { field: 'name', check: (v) => typeof v === 'string' && v.length >= 3, message: 'Name must be at least 3 characters', severity: 'error' },
    { field: 'description', check: (v) => typeof v === 'string' && v.length >= 50, message: 'Description should be at least 50 characters', severity: 'warning' },
    { field: 'images', check: (v) => Array.isArray(v) && v.length > 0, message: 'At least one image required', severity: 'error' },
    { field: 'price', check: (v) => typeof v === 'number' && v > 0, message: 'Price must be positive', severity: 'error' },
    // EAN is optional; when present it must pass isValidEan (assumed to be defined elsewhere).
    { field: 'ean', check: (v) => !v || isValidEan(v), message: 'Invalid EAN format', severity: 'error' },
    { field: 'categories', check: (v) => Array.isArray(v) && v.length > 0, message: 'At least one category required', severity: 'error' },
];
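The rules reference `isValidEan` without showing it. A minimal sketch for 13-digit codes, using the standard EAN-13 check digit (weights alternating 1 and 3 over the first 12 digits), could look like this. Treating only EAN-13 as valid is an assumption; EAN-8 and other GTIN lengths are not handled.

```typescript
// Validates the format (exactly 13 digits) and the EAN-13 check digit.
function isValidEan(code: string): boolean {
    if (!/^\d{13}$/.test(code)) return false;
    const digits = code.split('').map(Number);
    // Weights alternate 1, 3, 1, 3, ... across the first 12 digits.
    const sum = digits
        .slice(0, 12)
        .reduce((acc, digit, index) => acc + digit * (index % 2 === 0 ? 1 : 3), 0);
    const checkDigit = (10 - (sum % 10)) % 10;
    return checkDigit === digits[12];
}
```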

Run validation before publication. Block publication on errors. Show warnings but allow publication. Track data quality scores per product, per category, per supplier.
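A sketch of how such rules might be run and scored follows. It uses a simplified `Rule` type so it stands alone, and the scoring formula (unweighted share of rules passed) is an assumption; a production score would probably weight errors more heavily than warnings.

```typescript
type Severity = 'error' | 'warning';

interface Rule {
    field: string;
    check: (value: unknown) => boolean;
    message: string;
    severity: Severity;
}

interface ValidationReport {
    errors: string[];
    warnings: string[];
    canPublish: boolean; // false as soon as one error-severity rule fails
    score: number;       // share of rules passed, 0 to 100
}

function validateProduct(product: Record<string, unknown>, rules: Rule[]): ValidationReport {
    const errors: string[] = [];
    const warnings: string[] = [];
    for (const rule of rules) {
        if (rule.check(product[rule.field])) continue;
        const target = rule.severity === 'error' ? errors : warnings;
        target.push(`${rule.field}: ${rule.message}`);
    }
    const failed = errors.length + warnings.length;
    const score = rules.length === 0 ? 100 : Math.round(((rules.length - failed) / rules.length) * 100);
    return { errors, warnings, canPublish: errors.length === 0, score };
}

// Average score for a group of reports, e.g. all products of one category or supplier.
function averageScore(reports: ValidationReport[]): number {
    if (reports.length === 0) return 0;
    return Math.round(reports.reduce((sum, report) => sum + report.score, 0) / reports.length);
}
```

Grouping reports by category or supplier before calling `averageScore` gives the per-category and per-supplier quality scores mentioned above.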

Common Pitfalls

  1. No single source of truth. If the same product data lives in the ERP, the PIM, and the commerce system with no clear master, conflicts are inevitable.

  2. ERP as the PIM. ERPs store operational data (SKU, price, stock). They don't handle rich content (descriptions, images, translations). Don't try to make them.

  3. No field ownership. Without clear rules about which source owns which field, imports overwrite manual enrichment. See our Pimcore workflow guide for the field ownership pattern.

  4. Same format for all channels. Each channel needs different data. A Google Merchant feed and a website API have different fields, formats, and update frequencies.

  5. No validation before publication. Products go live with missing images, empty descriptions, or invalid EANs. Automated validation prevents this.

  6. Ignoring variant complexity. Auto-generating all possible combinations creates thousands of phantom variants. Only create real, available combinations.

Key Takeaways

  • The pipeline is the architecture. Source to master to channel. Each stage has different responsibilities, different data formats, and different quality requirements.

  • The PIM is the single source of truth. Not the ERP, not the commerce system, not the spreadsheet. One authoritative master with versioning and workflow.

  • Classification stores handle attribute diversity. When different product types have different attributes, dynamic key-value with groups scales better than fixed columns.

  • Each channel gets its own transformation. Website, search, marketplace, print, and partner API all need different formats from the same master data.

  • Data quality is a system, not a step. Automated validation, quality scores, blocking rules on publication. Continuous, not one-time.

We design product data systems as part of our ecommerce and data engineering practice. If you need help with PIM architecture or product data pipelines, talk to our team or request a quote.

Topics covered

PIM architecture, product data management, Pimcore vs Akeneo, product data pipeline, MDM, product information management, product data quality
