Designing AI Workflows That Actually Work in Production
A hands-on guide to building robust AI pipelines. Learn pipeline architecture, step sequencing, branching logic, error handling, and prompt chaining from engineers who've shipped these systems.
Why Most AI Workflows Fail in Production
Here's something we've learned the hard way: getting an AI to do something impressive in a demo is easy. Getting it to do that same thing reliably, thousands of times a day, with real data and real edge cases? That's where most teams hit a wall.
We've built AI workflows for document processing, customer support automation, content generation, and data analysis. Along the way, we've made every mistake you can imagine. This guide is what we wish someone had told us before we started.
The difference between a prototype and a production system isn't the model you use. It's everything around the model.
Let me walk you through how we actually design AI workflows now, after learning these lessons.
The Anatomy of a Production AI Pipeline
Think of an AI workflow as a series of processing stages, each with a specific job. Here's the basic structure we use:
Input → Validation → Preprocessing → AI Processing → Post-processing → Validation → Output
Beneath each stage sit its supporting pieces: an error handler behind input validation, transform-and-enrich logic behind preprocessing, the LLM calls and tool use behind AI processing, parsing and formatting behind post-processing, and a quality check behind the final validation.
Each stage has clear inputs, outputs, and failure modes. Let's break them down.
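To make that concrete, here's a minimal sketch of the top-level wiring, with each stage as a placeholder function standing in for the implementations discussed below:
const runPipeline = async (rawInput) => {
  // Each stage takes the previous stage's output and can throw a stage-specific error
  const validated = await validateInput(rawInput);
  const prepared = await preprocess(validated);
  const aiResult = await runAI(prepared);
  const output = await postprocess(aiResult);
  return await validateOutput(output);
};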
Stage 1: Input Validation and Normalization
Never trust incoming data. Ever. We learned this when a customer sent us a "text document" that was actually a 50MB binary file. The pipeline choked, the queue backed up, and we spent a weekend fixing it.
const validateInput = async (input) => {
const checks = {
exists: input !== null && input !== undefined,
sizeOk: Buffer.byteLength(input, 'utf8') < MAX_INPUT_SIZE,
formatOk: isValidFormat(input),
contentOk: !containsMaliciousPatterns(input)
};
const failures = Object.entries(checks)
.filter(([_, passed]) => !passed)
.map(([check]) => check);
if (failures.length > 0) {
throw new ValidationError(`Input failed: ${failures.join(', ')}`);
}
return normalizeInput(input);
};
What to check:
- File size limits (set them lower than you think you need)
- Format validation (is it actually JSON, not just named .json?)
- Character encoding (UTF-8 issues will haunt you)
- Content safety (don't send malicious content to your LLM; see the sketch below for the format and encoding checks)
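The format and encoding checks are the ones teams most often skip. Here's a rough sketch of what they might look like for a pipeline that expects JSON text; both helpers are our own stand-ins, not a library API:
const isValidFormat = (input) => {
  // For a pipeline that expects JSON: actually parse it instead of trusting the file extension
  try {
    JSON.parse(input);
    return true;
  } catch {
    return false;
  }
};

const looksLikeValidUtf8 = (buf) => {
  // Heuristic: invalid UTF-8 bytes decode to the replacement character (U+FFFD).
  // A legitimate document could already contain U+FFFD, so treat this as a signal, not proof.
  return !buf.toString('utf8').includes('\uFFFD');
};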
Stage 2: Preprocessing and Enrichment
Raw input rarely goes straight to an LLM. You usually need to transform it, add context, or break it into chunks.
| Preprocessing Task | When to Use | Example |
|---|---|---|
| Chunking | Long documents exceeding context limits | Split a 100-page PDF into 2000-token chunks |
| Enrichment | Need additional context | Add customer history before processing support ticket |
| Extraction | Only parts of input are relevant | Pull just the "description" field from a JSON payload |
| Transformation | Format conversion needed | Convert HTML to markdown for cleaner processing |
| Deduplication | Repeated content wastes tokens | Remove duplicate paragraphs from scraped content |
Here's a chunking strategy we actually use:
const chunkDocument = (text, options = {}) => {
const {
maxTokens = 2000,
overlap = 200,
preserveParagraphs = true
} = options;
const chunks = [];
let currentChunk = '';
const paragraphs = text.split(/\n\n+/);
for (const para of paragraphs) {
const combined = currentChunk + '\n\n' + para;
if (estimateTokens(combined) > maxTokens) {
if (currentChunk) {
chunks.push(currentChunk.trim());
// Keep overlap for context continuity
currentChunk = getLastNTokens(currentChunk, overlap) + '\n\n' + para;
} else {
// Single paragraph too long, force split
chunks.push(...forceSplitParagraph(para, maxTokens));
currentChunk = '';
}
} else {
currentChunk = combined;
}
}
if (currentChunk.trim()) {
chunks.push(currentChunk.trim());
}
return chunks;
};
The overlap is crucial. Without it, information that spans chunk boundaries gets lost. We learned this when our document summarizer kept missing key points that happened to fall between chunks.
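chunkDocument leans on a few helpers we didn't show (estimateTokens, getLastNTokens, forceSplitParagraph). Here's a rough sketch of the first two, using the common ~4-characters-per-token heuristic; swap in your model's real tokenizer when counts need to be accurate:
// Rough token estimate: ~4 characters per token for English text.
// Good enough for chunk sizing; use the model's tokenizer when precision matters.
const estimateTokens = (text) => Math.ceil(text.length / 4);

// Return roughly the last N tokens' worth of text, cut at a word boundary.
const getLastNTokens = (text, n) => {
  const approxChars = n * 4;
  if (text.length <= approxChars) return text;
  const tail = text.slice(-approxChars);
  const firstSpace = tail.indexOf(' ');
  return firstSpace === -1 ? tail : tail.slice(firstSpace + 1);
};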
Designing the AI Processing Core
This is where the actual LLM calls happen. But a single call is rarely enough for complex tasks.
The Prompt Chain Pattern
Instead of one giant prompt trying to do everything, break complex tasks into focused steps:
[Understand] → [Plan] → [Execute] → [Verify] → [Format]
Example: Processing a customer complaint
const processComplaint = async (complaint) => {
// Step 1: Understand - Extract key information
const analysis = await llm.call({
system: `Extract structured information from customer complaints.
Return JSON with: issue_type, urgency, customer_emotion, key_details`,
user: complaint
});
// Step 2: Plan - Determine response strategy
const strategy = await llm.call({
system: `Based on this complaint analysis, determine the response strategy.
Consider: resolution options, escalation needs, compensation eligibility`,
user: JSON.stringify(analysis)
});
// Step 3: Execute - Generate response
const response = await llm.call({
system: `Write a customer response following this strategy.
Tone: empathetic, professional. Length: 2-3 paragraphs.`,
user: `Strategy: ${JSON.stringify(strategy)}\nOriginal complaint: ${complaint}`
});
// Step 4: Verify - Quality check (assumes llm.call returns parsed JSON when the prompt requests it)
const verification = await llm.call({
system: `Review this customer response. Check for:
- Addresses all concerns?
- Appropriate tone?
- Actionable next steps?
Return: {approved: boolean, issues: string[]}`,
user: response
});
if (!verification.approved) {
return await regenerateWithFeedback(response, verification.issues);
}
return response;
};
Each step is simpler, more testable, and easier to debug. When something goes wrong, you know exactly which stage failed.
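The regenerateWithFeedback call in step 4 does real work: it feeds the reviewer's objections back into a second generation pass. A minimal sketch, using the same placeholder llm.call interface as above:
const regenerateWithFeedback = async (draft, issues) => {
  // One revision pass: show the model its own draft plus the reviewer's objections
  return await llm.call({
    system: `Revise this customer response to fix the listed issues.
    Keep the tone empathetic and professional.`,
    user: `Draft:\n${draft}\n\nIssues to fix:\n- ${issues.join('\n- ')}`
  });
};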
When to Chain vs. When to Parallelize
Not everything needs to be sequential. Here's how we decide:
| Pattern | Use When | Example |
|---|---|---|
| Sequential Chain | Each step depends on previous output | Understanding → Planning → Execution |
| Parallel Execution | Steps are independent | Analyzing multiple documents simultaneously |
| Map-Reduce | Need to process items then aggregate | Summarize each chunk, then combine summaries |
| Conditional Branching | Different paths for different inputs | Simple queries vs. complex analysis |
Parallel execution example:
const analyzeDocuments = async (documents) => {
// Process all documents in parallel
const analyses = await Promise.all(
documents.map(doc => analyzeDocument(doc))
);
// Reduce: combine into final report
const combinedReport = await llm.call({
system: 'Synthesize these individual analyses into a cohesive report',
user: analyses.map((a, i) => `Document ${i + 1}:\n${a}`).join('\n\n---\n\n')
});
return combinedReport;
};
Branching Logic: Routing to the Right Handler
Real-world inputs vary wildly. A "customer message" might be a complaint, a question, a compliment, or spam. Each needs different handling.
Classification-Based Routing
const routeCustomerMessage = async (message) => {
const classification = await llm.call({
system: `Classify this customer message into exactly one category:
- complaint: Customer expressing dissatisfaction
- question: Customer seeking information
- feedback: General feedback or suggestions
- urgent: Safety issues, legal threats, executive escalation
- spam: Irrelevant or automated messages
Return only the category name.`,
user: message
});
const handlers = {
complaint: handleComplaint,
question: handleQuestion,
feedback: handleFeedback,
urgent: escalateToHuman,
spam: markAsSpam
};
const handler = handlers[classification.toLowerCase()] || handleUnknown;
return await handler(message);
};
Confidence-Based Routing
Sometimes the AI isn't sure. Build that uncertainty into your routing:
const routeWithConfidence = async (input) => {
const result = await llm.call({
system: `Analyze and route this request. Return JSON:
{
"category": "string",
"confidence": 0.0-1.0,
"reasoning": "why this category"
}`,
user: input
});
if (result.confidence < 0.7) {
// Low confidence - get human input or use fallback
return await handleLowConfidence(input, result);
}
if (result.confidence < 0.9 && isHighStakes(result.category)) {
// Medium confidence on important decisions - verify
return await verifyThenRoute(input, result);
}
// High confidence - proceed automatically
return await routeToHandler(result.category, input);
};
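isHighStakes and handleLowConfidence are our own names for pieces you'll need to define for your domain. One plausible sketch, where the category names are illustrative and queueForHumanReview is a stand-in for whatever review queue you already have:
// Categories where a wrong automatic decision is expensive (examples, not a fixed list)
const HIGH_STAKES = new Set(['refund', 'legal', 'cancellation']);
const isHighStakes = (category) => HIGH_STAKES.has(category);

const handleLowConfidence = async (input, result) => {
  // Don't guess: park the item for a human and record why the model was unsure
  await queueForHumanReview({
    input,
    suggestedCategory: result.category,
    reasoning: result.reasoning
  });
  return { status: 'pending_human_review', suggestion: result.category };
};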
Error Handling: Because Things Will Break
AI workflows fail in ways traditional software doesn't. The LLM might return invalid JSON, hallucinate information, or just... not follow instructions. Plan for it.
The Error Hierarchy
class AIWorkflowError extends Error {
constructor(message, stage, recoverable = true) {
super(message);
this.stage = stage;
this.recoverable = recoverable;
}
}
class ValidationError extends AIWorkflowError {
constructor(message) {
super(message, 'validation', true);
}
}
class LLMError extends AIWorkflowError {
constructor(message, type) {
super(message, 'llm_call', type !== 'rate_limit');
this.type = type; // 'timeout', 'rate_limit', 'invalid_response', 'refused'
}
}
class OutputError extends AIWorkflowError {
constructor(message) {
super(message, 'output', true);
}
}
Retry Strategies That Work
Different failures need different retry approaches:
| Error Type | Retry Strategy | Max Retries | Backoff |
|---|---|---|---|
| Rate limit | Wait and retry | 5 | Exponential with jitter |
| Timeout | Retry immediately | 3 | Linear |
| Invalid response | Retry with feedback | 2 | None |
| Model refused | Rephrase and retry | 2 | None |
| Server error | Wait and retry | 3 | Exponential |
const withRetry = async (operation, options = {}) => {
const {
maxRetries = 3,
backoffMs = 1000,
backoffMultiplier = 2,
shouldRetry = () => true
} = options;
let lastError;
for (let attempt = 0; attempt <= maxRetries; attempt++) {
try {
return await operation();
} catch (error) {
lastError = error;
if (!shouldRetry(error) || attempt === maxRetries) {
throw error;
}
const delay = backoffMs * Math.pow(backoffMultiplier, attempt);
const jitter = Math.random() * delay * 0.1;
await sleep(delay + jitter);
}
}
throw lastError;
};
// Usage with LLM calls
const safeLLMCall = async (params) => {
return await withRetry(
() => llm.call(params),
{
maxRetries: 3,
shouldRetry: (error) => {
if (error.type === 'rate_limit') return true;
if (error.type === 'timeout') return true;
if (error.type === 'server_error') return true;
return false;
}
}
);
};
Self-Healing for Invalid Responses
When the LLM returns garbage, sometimes you can fix it:
const parseWithRecovery = async (llmResponse, expectedSchema) => {
// First attempt: direct parse
try {
const parsed = JSON.parse(llmResponse);
if (validateSchema(parsed, expectedSchema)) {
return parsed;
}
} catch (e) {
// JSON parse failed, continue to recovery
}
// Recovery attempt: ask LLM to fix its output
const fixed = await llm.call({
system: `The following response should be valid JSON matching this schema:
${JSON.stringify(expectedSchema)}
Fix the response to be valid JSON. Return ONLY the fixed JSON.`,
user: llmResponse
});
try {
const parsed = JSON.parse(fixed);
if (validateSchema(parsed, expectedSchema)) {
return parsed;
}
} catch (e) {
// Still broken
}
// Final fallback: extract what we can
return extractPartialData(llmResponse, expectedSchema);
};
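validateSchema and extractPartialData do quiet but important work here. In production we'd reach for a schema library (Ajv, Zod), but here's a hand-rolled sketch that treats the expected schema as a simple key-to-type map and salvages whatever fields it can:
// Minimal stand-in for a real validator: schema is assumed to look like
// { issue_type: 'string', urgency: 'string', key_details: 'object' }
const validateSchema = (obj, schema) =>
  obj !== null && typeof obj === 'object' &&
  Object.entries(schema).every(([key, type]) => typeof obj[key] === type);

// Last-resort salvage: grab anything that looks like a JSON object and
// keep only the fields the schema knows about
const extractPartialData = (raw, schema) => {
  const match = raw.match(/\{[\s\S]*\}/);
  if (!match) return {};
  try {
    const loose = JSON.parse(match[0]);
    return Object.fromEntries(
      Object.keys(schema).filter((k) => k in loose).map((k) => [k, loose[k]])
    );
  } catch {
    return {};
  }
};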
Monitoring and Observability
You can't fix what you can't see. Here's what we track:
Key Metrics
| Metric | Why It Matters | Alert Threshold |
|---|---|---|
| Latency (p50, p95, p99) | User experience, timeout risk | p95 > 10s |
| Success rate | Overall health | < 95% |
| Token usage | Cost control | > 150% of baseline |
| Retry rate | Hidden instability | > 10% |
| Classification distribution | Detect drift | Significant shift from baseline |
| Output quality scores | Catch degradation | Average < 0.8 |
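Most of these metrics fall out of one small recording call that every workflow run makes. A sketch of the shape, where metrics.emit is a placeholder for your metrics client (StatsD, Prometheus, or similar), not a real API:
const recordRunMetrics = ({ latencyMs, success, tokensUsed, retries, category }) => {
  // One call per workflow run; percentiles and baselines are the backend's job
  metrics.emit('workflow.latency_ms', latencyMs);
  metrics.emit('workflow.success', success ? 1 : 0);
  metrics.emit('workflow.tokens_used', tokensUsed);
  metrics.emit('workflow.retries', retries);
  metrics.emit(`workflow.category.${category}`, 1); // watch this distribution for drift
};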
Structured Logging
Every workflow run should produce traceable logs:
const runWorkflow = async (input, context) => {
const runId = generateRunId();
const startTime = Date.now();
const log = (stage, data) => {
logger.info({
runId,
stage,
timestamp: Date.now(),
elapsed: Date.now() - startTime,
...data
});
};
try {
log('start', { inputSize: input.length });
const validated = await validate(input);
log('validated', { valid: true });
const processed = await process(validated);
log('processed', {
tokensUsed: processed.usage.total,
model: processed.model
});
const output = await format(processed);
log('complete', {
success: true,
outputSize: output.length,
totalTime: Date.now() - startTime
});
return output;
} catch (error) {
log('error', {
error: error.message,
stage: error.stage,
recoverable: error.recoverable
});
throw error;
}
};
Putting It All Together: A Complete Example
Let's build a document analysis workflow that uses everything we've discussed:
class DocumentAnalysisPipeline {
constructor(options = {}) {
this.maxChunkSize = options.maxChunkSize || 3000;
this.concurrency = options.concurrency || 5;
}
async run(document) {
// Stage 1: Validate
const validated = await this.validate(document);
// Stage 2: Preprocess
const chunks = await this.preprocess(validated);
// Stage 3: Parallel analysis with concurrency control
const analyses = await this.analyzeChunks(chunks);
// Stage 4: Synthesize
const synthesis = await this.synthesize(analyses);
// Stage 5: Quality check
const final = await this.qualityCheck(synthesis, document);
return final;
}
async validate(document) {
if (!document || typeof document !== 'string') {
throw new ValidationError('Document must be a non-empty string');
}
if (document.length > 1000000) {
throw new ValidationError('Document exceeds maximum size');
}
return document;
}
async preprocess(document) {
return chunkDocument(document, {
maxTokens: this.maxChunkSize,
overlap: 200
});
}
async analyzeChunks(chunks) {
const results = [];
// Process in batches to control concurrency
for (let i = 0; i < chunks.length; i += this.concurrency) {
const batch = chunks.slice(i, i + this.concurrency);
const batchResults = await Promise.all(
batch.map((chunk, idx) => this.analyzeChunk(chunk, i + idx))
);
results.push(...batchResults);
}
return results;
}
async analyzeChunk(chunk, index) {
return await withRetry(async () => {
const result = await llm.call({
system: `Analyze this document section. Extract:
- Key topics and themes
- Important facts and figures
- Notable quotes or statements
- Questions or gaps in information
Return structured JSON.`,
user: `Section ${index + 1}:\n\n${chunk}`
});
return parseWithRecovery(result, ANALYSIS_SCHEMA);
});
}
async synthesize(analyses) {
const combined = analyses.map((a, i) =>
`Section ${i + 1}:\n${JSON.stringify(a, null, 2)}`
).join('\n\n---\n\n');
return await llm.call({
system: `Synthesize these section analyses into a comprehensive document summary.
Structure:
1. Executive Summary (2-3 sentences)
2. Key Findings (bullet points)
3. Important Details
4. Gaps or Questions
5. Recommendations`,
user: combined
});
}
async qualityCheck(synthesis, originalDocument) {
const check = await llm.call({
system: `Review this analysis for quality. Check:
- Accuracy: Does it reflect the original document?
- Completeness: Are major points covered?
- Clarity: Is it well-organized and clear?
Return: {score: 0-1, issues: string[], approved: boolean}`,
user: `Analysis:\n${synthesis}\n\nOriginal (first 2000 chars):\n${originalDocument.slice(0, 2000)}`
});
if (!check.approved) {
// Log for review but don't fail
logger.warn({ issues: check.issues, score: check.score });
}
return {
analysis: synthesis,
qualityScore: check.score,
qualityIssues: check.issues
};
}
}
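Using the pipeline then looks something like this (reportText and the logger are assumed to exist elsewhere, and the call needs to sit in an async context):
// Hypothetical usage: smaller chunks, tighter concurrency for a long report
const pipeline = new DocumentAnalysisPipeline({ maxChunkSize: 2000, concurrency: 3 });
const { analysis, qualityScore, qualityIssues } = await pipeline.run(reportText);
if (qualityScore < 0.8) {
  logger.warn('Analysis came back below the quality bar', { qualityIssues });
}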
Common Pitfalls and How to Avoid Them
After building dozens of these systems, here are the mistakes we see most often:
1. No Timeout Boundaries
Every LLM call needs a timeout. Set them aggressively.
// Bad: No timeout
const result = await llm.call(params);
// Good: Explicit timeout
const result = await Promise.race([
llm.call(params),
new Promise((_, reject) =>
setTimeout(() => reject(new Error('Timeout')), 30000)
)
]);
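One subtlety with the bare Promise.race version: the 30-second timer keeps running even after the call succeeds, which wastes resources and can keep a Node process alive longer than you'd expect. We prefer a small helper that cleans up after itself; a sketch:
const withTimeout = async (promise, ms, label = 'operation') => {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    clearTimeout(timer); // always clear the timer, whether the call won or lost
  }
};

// Usage
const reply = await withTimeout(llm.call(params), 30000, 'llm.call');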
2. Ignoring Token Limits
Track usage and set budgets:
const tokenBudget = {
max: 10000,
used: 0,
canSpend(amount) {
return this.used + amount <= this.max;
},
spend(amount) {
this.used += amount;
if (this.used > this.max * 0.8) {
logger.warn('Token budget 80% consumed');
}
}
};
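In the workflow, the budget gets checked before each call and charged after it. A sketch of how that threads through, assuming your LLM client reports usage the way the logging example above does:
const budgetedCall = async (params, budget, estimatedTokens) => {
  if (!budget.canSpend(estimatedTokens)) {
    // Treat a blown budget as a non-recoverable workflow error
    throw new AIWorkflowError('Token budget exhausted', 'llm_call', false);
  }
  const result = await llm.call(params);
  // Charge actual usage when the client reports it, otherwise fall back to the estimate
  budget.spend(result.usage?.total ?? estimatedTokens);
  return result;
};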
3. No Fallback for Critical Paths
Always have a plan B:
const processWithFallback = async (input) => {
try {
return await primaryProcess(input);
} catch (error) {
if (error.recoverable) {
return await simplifiedProcess(input);
}
// Critical path - queue for manual processing
await queueForManualReview(input, error);
return { status: 'queued_for_review' };
}
};
What's Next
AI workflows are getting more sophisticated. Here's where we see things heading:
- Better tool use: Models are getting better at deciding when and how to use external tools
- Longer context: Bigger context windows mean fewer chunking headaches
- Faster inference: Latency is dropping, enabling more complex real-time workflows
- Specialized models: Fine-tuned models for specific tasks outperform general-purpose ones
But the fundamentals don't change. Validate inputs, handle errors gracefully, monitor everything, and always have a fallback. Build on those principles, and your AI workflows will survive contact with the real world.
If you're building AI workflows and hitting walls, we've probably seen your problem before. Reach out, and let's figure it out together.