RAG Is Not Enough: What Reliable AI Systems Need on Top
Where RAG breaks in production and what to build on top. Chunk quality, orchestration layers, hybrid search, hallucination boundaries, cost management, and when to skip RAG entirely.
The RAG Demo Trap
Every RAG demo works. Upload some PDFs, chunk them, embed them, query them, get an answer. The demo is always impressive. The stakeholders are excited. The team estimates two weeks to production.
Six months later, the system is still unreliable. Users get wrong answers from outdated chunks. The model confidently cites documents that say the opposite of what the answer claims. Costs are 10x the original estimate. And nobody can figure out why the same question produces different answers depending on the time of day.
RAG is not a solution. RAG is a retrieval pattern. A reliable AI system needs an orchestration layer, quality controls, hybrid search, hallucination boundaries, cost management, and monitoring on top of RAG. This article covers what we've learned building production RAG systems.
Our guides on enterprise RAG architecture and vector search cover the foundational patterns. This article focuses on where those patterns break and what you need to build beyond them.
Where RAG Breaks
| Failure Mode | What Happens | How Common |
|---|---|---|
| Chunk quality | Wrong chunk boundaries split context, answer is based on partial information | Very common |
| Stale data | Index not updated, answer is based on outdated document | Common |
| Retrieval miss | Relevant document exists but embedding similarity doesn't surface it | Common |
| Hallucination despite retrieval | Model ignores retrieved context and generates from training data | Common |
| Context window overflow | Too many chunks retrieved, model loses focus | Moderate |
| Cross-document confusion | Chunks from different documents mixed, model blends contradictory facts | Moderate |
| Cost explosion | Embedding + retrieval + generation costs scale with query volume | Gradual |
| Latency spikes | Vector search + reranking + generation takes too long for interactive use | Moderate |
Chunk Quality Is Everything
The most underestimated problem. If your chunks split a paragraph in the middle of a thought, the retrieved context is incomplete. If your chunks are too large, irrelevant content dilutes the useful information. If your chunks don't preserve document structure (headings, tables, lists), the model loses the organizational context.
// Bad: fixed-size chunks break context
function naiveChunk(text: string, size: number): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size));
  }
  // Problem: splits sentences, paragraphs, tables mid-content
  return chunks;
}
// Better: semantic chunking with overlap
function semanticChunk(text: string, options: ChunkOptions): Chunk[] {
  const sections = splitByHeadings(text); // Respect document structure
  const paragraphs = sections.flatMap(s =>
    splitByParagraphs(s, { maxSize: options.maxChunkSize })
  );
  return paragraphs.map((p, i) => ({
    content: p.text,
    metadata: {
      section: p.sectionTitle,
      pageNumber: p.pageNumber,
      documentId: p.documentId,
      position: i,
    },
    // Overlap: include last 2 sentences from previous chunk
    prefix: i > 0 ? getLastSentences(paragraphs[i - 1].text, 2) : '',
  }));
}
The overlap matters. Without it, a question that spans two chunks gets partial context from each and a complete answer from neither. With 2-3 sentence overlap, the model has enough context to bridge chunk boundaries.
Chunk metadata is equally critical. Every chunk must carry its source document ID, section title, page number, and position. Without metadata, you can't tell the user where the answer came from. Without source attribution, the answer is unverifiable.
Retrieval Quality vs Retrieval Quantity
Retrieving more chunks doesn't mean better answers. In practice, we've found that 3-5 high-quality chunks consistently outperform 10-15 mediocre chunks.
| Chunks Retrieved | Answer Quality | Latency | Cost |
|---|---|---|---|
| 1-2 | Risk of missing context | Fast | Low |
| 3-5 | Best balance (recommended) | Moderate | Moderate |
| 5-10 | Diminishing returns, some noise | Slower | Higher |
| 10+ | Context dilution, model confused | Slow | High |
The solution: retrieve broadly, then rerank aggressively.
async function retrieveAndRerank(query: string, options: RetrievalOptions) {
  // Step 1: Broad retrieval (get 20 candidates)
  const candidates = await vectorStore.search(query, { limit: 20 });
  // Step 2: Rerank with a cross-encoder (score each candidate against the query)
  const reranked = await reranker.rank(query, candidates, {
    model: 'cross-encoder/ms-marco-MiniLM-L-12-v2',
  });
  // Step 3: Take top 5 after reranking
  const topChunks = reranked.slice(0, 5);
  // Step 4: Filter by minimum relevance score
  return topChunks.filter(c => c.score > options.minRelevanceScore);
}
The reranker is a cross-encoder model that scores each candidate against the query with much higher accuracy than cosine similarity on embeddings. It's slower (runs inference per candidate), but the quality improvement is substantial. Running it on 20 candidates to select 5 adds 100-200ms of latency, which is acceptable for most use cases.
The Orchestration Layer RAG Needs
Raw RAG is: embed query, search vectors, stuff context into prompt, generate. A production system needs an orchestration layer between retrieval and generation.
User Query
     │
     ▼
┌──────────────────┐
│  Query Analysis  │  Classify intent, extract entities, detect language
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│     Routing      │  Which index? Which retrieval strategy? Cache hit?
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│    Retrieval     │  Vector search + keyword search (hybrid)
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│    Reranking     │  Cross-encoder scoring, filter low-relevance chunks
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Context Assembly │  Order chunks, add metadata, respect token budget
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│    Generation    │  LLM call with assembled context + system prompt
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Output Validation│  Check for hallucination, verify citations, PII scan
└────────┬─────────┘
         │
         ▼
      Response
Query Analysis
Not every query needs RAG. Some queries are conversational ("hello", "thanks"). Some are about the system itself ("how do I use this tool?"). Some are ambiguous and need clarification. The query analyzer classifies intent before triggering retrieval.
async function analyzeQuery(query: string): Promise<QueryAnalysis> {
  // Fast classification (can be a small model or rule-based)
  const intent = await classifyIntent(query);
  if (intent === 'greeting' || intent === 'meta') {
    return { needsRetrieval: false, intent, response: getStaticResponse(intent) };
  }
  if (intent === 'ambiguous') {
    return { needsRetrieval: false, intent, clarificationNeeded: true };
  }
  return {
    needsRetrieval: true,
    intent,
    extractedEntities: await extractEntities(query),
    detectedLanguage: await detectLanguage(query),
  };
}
Routing
Different queries might need different indices, different retrieval strategies, or different models.
| Query Type | Index | Strategy | Model |
|---|---|---|---|
| Product question | Products index | Hybrid (text + vector) | Fast model (GPT-4o-mini) |
| Legal/compliance question | Policies index | Vector only (precise) | Accurate model (GPT-4o) |
| Technical support | Knowledge base index | Hybrid + rerank | Fast model |
| Multi-language query | Multilingual index | Vector with language filter | Multilingual model |
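A routing layer can be as simple as a lookup from classified intent to a retrieval configuration. A minimal sketch — the intent labels, index names, and model tiers below are illustrative assumptions, not a real API:

```typescript
// Route: which index, retrieval strategy, and model tier to use per intent.
// All names here are hypothetical placeholders.
type Route = { index: string; strategy: "vector" | "hybrid"; model: string };

const ROUTES: Record<string, Route> = {
  product: { index: "products", strategy: "hybrid", model: "fast" },
  legal:   { index: "policies", strategy: "vector", model: "accurate" },
  support: { index: "kb",       strategy: "hybrid", model: "fast" },
};

// Unrecognized intents fall back to the general knowledge base.
const DEFAULT_ROUTE: Route = { index: "kb", strategy: "hybrid", model: "fast" };

function routeQuery(intent: string): Route {
  return ROUTES[intent] ?? DEFAULT_ROUTE;
}
```

Keeping the table in data rather than branching logic makes it easy to add routes without touching the pipeline code.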
Context Assembly
After retrieval and reranking, chunks must be assembled into a prompt that respects the model's token budget.
function assembleContext(chunks: RankedChunk[], tokenBudget: number): string {
  let context = '';
  let tokensUsed = 0;
  for (const chunk of chunks) {
    const chunkTokens = estimateTokens(chunk.content);
    if (tokensUsed + chunkTokens > tokenBudget) break;
    context += `\n\n---\nSource: ${chunk.metadata.documentTitle} (${chunk.metadata.section})\n`;
    context += chunk.content;
    tokensUsed += chunkTokens;
  }
  return context;
}
The token budget must account for the system prompt, the user query, the assembled context, AND the expected response length. A common mistake is filling the entire context window with retrieved chunks, leaving no room for a quality response.
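The budget arithmetic is simple but worth making explicit. A sketch, assuming all token counts come from your tokenizer of choice:

```typescript
// Derive the chunk budget from the model's total window by subtracting
// everything else that must fit. The parameter names are illustrative.
function contextBudget(
  modelWindow: number,        // total tokens the model accepts
  systemPromptTokens: number, // fixed instruction overhead
  queryTokens: number,        // the user's question
  maxResponseTokens: number,  // room reserved for the answer
): number {
  const budget =
    modelWindow - systemPromptTokens - queryTokens - maxResponseTokens;
  return Math.max(0, budget); // never return a negative budget
}
```

If this returns 0, the fix is a shorter system prompt or a larger window, not dropping the response reservation.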
Hybrid Search: Text + Vector
Pure vector search misses keyword-specific queries. A user searching for "error code E-4021" will get poor results from embedding similarity because error codes are not semantically meaningful. Pure text search misses semantic queries. A user searching for "how to fix login problems" won't find a document titled "Authentication Troubleshooting Guide."
Hybrid search combines both:
async function hybridSearch(query: string, options: SearchOptions) {
  // Parallel execution
  const [vectorResults, textResults] = await Promise.all([
    vectorStore.search(query, { limit: options.vectorLimit }),
    textIndex.search(query, { limit: options.textLimit }),
  ]);
  // Reciprocal Rank Fusion (RRF) to merge results
  const merged = reciprocalRankFusion(vectorResults, textResults, {
    vectorWeight: 0.6,
    textWeight: 0.4,
  });
  return merged.slice(0, options.totalLimit);
}

function reciprocalRankFusion(
  vectorResults: SearchResult[],
  textResults: SearchResult[],
  weights: { vectorWeight: number; textWeight: number },
): SearchResult[] {
  const scores = new Map<string, number>();
  const k = 60; // RRF constant
  vectorResults.forEach((result, rank) => {
    const score = (scores.get(result.id) || 0) + weights.vectorWeight / (k + rank + 1);
    scores.set(result.id, score);
  });
  textResults.forEach((result, rank) => {
    const score = (scores.get(result.id) || 0) + weights.textWeight / (k + rank + 1);
    scores.set(result.id, score);
  });
  return Array.from(scores.entries())
    .sort(([, a], [, b]) => b - a)
    .map(([id, score]) => ({ id, score }));
}
The weight ratio (vector 0.6, text 0.4) is a starting point. Tune it based on your query distribution. If most queries are keyword-heavy (product SKUs, error codes), increase text weight. If most queries are natural language, increase vector weight.
For more on search architecture in commerce contexts, see our ecommerce platforms guide.
Hallucination Boundaries
RAG reduces hallucination compared to pure LLM generation. It does not eliminate it. The model can still:
- Ignore retrieved context and generate from training data
- Blend information from multiple chunks incorrectly
- Invent citations that don't exist in the retrieved context
- Extrapolate beyond what the context supports
Mitigation Strategies
1. Constrained system prompts:
You are a support assistant. Answer ONLY based on the provided context.
If the context does not contain enough information to answer, say
"I don't have enough information to answer that question."
Do NOT use information from your training data.
Every claim must reference a specific source from the context.
2. Citation verification:
async function verifyCitations(response: string, chunks: RankedChunk[]): Promise<VerificationResult> {
  const citations = extractCitations(response);
  const verified: Citation[] = [];
  const unverified: Citation[] = [];
  for (const citation of citations) {
    const found = chunks.some(chunk =>
      chunk.content.includes(citation.claimedText) ||
      fuzzyMatch(chunk.content, citation.claimedText, 0.85)
    );
    (found ? verified : unverified).push(citation);
  }
  return {
    allVerified: unverified.length === 0,
    verified,
    unverified,
    // Guard against division by zero when the response contains no citations
    confidenceScore: citations.length === 0 ? 0 : verified.length / citations.length,
  };
}
3. Confidence scoring:
If the model's response doesn't align well with the retrieved context (low overlap, no direct quotes), flag it as low confidence. Show a warning to the user or escalate to a human.
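One cheap way to approximate "alignment with the retrieved context" is lexical overlap between the response and the context. This is only a heuristic sketch — production systems typically use an NLI model or cross-encoder for this check instead:

```typescript
// Fraction of substantive response words that also appear in the context.
// Word-overlap is a crude proxy for grounding; thresholds for "low
// confidence" would need tuning on real traffic.
function overlapConfidence(response: string, context: string): number {
  const tokenize = (s: string) =>
    s.toLowerCase().split(/\W+/).filter(w => w.length > 3); // drop stopword-length tokens
  const contextWords = new Set(tokenize(context));
  const responseWords = tokenize(response);
  if (responseWords.length === 0) return 0;
  const hits = responseWords.filter(w => contextWords.has(w)).length;
  return hits / responseWords.length;
}
```

A score near 1 means the response stays close to the retrieved text; a low score is a signal to warn the user or escalate, not proof of hallucination.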
For more on AI failure modes and how to handle them, see our AI failure modes guide.
Cost and Latency
RAG costs scale with query volume across three dimensions:
| Component | Cost Driver | Typical Range |
|---|---|---|
| Embedding query | Per query (model inference) | $0.0001 per query |
| Vector search | Per query (compute + I/O) | $0.0005 per query |
| Reranking | Per query × candidates (model inference) | $0.001 per query |
| LLM generation | Input tokens (context) + output tokens | $0.01-0.10 per query |
| Embedding documents | One-time per document (on ingest) | $0.0001 per page |
LLM generation dominates cost. Reducing context size (fewer chunks, shorter chunks) directly reduces the most expensive component.
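A back-of-envelope cost model makes the trade-off concrete. The structure below is a sketch; the rates are placeholders you'd replace with your provider's actual pricing:

```typescript
// Per-query cost: LLM tokens dominate, everything else is roughly a
// fixed per-query overhead. All prices here are illustrative inputs.
interface CostInputs {
  inputTokens: number;     // context + prompts sent to the LLM
  outputTokens: number;    // generated response
  inputPricePerM: number;  // $ per 1M input tokens
  outputPricePerM: number; // $ per 1M output tokens
  fixedPerQuery: number;   // embedding + search + rerank overhead
}

function queryCost(c: CostInputs): number {
  const llm =
    (c.inputTokens / 1_000_000) * c.inputPricePerM +
    (c.outputTokens / 1_000_000) * c.outputPricePerM;
  return llm + c.fixedPerQuery;
}
```

Plugging in typical numbers shows why trimming two chunks from the context saves more than any optimization of the retrieval side.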
Caching Strategies
// Semantic cache: cache responses for similar queries
async function cachedQuery(query: string): Promise<string | null> {
// Embed the query
const queryEmbedding = await embedder.embed(query);
// Search cache index for similar queries
const cached = await cacheIndex.search(queryEmbedding, {
minSimilarity: 0.95, // High threshold for cache hits
limit: 1,
});
if (cached.length > 0) {
return cached[0].response; // Cache hit
}
return null; // Cache miss, proceed with full RAG pipeline
}
Semantic caching works because many users ask similar questions in slightly different ways. "How do I reset my password?" and "Password reset instructions" are different strings but semantically identical. A 0.95 similarity threshold ensures only near-identical queries get cached responses.
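Under the hood, the cache hit test reduces to a cosine similarity check against stored query embeddings. A minimal in-memory sketch — a real system would use the vector store's index rather than a linear scan:

```typescript
// Cosine similarity between two dense vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the best cached response above the similarity threshold, or null.
function cacheLookup(
  queryEmbedding: number[],
  entries: { embedding: number[]; response: string }[],
  minSimilarity = 0.95,
): string | null {
  let best: { sim: number; response: string } | null = null;
  for (const e of entries) {
    const sim = cosineSimilarity(queryEmbedding, e.embedding);
    if (sim >= minSimilarity && (!best || sim > best.sim)) {
      best = { sim, response: e.response };
    }
  }
  return best ? best.response : null;
}
```

The threshold is the whole game: lower it and users start receiving cached answers to questions they didn't ask.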
Latency Budget
For interactive use (chatbot, support assistant), the total pipeline must complete in under 3 seconds:
| Stage | Budget | Optimization |
|---|---|---|
| Query analysis | 50ms | Rule-based or small model |
| Cache check | 30ms | In-memory vector index |
| Vector search | 100ms | Dedicated search cluster |
| Text search | 100ms | Parallel with vector search |
| Reranking | 200ms | Small cross-encoder, limit candidates |
| Context assembly | 10ms | In-memory |
| LLM generation | 1,500ms | Streaming, fast model |
| Output validation | 100ms | Rule-based + small model |
| Total | ~2,100ms | |
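A budget is only useful if you measure each stage against it. A minimal timing wrapper, assuming stage results are awaited promises (the monitoring sink is left out of this sketch):

```typescript
// Wrap a pipeline stage, recording its wall-clock duration even when it
// throws. `timings` would feed your metrics system in a real deployment.
async function timed<T>(
  stage: string,
  timings: Record<string, number>,
  fn: () => Promise<T>,
): Promise<T> {
  const start = Date.now();
  try {
    return await fn();
  } finally {
    timings[stage] = Date.now() - start;
  }
}
```

Wrapping every stage this way turns "the pipeline is slow" into "reranking blew its 200ms budget," which is a debuggable statement.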
Streaming the LLM response to the user while generation is in progress makes the perceived latency much lower. The user sees the first tokens in 300-500ms even though the full response takes 1,500ms.
When to Skip RAG Entirely
RAG is not always the right pattern. Sometimes simpler approaches work better:
| Scenario | Better Approach | Why |
|---|---|---|
| Static FAQ (< 50 questions) | Keyword match + template response | Faster, cheaper, deterministic |
| Structured data queries | SQL/API query + template | LLM adds latency and hallucination risk |
| Real-time data (stock prices, inventory) | Direct API call | Embeddings are stale by definition |
| Simple classification | Fine-tuned classifier | Cheaper, faster, more reliable |
| Document summarization | Direct LLM call (no retrieval) | The full document IS the context |
RAG makes sense when you have a large knowledge base (hundreds to thousands of documents), natural language queries that can't be keyword-matched, and a need for synthesized answers from multiple sources. If your use case doesn't match this profile, a simpler approach will be more reliable and cheaper.
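That profile can be written down as a crude predicate for architecture reviews. The thresholds below are illustrative judgment calls, not hard rules:

```typescript
// Does this use case match the RAG profile described above?
// The 100-document threshold is an assumed cutoff, not a standard.
interface UseCase {
  documentCount: number;
  naturalLanguageQueries: boolean;
  needsMultiSourceSynthesis: boolean;
  realTimeData: boolean;
}

function ragIsAFit(uc: UseCase): boolean {
  if (uc.realTimeData) return false; // embeddings are stale by definition
  return (
    uc.documentCount >= 100 &&       // large knowledge base
    uc.naturalLanguageQueries &&     // can't be keyword-matched
    uc.needsMultiSourceSynthesis     // answers span multiple sources
  );
}
```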
For how we approach these architecture decisions in our AI services practice, and for broader patterns in AI workflow design, those pages provide more context.
Common Pitfalls
- Fixed-size chunking. Chunks that split paragraphs, tables, or code blocks mid-content produce garbage retrieval. Use semantic chunking that respects document structure.
- No chunk overlap. Without overlap, queries that span chunk boundaries get partial context from each and a complete answer from neither.
- No reranking. Embedding similarity is a rough filter. A cross-encoder reranker dramatically improves the quality of the top-5 results.
- Filling the entire context window. Leave room for the system prompt, user query, AND the expected response. A prompt stuffed with 15 chunks leaves no room for a quality answer.
- No hybrid search. Pure vector search fails on keyword queries (error codes, product SKUs). Pure text search fails on semantic queries. Use both.
- No semantic caching. Similar questions asked by different users trigger the full RAG pipeline every time. A semantic cache with a 0.95 similarity threshold reduces costs significantly.
- Trusting RAG output without verification. RAG reduces hallucination. It doesn't eliminate it. Verify citations against the retrieved context. Flag unverified claims.
- No monitoring. Track retrieval quality (were the right chunks retrieved?), answer quality (did the user find the answer helpful?), latency, cost per query, and cache hit rate.
Key Takeaways
- RAG is a retrieval pattern, not a solution. A reliable system needs query analysis, routing, hybrid search, reranking, context assembly, generation, and output validation on top of retrieval.
- Chunk quality determines answer quality. Semantic chunking with overlap, metadata, and document structure preservation is the foundation. Everything else is built on top of good chunks.
- Retrieve broadly, rerank aggressively. Get 20 candidates with embedding similarity. Score them with a cross-encoder. Take the top 5. Filter by minimum relevance.
- Hybrid search handles what vectors miss. Keywords, error codes, product IDs, and exact matches need text search. Semantic queries need vector search. Use both with reciprocal rank fusion.
- LLM generation dominates cost. Reducing context size (fewer, better chunks) is the most effective cost optimization.
- Sometimes RAG is the wrong pattern. Static FAQs, structured data queries, real-time data, and simple classification all have simpler, more reliable solutions.
We build production RAG systems as part of our AI services and data engineering practice. If you're building a RAG system or debugging one that's unreliable, talk to our team or request a quote. You can also explore our methodology page for how we approach AI projects.