RAG Is Not Enough: What Reliable AI Systems Need on Top
Where RAG breaks in production and what to build on top. Chunk quality, orchestration layers, hybrid search, hallucination boundaries, cost management, and when to skip RAG entirely.
The RAG Demo Trap
Every RAG demo works. Upload some PDFs, chunk them, embed them, query them, get an answer. The demo is always impressive. The stakeholders are excited. The team estimates two weeks to production.
Six months later, the system is still unreliable. Users get wrong answers from outdated chunks. The model confidently cites documents that say the opposite of what the answer claims. Costs are 10x the original estimate. And nobody can figure out why the same question produces different answers depending on the time of day.
RAG is not a solution. RAG is a retrieval pattern. A reliable AI system needs an orchestration layer, quality controls, hybrid search, hallucination boundaries, cost management, and monitoring on top of RAG. This article covers what we've learned building production RAG systems.
Our guides on enterprise RAG architecture and vector search cover the foundational patterns. This article focuses on where those patterns break and what you need to build beyond them.
Where RAG Breaks
| Failure Mode | What Happens | How Common |
|---|---|---|
| Chunk quality | Wrong chunk boundaries split context, answer is based on partial information | Very common |
| Stale data | Index not updated, answer is based on outdated document | Common |
| Retrieval miss | Relevant document exists but embedding similarity doesn't surface it | Common |
| Hallucination despite retrieval | Model ignores retrieved context and generates from training data | Common |
| Context window overflow | Too many chunks retrieved, model loses focus | Moderate |
| Cross-document confusion | Chunks from different documents mixed, model blends contradictory facts | Moderate |
| Cost explosion | Embedding + retrieval + generation costs scale with query volume | Gradual |
| Latency spikes | Vector search + reranking + generation takes too long for interactive use | Moderate |
Chunk Quality Is Everything
The most underestimated problem. If your chunks split a paragraph in the middle of a thought, the retrieved context is incomplete. If your chunks are too large, irrelevant content dilutes the useful information. If your chunks don't preserve document structure (headings, tables, lists), the model loses the organizational context.
// Bad: fixed-size chunks break context
function naiveChunk(text: string, size: number): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size));
  }
  // Problem: splits sentences, paragraphs, tables mid-content
  return chunks;
}
// Better: semantic chunking with overlap
function semanticChunk(text: string, options: ChunkOptions): Chunk[] {
  const sections = splitByHeadings(text); // Respect document structure
  const paragraphs = sections.flatMap(s =>
    splitByParagraphs(s, { maxSize: options.maxChunkSize })
  );
  return paragraphs.map((p, i) => ({
    content: p.text,
    metadata: {
      section: p.sectionTitle,
      pageNumber: p.pageNumber,
      documentId: p.documentId,
      position: i,
    },
    // Overlap: include last 2 sentences from previous chunk
    prefix: i > 0 ? getLastSentences(paragraphs[i - 1].text, 2) : '',
  }));
}
The overlap matters. Without it, a question that spans two chunks gets partial context from each and a complete answer from neither. With 2-3 sentence overlap, the model has enough context to bridge chunk boundaries.
Chunk metadata is equally critical. Every chunk must carry its source document ID, section title, page number, and position. Without metadata, you can't tell the user where the answer came from. Without source attribution, the answer is unverifiable.
Retrieval Quality vs Retrieval Quantity
Retrieving more chunks doesn't mean better answers. In practice, we've found that 3-5 high-quality chunks consistently outperform 10-15 mediocre chunks.
| Chunks Retrieved | Answer Quality | Latency | Cost |
|---|---|---|---|
| 1-2 | Risk of missing context | Fast | Low |
| 3-5 | Best balance (recommended) | Moderate | Moderate |
| 5-10 | Diminishing returns, some noise | Slower | Higher |
| 10+ | Context dilution, model confused | Slow | High |
The solution: retrieve broadly, then rerank aggressively.
async function retrieveAndRerank(query: string, options: RetrievalOptions) {
  // Step 1: Broad retrieval (get 20 candidates)
  const candidates = await vectorStore.search(query, { limit: 20 });
  // Step 2: Rerank with a cross-encoder (score each candidate against the query)
  const reranked = await reranker.rank(query, candidates, {
    model: 'cross-encoder/ms-marco-MiniLM-L-12-v2',
  });
  // Step 3: Take top 5 after reranking
  const topChunks = reranked.slice(0, 5);
  // Step 4: Filter by minimum relevance score
  return topChunks.filter(c => c.score > options.minRelevanceScore);
}
The reranker is a cross-encoder model that scores each candidate against the query with much higher accuracy than cosine similarity on embeddings. It's slower (runs inference per candidate), but the quality improvement is substantial. Running it on 20 candidates to select 5 adds 100-200ms of latency, which is acceptable for most use cases.
The Orchestration Layer RAG Needs
Raw RAG is: embed query, search vectors, stuff context into prompt, generate. A production system needs an orchestration layer between retrieval and generation.
User Query
     │
     ▼
┌──────────────────┐
│  Query Analysis  │  Classify intent, extract entities, detect language
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│     Routing      │  Which index? Which retrieval strategy? Cache hit?
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│    Retrieval     │  Vector search + keyword search (hybrid)
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│    Reranking     │  Cross-encoder scoring, filter low-relevance chunks
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Context Assembly │  Order chunks, add metadata, respect token budget
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│    Generation    │  LLM call with assembled context + system prompt
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Output Validation│  Check for hallucination, verify citations, PII scan
└────────┬─────────┘
         │
         ▼
      Response
Query Analysis
Not every query needs RAG. Some queries are conversational ("hello", "thanks"). Some are about the system itself ("how do I use this tool?"). Some are ambiguous and need clarification. The query analyzer classifies intent before triggering retrieval.
async function analyzeQuery(query: string): Promise<QueryAnalysis> {
  // Fast classification (can be a small model or rule-based)
  const intent = await classifyIntent(query);
  if (intent === 'greeting' || intent === 'meta') {
    return { needsRetrieval: false, intent, response: getStaticResponse(intent) };
  }
  if (intent === 'ambiguous') {
    return { needsRetrieval: false, intent, clarificationNeeded: true };
  }
  return {
    needsRetrieval: true,
    intent,
    extractedEntities: await extractEntities(query),
    detectedLanguage: await detectLanguage(query),
  };
}
Routing
Different queries might need different indices, different retrieval strategies, or different models.
| Query Type | Index | Strategy | Model |
|---|---|---|---|
| Product question | Products index | Hybrid (text + vector) | Fast model (GPT-4o-mini) |
| Legal/compliance question | Policies index | Vector only (precise) | Accurate model (GPT-4o) |
| Technical support | Knowledge base index | Hybrid + rerank | Fast model |
| Multi-language query | Multilingual index | Vector with language filter | Multilingual model |
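A routing layer can be as simple as a lookup from classified intent to a retrieval configuration. A minimal sketch — the intent labels, index names, and model tiers below are illustrative assumptions, not a real API:

```typescript
// Route: which index, retrieval strategy, and model tier to use per intent.
// All names here are hypothetical placeholders.
type Route = { index: string; strategy: "vector" | "hybrid"; model: string };

const ROUTES: Record<string, Route> = {
  product: { index: "products", strategy: "hybrid", model: "fast" },
  legal:   { index: "policies", strategy: "vector", model: "accurate" },
  support: { index: "kb",       strategy: "hybrid", model: "fast" },
};

// Unrecognized intents fall back to the general knowledge base.
const DEFAULT_ROUTE: Route = { index: "kb", strategy: "hybrid", model: "fast" };

function routeQuery(intent: string): Route {
  return ROUTES[intent] ?? DEFAULT_ROUTE;
}
```

Keeping the table in data rather than branching logic makes it easy to add routes without touching the pipeline code.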
Context Assembly
After retrieval and reranking, chunks must be assembled into a prompt that respects the model's token budget.
function assembleContext(chunks: RankedChunk[], tokenBudget: number): string {
  let context = '';
  let tokensUsed = 0;
  for (const chunk of chunks) {
    const chunkTokens = estimateTokens(chunk.content);
    if (tokensUsed + chunkTokens > tokenBudget) break;
    context += `\n\n---\nSource: ${chunk.metadata.documentTitle} (${chunk.metadata.section})\n`;
    context += chunk.content;
    tokensUsed += chunkTokens;
  }
  return context;
}
The token budget must account for the system prompt, the user query, the assembled context, AND the expected response length. A common mistake is filling the entire context window with retrieved chunks, leaving no room for a quality response.
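The budget arithmetic is simple but worth making explicit. A sketch, assuming all token counts come from your tokenizer of choice:

```typescript
// Derive the chunk budget from the model's total window by subtracting
// everything else that must fit. The parameter names are illustrative.
function contextBudget(
  modelWindow: number,        // total tokens the model accepts
  systemPromptTokens: number, // fixed instruction overhead
  queryTokens: number,        // the user's question
  maxResponseTokens: number,  // room reserved for the answer
): number {
  const budget =
    modelWindow - systemPromptTokens - queryTokens - maxResponseTokens;
  return Math.max(0, budget); // never return a negative budget
}
```

If this returns 0, the fix is a shorter system prompt or a larger window, not dropping the response reservation.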
Hybrid Search: Text + Vector
Pure vector search misses keyword-specific queries. A user searching for "error code E-4021" will get poor results from embedding similarity because error codes are not semantically meaningful. Pure text search misses semantic queries. A user searching for "how to fix login problems" won't find a document titled "Authentication Troubleshooting Guide."
Hybrid search combines both:
async function hybridSearch(query: string, options: SearchOptions) {
  // Parallel execution
  const [vectorResults, textResults] = await Promise.all([
    vectorStore.search(query, { limit: options.vectorLimit }),
    textIndex.search(query, { limit: options.textLimit }),
  ]);
  // Reciprocal Rank Fusion (RRF) to merge results
  const merged = reciprocalRankFusion(vectorResults, textResults, {
    vectorWeight: 0.6,
    textWeight: 0.4,
  });
  return merged.slice(0, options.totalLimit);
}

function reciprocalRankFusion(
  vectorResults: SearchResult[],
  textResults: SearchResult[],
  weights: { vectorWeight: number; textWeight: number },
): SearchResult[] {
  const scores = new Map<string, number>();
  const k = 60; // RRF constant
  vectorResults.forEach((result, rank) => {
    const score = (scores.get(result.id) || 0) + weights.vectorWeight / (k + rank + 1);
    scores.set(result.id, score);
  });
  textResults.forEach((result, rank) => {
    const score = (scores.get(result.id) || 0) + weights.textWeight / (k + rank + 1);
    scores.set(result.id, score);
  });
  return Array.from(scores.entries())
    .sort(([, a], [, b]) => b - a)
    .map(([id, score]) => ({ id, score }));
}
The weight ratio (vector 0.6, text 0.4) is a starting point. Tune it based on your query distribution. If most queries are keyword-heavy (product SKUs, error codes), increase text weight. If most queries are natural language, increase vector weight.
For more on search architecture in commerce contexts, see our ecommerce platforms guide.
Hallucination Boundaries
RAG reduces hallucination compared to pure LLM generation. It does not eliminate it. The model can still:
- Ignore retrieved context and generate from training data
- Blend information from multiple chunks incorrectly
- Invent citations that don't exist in the retrieved context
- Extrapolate beyond what the context supports
Mitigation Strategies
1. Constrained system prompts:
You are a support assistant. Answer ONLY based on the provided context.
If the context does not contain enough information to answer, say
"I don't have enough information to answer that question."
Do NOT use information from your training data.
Every claim must reference a specific source from the context.
2. Citation verification:
async function verifyCitations(response: string, chunks: RankedChunk[]): Promise<VerificationResult> {
  const citations = extractCitations(response);
  const verified: Citation[] = [];
  const unverified: Citation[] = [];
  for (const citation of citations) {
    const found = chunks.some(chunk =>
      chunk.content.includes(citation.claimedText) ||
      fuzzyMatch(chunk.content, citation.claimedText, 0.85)
    );
    (found ? verified : unverified).push(citation);
  }
  return {
    allVerified: unverified.length === 0,
    verified,
    unverified,
    // Guard against division by zero when the response contains no citations
    confidenceScore: citations.length === 0 ? 0 : verified.length / citations.length,
  };
}
3. Confidence scoring:
If the model's response doesn't align well with the retrieved context (low overlap, no direct quotes), flag it as low confidence. Show a warning to the user or escalate to a human.
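One cheap way to approximate "alignment with the retrieved context" is lexical overlap between the response and the context. This is only a heuristic sketch — production systems typically use an NLI model or cross-encoder for this check instead:

```typescript
// Fraction of substantive response words that also appear in the context.
// Word-overlap is a crude proxy for grounding; thresholds for "low
// confidence" would need tuning on real traffic.
function overlapConfidence(response: string, context: string): number {
  const tokenize = (s: string) =>
    s.toLowerCase().split(/\W+/).filter(w => w.length > 3); // drop stopword-length tokens
  const contextWords = new Set(tokenize(context));
  const responseWords = tokenize(response);
  if (responseWords.length === 0) return 0;
  const hits = responseWords.filter(w => contextWords.has(w)).length;
  return hits / responseWords.length;
}
```

A score near 1 means the response stays close to the retrieved text; a low score is a signal to warn the user or escalate, not proof of hallucination.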
For more on AI failure modes and how to handle them, see our AI failure modes guide.
Cost and Latency
RAG costs scale with query volume across three dimensions:
| Component | Cost Driver | Typical Range |
|---|---|---|
| Embedding query | Per query (model inference) | $0.0001 per query |
| Vector search | Per query (compute + I/O) | $0.0005 per query |
| Reranking | Per query × candidates (model inference) | $0.001 per query |
| LLM generation | Input tokens (context) + output tokens | $0.01-0.10 per query |
| Embedding documents | One-time per document (on ingest) | $0.0001 per page |
LLM generation dominates cost. Reducing context size (fewer chunks, shorter chunks) directly reduces the most expensive component.
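A back-of-envelope cost model makes the trade-off concrete. The structure below is a sketch; the rates are placeholders you'd replace with your provider's actual pricing:

```typescript
// Per-query cost: LLM tokens dominate, everything else is roughly a
// fixed per-query overhead. All prices here are illustrative inputs.
interface CostInputs {
  inputTokens: number;     // context + prompts sent to the LLM
  outputTokens: number;    // generated response
  inputPricePerM: number;  // $ per 1M input tokens
  outputPricePerM: number; // $ per 1M output tokens
  fixedPerQuery: number;   // embedding + search + rerank overhead
}

function queryCost(c: CostInputs): number {
  const llm =
    (c.inputTokens / 1_000_000) * c.inputPricePerM +
    (c.outputTokens / 1_000_000) * c.outputPricePerM;
  return llm + c.fixedPerQuery;
}
```

Plugging in typical numbers shows why trimming two chunks from the context saves more than any optimization of the retrieval side.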
Caching Strategies
// Semantic cache: cache responses for similar queries
async function cachedQuery(query: string): Promise<string | null> {
// Embed the query
const queryEmbedding = await embedder.embed(query);
// Search cache index for similar queries
const cached = await cacheIndex.search(queryEmbedding, {
minSimilarity: 0.95, // High threshold for cache hits
limit: 1,
});
if (cached.length > 0) {
return cached[0].response; // Cache hit
}
return null; // Cache miss, proceed with full RAG pipeline
}
Semantic caching works because many users ask similar questions in slightly different ways. "How do I reset my password?" and "Password reset instructions" are different strings but semantically identical. A 0.95 similarity threshold ensures only near-identical queries get cached responses.
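Under the hood, the cache hit test reduces to a cosine similarity check against stored query embeddings. A minimal in-memory sketch — a real system would use the vector store's index rather than a linear scan:

```typescript
// Cosine similarity between two dense vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the best cached response above the similarity threshold, or null.
function cacheLookup(
  queryEmbedding: number[],
  entries: { embedding: number[]; response: string }[],
  minSimilarity = 0.95,
): string | null {
  let best: { sim: number; response: string } | null = null;
  for (const e of entries) {
    const sim = cosineSimilarity(queryEmbedding, e.embedding);
    if (sim >= minSimilarity && (!best || sim > best.sim)) {
      best = { sim, response: e.response };
    }
  }
  return best ? best.response : null;
}
```

The threshold is the whole game: lower it and users start receiving cached answers to questions they didn't ask.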
Latency Budget
For interactive use (chatbot, support assistant), the total pipeline must complete in under 3 seconds:
| Stage | Budget | Optimization |
|---|---|---|
| Query analysis | 50ms | Rule-based or small model |
| Cache check | 30ms | In-memory vector index |
| Vector search | 100ms | Dedicated search cluster |
| Text search | 100ms | Parallel with vector search |
| Reranking | 200ms | Small cross-encoder, limit candidates |
| Context assembly | 10ms | In-memory |
| LLM generation | 1,500ms | Streaming, fast model |
| Output validation | 100ms | Rule-based + small model |
| Total | ~2,100ms | |
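A budget is only useful if you measure each stage against it. A minimal timing wrapper, assuming stage results are awaited promises (the monitoring sink is left out of this sketch):

```typescript
// Wrap a pipeline stage, recording its wall-clock duration even when it
// throws. `timings` would feed your metrics system in a real deployment.
async function timed<T>(
  stage: string,
  timings: Record<string, number>,
  fn: () => Promise<T>,
): Promise<T> {
  const start = Date.now();
  try {
    return await fn();
  } finally {
    timings[stage] = Date.now() - start;
  }
}
```

Wrapping every stage this way turns "the pipeline is slow" into "reranking blew its 200ms budget," which is a debuggable statement.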
Streaming the LLM response to the user while generation is in progress makes the perceived latency much lower. The user sees the first tokens in 300-500ms even though the full response takes 1,500ms.
When to Skip RAG Entirely
RAG is not always the right pattern. Sometimes simpler approaches work better:
| Scenario | Better Approach | Why |
|---|---|---|
| Static FAQ (< 50 questions) | Keyword match + template response | Faster, cheaper, deterministic |
| Structured data queries | SQL/API query + template | LLM adds latency and hallucination risk |
| Real-time data (stock prices, inventory) | Direct API call | Embeddings are stale by definition |
| Simple classification | Fine-tuned classifier | Cheaper, faster, more reliable |
| Document summarization | Direct LLM call (no retrieval) | The full document IS the context |
RAG makes sense when you have a large knowledge base (hundreds to thousands of documents), natural language queries that can't be keyword-matched, and a need for synthesized answers from multiple sources. If your use case doesn't match this profile, a simpler approach will be more reliable and cheaper.
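That profile can be written down as a crude predicate for architecture reviews. The thresholds below are illustrative judgment calls, not hard rules:

```typescript
// Does this use case match the RAG profile described above?
// The 100-document threshold is an assumed cutoff, not a standard.
interface UseCase {
  documentCount: number;
  naturalLanguageQueries: boolean;
  needsMultiSourceSynthesis: boolean;
  realTimeData: boolean;
}

function ragIsAFit(uc: UseCase): boolean {
  if (uc.realTimeData) return false; // embeddings are stale by definition
  return (
    uc.documentCount >= 100 &&       // large knowledge base
    uc.naturalLanguageQueries &&     // can't be keyword-matched
    uc.needsMultiSourceSynthesis     // answers span multiple sources
  );
}
```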
For how we approach these architecture decisions in our AI services practice, and for broader patterns in AI workflow design, those pages provide more context.
Common Pitfalls
- Fixed-size chunking. Chunks that split paragraphs, tables, or code blocks mid-content produce garbage retrieval. Use semantic chunking that respects document structure.
- No chunk overlap. Without overlap, queries that span chunk boundaries get partial context from each and a complete answer from neither.
- No reranking. Embedding similarity is a rough filter. A cross-encoder reranker dramatically improves the quality of the top-5 results.
- Filling the entire context window. Leave room for the system prompt, user query, AND the expected response. A prompt stuffed with 15 chunks leaves no room for a quality answer.
- No hybrid search. Pure vector search fails on keyword queries (error codes, product SKUs). Pure text search fails on semantic queries. Use both.
- No semantic caching. Similar questions asked by different users trigger the full RAG pipeline every time. A semantic cache with a 0.95 similarity threshold reduces costs significantly.
- Trusting RAG output without verification. RAG reduces hallucination. It doesn't eliminate it. Verify citations against the retrieved context. Flag unverified claims.
- No monitoring. Track retrieval quality (were the right chunks retrieved?), answer quality (did the user find the answer helpful?), latency, cost per query, and cache hit rate.
Key Takeaways
- RAG is a retrieval pattern, not a solution. A reliable system needs query analysis, routing, hybrid search, reranking, context assembly, generation, and output validation on top of retrieval.
- Chunk quality determines answer quality. Semantic chunking with overlap, metadata, and document structure preservation is the foundation. Everything else is built on top of good chunks.
- Retrieve broadly, rerank aggressively. Get 20 candidates with embedding similarity. Score them with a cross-encoder. Take the top 5. Filter by minimum relevance.
- Hybrid search handles what vectors miss. Keywords, error codes, product IDs, and exact matches need text search. Semantic queries need vector search. Use both with reciprocal rank fusion.
- LLM generation dominates cost. Reducing context size (fewer, better chunks) is the most effective cost optimization.
- Sometimes RAG is the wrong pattern. Static FAQs, structured data queries, real-time data, and simple classification all have simpler, more reliable solutions.
We build production RAG systems as part of our AI services and data engineering practice. If you're building a RAG system or debugging one that's unreliable, talk to our team or request a quote. You can also explore our methodology page for how we approach AI projects.