Technical Guide

RAG Is Not Enough: What Reliable AI Systems Need on Top

Where RAG breaks in production and what to build on top. Chunk quality, orchestration layers, hybrid search, hallucination boundaries, cost management, and when to skip RAG entirely.

February 18, 2026 · 18 min read · Oronts Engineering Team

The RAG Demo Trap

Every RAG demo works. Upload some PDFs, chunk them, embed them, query them, get an answer. The demo is always impressive. The stakeholders are excited. The team estimates two weeks to production.

Six months later, the system is still unreliable. Users get wrong answers from outdated chunks. The model confidently cites documents that say the opposite of what the answer claims. Costs are 10x the original estimate. And nobody can figure out why the same question produces different answers depending on the time of day.

RAG is not a solution. RAG is a retrieval pattern. A reliable AI system needs an orchestration layer, quality controls, hybrid search, hallucination boundaries, cost management, and monitoring on top of RAG. This article covers what we've learned building production RAG systems.

Our guides on enterprise RAG architecture and vector search cover the foundational patterns. This article focuses on where those patterns break and what you need beyond them.

Where RAG Breaks

Failure Mode | What Happens | How Common
Chunk quality | Wrong chunk boundaries split context; the answer is based on partial information | Very common
Stale data | Index not updated; the answer is based on an outdated document | Common
Retrieval miss | Relevant document exists but embedding similarity doesn't surface it | Common
Hallucination despite retrieval | Model ignores retrieved context and generates from training data | Common
Context window overflow | Too many chunks retrieved; model loses focus | Moderate
Cross-document confusion | Chunks from different documents are mixed; model blends contradictory facts | Moderate
Cost explosion | Embedding + retrieval + generation costs scale with query volume | Gradual
Latency spikes | Vector search + reranking + generation takes too long for interactive use | Moderate

Chunk Quality Is Everything

The most underestimated problem. If your chunks split a paragraph in the middle of a thought, the retrieved context is incomplete. If your chunks are too large, irrelevant content dilutes the useful information. If your chunks don't preserve document structure (headings, tables, lists), the model loses the organizational context.

// Bad: fixed-size chunks break context
function naiveChunk(text: string, size: number): string[] {
    const chunks = [];
    for (let i = 0; i < text.length; i += size) {
        chunks.push(text.slice(i, i + size));
    }
    return chunks;
    // Problem: splits sentences, paragraphs, tables mid-content
}

// Better: semantic chunking with overlap
function semanticChunk(text: string, options: ChunkOptions): Chunk[] {
    const sections = splitByHeadings(text);      // Respect document structure
    const paragraphs = sections.flatMap(s =>
        splitByParagraphs(s, { maxSize: options.maxChunkSize })
    );

    return paragraphs.map((p, i) => ({
        content: p.text,
        metadata: {
            section: p.sectionTitle,
            pageNumber: p.pageNumber,
            documentId: p.documentId,
            position: i,
        },
        // Overlap: include last 2 sentences from previous chunk
        prefix: i > 0 ? getLastSentences(paragraphs[i - 1].text, 2) : '',
    }));
}

The overlap matters. Without it, a question that spans two chunks gets partial context from each and a complete answer from neither. With 2-3 sentence overlap, the model has enough context to bridge chunk boundaries.
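The getLastSentences helper above is assumed rather than shown. A minimal sketch could look like the following; the regex-based sentence splitter is deliberately naive, and a production system would swap in a proper segmenter:

```typescript
// Hypothetical helper: return the last `n` sentences of a chunk, to be
// prepended to the next chunk as overlap. Sentence detection here is a
// naive regex (terminal punctuation followed by whitespace or end of text).
function getLastSentences(text: string, n: number): string {
    const sentences = text.match(/[^.!?]+[.!?]+(\s|$)/g) ?? [text];
    return sentences.slice(-n).join('').trim();
}
```

For example, `getLastSentences("A one. B two. C three.", 2)` returns the last two sentences, which the next chunk carries as its prefix.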

Chunk metadata is equally critical. Every chunk must carry its source document ID, section title, page number, and position. Without metadata, you can't tell the user where the answer came from. Without source attribution, the answer is unverifiable.

Retrieval Quality vs Retrieval Quantity

Retrieving more chunks doesn't mean better answers. In practice, we've found that 3-5 high-quality chunks consistently outperform 10-15 mediocre chunks.

Chunks Retrieved | Answer Quality | Latency | Cost
1-2 | Risk of missing context | Fast | Low
3-5 | Best balance (recommended) | Moderate | Moderate
5-10 | Diminishing returns, some noise | Slower | Higher
10+ | Context dilution, model confused | Slow | High

The solution: retrieve broadly, then rerank aggressively.

async function retrieveAndRerank(query: string, options: RetrievalOptions) {
    // Step 1: Broad retrieval (get 20 candidates)
    const candidates = await vectorStore.search(query, { limit: 20 });

    // Step 2: Rerank with a cross-encoder (score each candidate against the query)
    const reranked = await reranker.rank(query, candidates, {
        model: 'cross-encoder/ms-marco-MiniLM-L-12-v2',
    });

    // Step 3: Take top 5 after reranking
    const topChunks = reranked.slice(0, 5);

    // Step 4: Filter by minimum relevance score
    return topChunks.filter(c => c.score > options.minRelevanceScore);
}

The reranker is a cross-encoder model that scores each candidate against the query with much higher accuracy than cosine similarity on embeddings. It's slower (runs inference per candidate), but the quality improvement is substantial. Running it on 20 candidates to select 5 adds 100-200ms of latency, which is acceptable for most use cases.

The Orchestration Layer RAG Needs

Raw RAG is: embed query, search vectors, stuff context into prompt, generate. A production system needs an orchestration layer between retrieval and generation.

User Query
    β”‚
    β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Query Analysis   β”‚  Classify intent, extract entities, detect language
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Routing          β”‚  Which index? Which retrieval strategy? Cache hit?
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Retrieval        β”‚  Vector search + keyword search (hybrid)
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Reranking        β”‚  Cross-encoder scoring, filter low-relevance chunks
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Context Assembly β”‚  Order chunks, add metadata, respect token budget
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Generation       β”‚  LLM call with assembled context + system prompt
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Output Validationβ”‚  Check for hallucination, verify citations, PII scan
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
    Response

Query Analysis

Not every query needs RAG. Some queries are conversational ("hello", "thanks"). Some are about the system itself ("how do I use this tool?"). Some are ambiguous and need clarification. The query analyzer classifies intent before triggering retrieval.

async function analyzeQuery(query: string): Promise<QueryAnalysis> {
    // Fast classification (can be a small model or rule-based)
    const intent = await classifyIntent(query);

    if (intent === 'greeting' || intent === 'meta') {
        return { needsRetrieval: false, intent, response: getStaticResponse(intent) };
    }

    if (intent === 'ambiguous') {
        return { needsRetrieval: false, intent, clarificationNeeded: true };
    }

    return {
        needsRetrieval: true,
        intent,
        extractedEntities: await extractEntities(query),
        detectedLanguage: await detectLanguage(query),
    };
}

Routing

Different queries might need different indices, different retrieval strategies, or different models.

Query Type | Index | Strategy | Model
Product question | Products index | Hybrid (text + vector) | Fast model (GPT-4o-mini)
Legal/compliance question | Policies index | Vector only (precise) | Accurate model (GPT-4o)
Technical support | Knowledge base index | Hybrid + rerank | Fast model
Multi-language query | Multilingual index | Vector with language filter | Multilingual model
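In code, this table reduces to a simple lookup. The intent labels, index names, and model identifiers below are illustrative assumptions, not a fixed schema:

```typescript
interface Route {
    index: string;
    strategy: 'hybrid' | 'vector' | 'hybrid+rerank';
    model: string;
}

// Illustrative routing table; intent labels and model names are assumptions.
const ROUTES: Record<string, Route> = {
    product: { index: 'products',       strategy: 'hybrid',        model: 'gpt-4o-mini' },
    legal:   { index: 'policies',       strategy: 'vector',        model: 'gpt-4o' },
    support: { index: 'knowledge-base', strategy: 'hybrid+rerank', model: 'gpt-4o-mini' },
};

function route(intent: string): Route {
    // Unknown intents fall back to the general support route.
    return ROUTES[intent] ?? ROUTES.support;
}
```

Keeping the table as data rather than branching logic makes it easy to audit which queries hit which index and model.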

Context Assembly

After retrieval and reranking, chunks must be assembled into a prompt that respects the model's token budget.

function assembleContext(chunks: RankedChunk[], tokenBudget: number): string {
    let context = '';
    let tokensUsed = 0;

    for (const chunk of chunks) {
        const chunkTokens = estimateTokens(chunk.content);
        if (tokensUsed + chunkTokens > tokenBudget) break;

        context += `\n\n---\nSource: ${chunk.metadata.documentTitle} (${chunk.metadata.section})\n`;
        context += chunk.content;
        tokensUsed += chunkTokens;
    }

    return context;
}

The token budget must account for the system prompt, the user query, the assembled context, AND the expected response length. A common mistake is filling the entire context window with retrieved chunks, leaving no room for a quality response.
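The arithmetic is simple but worth making explicit. A sketch, with illustrative window sizes and reserves:

```typescript
// How many tokens remain for retrieved context after reserving space for
// the fixed prompt parts and the expected response. All numbers here are
// illustrative; use your model's actual window and your prompt's real sizes.
function contextBudget(
    contextWindow: number,   // model's total token window
    systemTokens: number,    // system prompt
    queryTokens: number,     // user query
    responseReserve: number, // room reserved for the answer
): number {
    return Math.max(0, contextWindow - systemTokens - queryTokens - responseReserve);
}

// e.g. an 8,000-token window, 500 system, 100 query, 1,500 response reserve
const budget = contextBudget(8000, 500, 100, 1500); // 5,900 tokens for chunks
```

Passing `budget` to assembleContext above keeps the response reserve honest instead of letting chunks crowd it out.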

Hybrid Search: Text + Vector

Pure vector search misses keyword-specific queries. A user searching for "error code E-4021" will get poor results from embedding similarity because error codes are not semantically meaningful. Pure text search misses semantic queries. A user searching for "how to fix login problems" won't find a document titled "Authentication Troubleshooting Guide."

Hybrid search combines both:

async function hybridSearch(query: string, options: SearchOptions) {
    // Parallel execution
    const [vectorResults, textResults] = await Promise.all([
        vectorStore.search(query, { limit: options.vectorLimit }),
        textIndex.search(query, { limit: options.textLimit }),
    ]);

    // Reciprocal Rank Fusion (RRF) to merge results
    const merged = reciprocalRankFusion(vectorResults, textResults, {
        vectorWeight: 0.6,
        textWeight: 0.4,
    });

    return merged.slice(0, options.totalLimit);
}

function reciprocalRankFusion(
    vectorResults: SearchResult[],
    textResults: SearchResult[],
    weights: { vectorWeight: number; textWeight: number },
): SearchResult[] {
    const scores = new Map<string, number>();
    const k = 60; // RRF constant

    vectorResults.forEach((result, rank) => {
        const score = (scores.get(result.id) || 0) + weights.vectorWeight / (k + rank + 1);
        scores.set(result.id, score);
    });

    textResults.forEach((result, rank) => {
        const score = (scores.get(result.id) || 0) + weights.textWeight / (k + rank + 1);
        scores.set(result.id, score);
    });

    return Array.from(scores.entries())
        .sort(([, a], [, b]) => b - a)
        .map(([id, score]) => ({ id, score }));
}

The weight ratio (vector 0.6, text 0.4) is a starting point. Tune it based on your query distribution. If most queries are keyword-heavy (product SKUs, error codes), increase text weight. If most queries are natural language, increase vector weight.

For more on search architecture in commerce contexts, see our ecommerce platforms guide.

Hallucination Boundaries

RAG reduces hallucination compared to pure LLM generation. It does not eliminate it. The model can still:

  • Ignore retrieved context and generate from training data
  • Blend information from multiple chunks incorrectly
  • Invent citations that don't exist in the retrieved context
  • Extrapolate beyond what the context supports

Mitigation Strategies

1. Constrained system prompts:

You are a support assistant. Answer ONLY based on the provided context.
If the context does not contain enough information to answer, say
"I don't have enough information to answer that question."
Do NOT use information from your training data.
Every claim must reference a specific source from the context.

2. Citation verification:

function verifyCitations(response: string, chunks: RankedChunk[]): VerificationResult {
    const citations = extractCitations(response);
    const verified = [];
    const unverified = [];

    for (const citation of citations) {
        const found = chunks.some(chunk =>
            chunk.content.includes(citation.claimedText) ||
            fuzzyMatch(chunk.content, citation.claimedText, 0.85)
        );
        (found ? verified : unverified).push(citation);
    }

    return {
        allVerified: unverified.length === 0,
        verified,
        unverified,
        confidenceScore: verified.length / (verified.length + unverified.length),
    };
}

3. Confidence scoring:

If the model's response doesn't align well with the retrieved context (low overlap, no direct quotes), flag it as low confidence. Show a warning to the user or escalate to a human.
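One cheap proxy for alignment is lexical overlap between the response and the retrieved chunks. The sketch below uses word overlap, which is a crude stand-in for real entailment checking, and the 0.5 threshold is an assumption to tune:

```typescript
// Rough confidence proxy: fraction of response words that also appear in the
// retrieved context. Low overlap suggests the model drew on training data
// rather than the provided chunks.
function contextOverlap(response: string, contextChunks: string[]): number {
    const contextWords = new Set(
        contextChunks.join(' ').toLowerCase().split(/\W+/).filter(Boolean)
    );
    const responseWords = response.toLowerCase().split(/\W+/).filter(Boolean);
    if (responseWords.length === 0) return 0;
    const hits = responseWords.filter(w => contextWords.has(w)).length;
    return hits / responseWords.length;
}

function isLowConfidence(response: string, chunks: string[], threshold = 0.5): boolean {
    return contextOverlap(response, chunks) < threshold;
}
```

A low score triggers the warning or human escalation described above; a high score is necessary but not sufficient evidence that the answer is grounded.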

For more on AI failure modes and how to handle them, see our AI failure modes guide.

Cost and Latency

RAG costs scale with query volume across three dimensions:

Component | Cost Driver | Typical Range
Embedding query | Per query (model inference) | $0.0001 per query
Vector search | Per query (compute + I/O) | $0.0005 per query
Reranking | Per query × candidates (model inference) | $0.001 per query
LLM generation | Input tokens (context) + output tokens | $0.01-0.10 per query
Embedding documents | One-time per document (on ingest) | $0.0001 per page

LLM generation dominates cost. Reducing context size (fewer chunks, shorter chunks) directly reduces the most expensive component.
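Using the per-query figures from the table above, a back-of-the-envelope monthly cost model makes the dominance of generation obvious. The figures are rough ranges, not quotes:

```typescript
// Back-of-the-envelope monthly cost using the per-query figures above.
// These are rough ranges; substitute your actual provider pricing.
function monthlyCost(queriesPerMonth: number, llmCostPerQuery: number): number {
    const embedding = 0.0001;    // embed the query
    const vectorSearch = 0.0005; // search compute + I/O
    const reranking = 0.001;     // cross-encoder over candidates
    const perQuery = embedding + vectorSearch + reranking + llmCostPerQuery;
    return queriesPerMonth * perQuery;
}

// 100k queries/month at $0.03 of generation per query: generation is ~95%
// of the total, so shrinking the context is where optimization pays off.
const total = monthlyCost(100_000, 0.03);
```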

Caching Strategies

// Semantic cache: cache responses for similar queries
async function cachedQuery(query: string): Promise<string | null> {
    // Embed the query
    const queryEmbedding = await embedder.embed(query);

    // Search cache index for similar queries
    const cached = await cacheIndex.search(queryEmbedding, {
        minSimilarity: 0.95,  // High threshold for cache hits
        limit: 1,
    });

    if (cached.length > 0) {
        return cached[0].response;  // Cache hit
    }

    return null;  // Cache miss, proceed with full RAG pipeline
}

Semantic caching works because many users ask similar questions in slightly different ways. "How do I reset my password?" and "Password reset instructions" are different strings but semantically identical. A 0.95 similarity threshold ensures only near-identical queries get cached responses.
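To make the mechanism concrete, here is a self-contained in-memory sketch. A real implementation would call an embedding model and use a proper vector index rather than the linear scan shown here:

```typescript
// Minimal in-memory semantic cache: cosine similarity gates hits at a
// high threshold. Embeddings come from an external model in practice.
type CacheEntry = { embedding: number[]; response: string };

function cosine(a: number[], b: number[]): number {
    let dot = 0, na = 0, nb = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

class SemanticCache {
    private entries: CacheEntry[] = [];
    constructor(private minSimilarity = 0.95) {}

    get(embedding: number[]): string | null {
        let best: CacheEntry | null = null;
        let bestScore = -1;
        for (const e of this.entries) {
            const score = cosine(embedding, e.embedding);
            if (score > bestScore) { best = e; bestScore = score; }
        }
        return best && bestScore >= this.minSimilarity ? best.response : null;
    }

    set(embedding: number[], response: string): void {
        this.entries.push({ embedding, response });
    }
}
```

On a cache miss, the full pipeline runs and its result is written back with `set`, so the next phrasing of the same question can hit.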

Latency Budget

For interactive use (chatbot, support assistant), the total pipeline must complete in under 3 seconds:

Stage | Budget | Optimization
Query analysis | 50ms | Rule-based or small model
Cache check | 30ms | In-memory vector index
Vector search | 100ms | Dedicated search cluster
Text search | 100ms | Parallel with vector search
Reranking | 200ms | Small cross-encoder, limit candidates
Context assembly | 10ms | In-memory
LLM generation | 1,500ms | Streaming, fast model
Output validation | 100ms | Rule-based + small model
Total | ~2,100ms |

Streaming the LLM response to the user while generation is in progress makes the perceived latency much lower. The user sees the first tokens in 300-500ms even though the full response takes 1,500ms.
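The effect is easy to demonstrate with a simulated token stream; the pacing below is illustrative, not a real model's timing:

```typescript
// Simulated token stream: the full answer takes tokens.length * msPerToken,
// but the caller can start rendering as soon as the first token arrives.
async function* streamTokens(tokens: string[], msPerToken: number) {
    for (const t of tokens) {
        await new Promise(resolve => setTimeout(resolve, msPerToken));
        yield t;
    }
}

async function render(): Promise<{ firstTokenMs: number; totalMs: number }> {
    const start = Date.now();
    let firstTokenMs = -1;
    let rendered = '';
    for await (const token of streamTokens(['Hello', ' ', 'world'], 20)) {
        if (firstTokenMs < 0) firstTokenMs = Date.now() - start;
        rendered += token; // in a UI, append to the visible response here
    }
    return { firstTokenMs, totalMs: Date.now() - start };
}
```

The user's perceived latency is `firstTokenMs`, not `totalMs`, which is why streaming is worth the extra plumbing even when total generation time is unchanged.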

When to Skip RAG Entirely

RAG is not always the right pattern. Sometimes simpler approaches work better:

Scenario | Better Approach | Why
Static FAQ (< 50 questions) | Keyword match + template response | Faster, cheaper, deterministic
Structured data queries | SQL/API query + template | LLM adds latency and hallucination risk
Real-time data (stock prices, inventory) | Direct API call | Embeddings are stale by definition
Simple classification | Fine-tuned classifier | Cheaper, faster, more reliable
Document summarization | Direct LLM call (no retrieval) | The full document IS the context

RAG makes sense when you have a large knowledge base (hundreds to thousands of documents), natural language queries that can't be keyword-matched, and a need for synthesized answers from multiple sources. If your use case doesn't match this profile, a simpler approach will be more reliable and cheaper.

For how we approach these architecture decisions in our AI services practice, and for broader patterns in AI workflow design, those pages provide more context.

Common Pitfalls

  1. Fixed-size chunking. Chunks that split paragraphs, tables, or code blocks mid-content produce garbage retrieval. Use semantic chunking that respects document structure.

  2. No chunk overlap. Without overlap, queries that span chunk boundaries get partial context from each and a complete answer from neither.

  3. No reranking. Embedding similarity is a rough filter. A cross-encoder reranker dramatically improves the quality of the top-5 results.

  4. Filling the entire context window. Leave room for the system prompt, user query, AND the expected response. A prompt stuffed with 15 chunks leaves no room for a quality answer.

  5. No hybrid search. Pure vector search fails on keyword queries (error codes, product SKUs). Pure text search fails on semantic queries. Use both.

  6. No semantic caching. Similar questions asked by different users trigger the full RAG pipeline every time. A semantic cache with 0.95 similarity threshold reduces costs significantly.

  7. Trusting RAG output without verification. RAG reduces hallucination. It doesn't eliminate it. Verify citations against the retrieved context. Flag unverified claims.

  8. No monitoring. You need to track retrieval quality (were the right chunks retrieved?), answer quality (did the user find the answer helpful?), latency, cost per query, and cache hit rate.

Key Takeaways

  • RAG is a retrieval pattern, not a solution. A reliable system needs query analysis, routing, hybrid search, reranking, context assembly, generation, and output validation on top of retrieval.

  • Chunk quality determines answer quality. Semantic chunking with overlap, metadata, and document structure preservation is the foundation. Everything else is built on top of good chunks.

  • Retrieve broadly, rerank aggressively. Get 20 candidates with embedding similarity. Score them with a cross-encoder. Take the top 5. Filter by minimum relevance.

  • Hybrid search handles what vectors miss. Keywords, error codes, product IDs, and exact matches need text search. Semantic queries need vector search. Use both with reciprocal rank fusion.

  • LLM generation dominates cost. Reducing context size (fewer, better chunks) is the most effective cost optimization.

  • Sometimes RAG is the wrong pattern. Static FAQs, structured data queries, real-time data, and simple classification all have simpler, more reliable solutions.

We build production RAG systems as part of our AI services and data engineering practice. If you're building a RAG system or debugging one that's unreliable, talk to our team or request a quote. You can also explore our methodology page for how we approach AI projects.

Topics covered

RAG production · RAG limitations · AI reliability · retrieval augmented generation problems · RAG architecture · hybrid search · RAG orchestration · RAG hallucination

Ready to build production AI systems?

Our team specializes in building production-ready AI systems. Let's discuss how we can help transform your enterprise with cutting-edge technology.

Start a conversation