Technical Guide

Enterprise RAG Systems: A Technical Deep Dive

A comprehensive technical guide to building production-ready Retrieval-Augmented Generation systems at scale. Learn document ingestion pipelines, chunking strategies, embedding models, retrieval optimization, reranking, and hybrid search from engineers who've deployed RAG in production.

February 21, 2026 · 19 min read · Oronts Engineering Team

Why RAG? The Problem We're Actually Solving

Let me be direct: LLMs are powerful but they have a fundamental problem. They only know what they were trained on, and that knowledge has a cutoff date. Ask GPT-4 about your company's Q3 earnings or your internal API documentation, and you'll get a polite "I don't have information about that" or worse, a confident hallucination.

RAG solves this by giving the model access to your data at inference time. Instead of hoping the model memorized the right information, you retrieve relevant documents and feed them directly into the prompt. Simple concept, but the devil is in the implementation details.

We've built RAG systems that handle millions of documents across dozens of enterprise deployments. Here's what we've learned about making them work at scale.

RAG isn't just about adding documents to a prompt. It's about building a retrieval system that consistently finds the right information, even when users ask questions in unexpected ways.

The RAG Pipeline: End-to-End Architecture

Before diving into components, let's understand how everything fits together. A production RAG system has two main phases:

Ingestion Phase (Offline)

Documents → Preprocessing → Chunking → Embedding → Vector Storage

Query Phase (Online)

User Query → Query Processing → Retrieval → Reranking → LLM Generation

Phase | When It Runs | Latency Requirements | Primary Goal
Ingestion | Batch/Scheduled | Minutes to hours acceptable | Maximize recall potential
Query | Real-time | Sub-second | Precision + Speed

The ingestion phase is where you prepare your knowledge base. The query phase is where you actually answer questions. Both need to be optimized, but they have very different constraints.
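To make the split concrete, here's a rough sketch of both phases as code. The `embed`, `vector_store`, and `llm` objects are placeholders for whichever embedding model, vector database, and LLM client you choose, and `chunk_document`, `rerank`, and `build_prompt` stand in for techniques covered in the rest of this guide.

# Sketch: the two RAG phases with placeholder interfaces
def ingest(documents, embed, vector_store):
    # Offline: runs on a schedule, latency measured in minutes
    for doc in documents:
        chunks = chunk_document(doc)                    # see "Chunking Strategies"
        vectors = embed(chunks)                         # see "Embedding Models"
        vector_store.upsert(list(zip(chunks, vectors)), metadata=doc.metadata)

def answer(query, embed, vector_store, llm, top_k=5):
    # Online: runs per request, latency budget is sub-second
    candidates = vector_store.search(embed([query])[0], k=50)   # initial retrieval
    context = rerank(query, candidates)[:top_k]                 # optional reranking
    return llm.generate(build_prompt(query, context))           # grounded generation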

Document Ingestion: Getting Your Data RAG-Ready

Source Connectors: Where Your Data Lives

Enterprise data is scattered everywhere. We've built connectors for:

Source Type | Examples | Challenges
Document Storage | SharePoint, Google Drive, S3 | Access control, incremental sync
Databases | PostgreSQL, MongoDB, Snowflake | Schema mapping, query complexity
SaaS Platforms | Salesforce, Zendesk, Confluence | API rate limits, pagination
Communication | Slack, Teams, Email | Privacy, threading context
Code Repositories | GitHub, GitLab | File relationships, version history

The key insight: don't just dump everything into your vector store. Build smart connectors that:

  1. Respect access controls - If a user can't access a document in SharePoint, they shouldn't retrieve it via RAG
  2. Handle incremental updates - Re-processing millions of documents because one changed is wasteful
  3. Preserve metadata - Document creation date, author, and source are crucial for filtering and attribution
// Example: Smart document sync with change detection
const syncDocuments = async (source) => {
  const lastSync = await db.getLastSyncTime(source.id);
  const changes = await source.getChangesSince(lastSync);

  for (const doc of changes.modified) {
    const chunks = await processDocument(doc);
    await vectorStore.upsert(chunks, {
      sourceId: source.id,
      documentId: doc.id,
      permissions: doc.accessControl
    });
  }

  for (const docId of changes.deleted) {
    await vectorStore.deleteByDocumentId(docId);
  }
};

Document Processing: Handling Real-World Formats

PDFs are the bane of every RAG engineer's existence. They look simple but contain nightmares: multi-column layouts, embedded tables, scanned images, headers and footers that repeat on every page.

Here's our processing hierarchy:

Document Type | Processing Approach | Quality Notes
Markdown/Plain Text | Direct extraction | Excellent quality
HTML/Web Pages | DOM parsing + cleaning | Good, watch for boilerplate
Word Documents | python-docx or similar | Good, preserve structure
PDFs (digital) | PyMuPDF + layout analysis | Varies wildly
PDFs (scanned) | OCR + layout analysis | Lower quality, verify accuracy
Spreadsheets | Cell-aware extraction | Requires semantic understanding
Images/Diagrams | Vision models + OCR | Emerging capability

For PDFs specifically, we've found that layout-aware extraction makes a huge difference:

# Bad: Simple text extraction loses structure
text = pdf_page.get_text()  # "Revenue Q1 Q2 Q3 1000 1200 1500"

# Better: Layout-aware extraction preserves tables
blocks = pdf_page.get_text("dict")["blocks"]
tables = identify_tables(blocks)
# Results in structured data you can actually use
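If you're on a recent PyMuPDF release (1.23+), its built-in table finder handles the common cases. Treat the following as a sketch of the layout analysis step rather than a complete pipeline; `report.pdf` is just a placeholder path.

# Sketch: table-aware PDF extraction with PyMuPDF (needs a recent version with find_tables)
import fitz  # PyMuPDF

doc = fitz.open("report.pdf")
for page in doc:
    for table in page.find_tables().tables:
        rows = table.extract()                    # list of rows, each a list of cell strings
        # store rows as structured data and/or render them as text for embedding
    blocks = page.get_text("dict")["blocks"]      # remaining text blocks with position info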

Chunking Strategies: The Heart of Good Retrieval

This is where most RAG implementations fail. Bad chunking leads to poor retrieval, and no amount of fancy reranking can fix fundamentally broken chunks.

Why Chunk Size Matters

Chunks that are too small lack context. Chunks that are too large dilute relevance and waste precious context window space.

Chunk Size | Pros | Cons | Best For
Small (100-200 tokens) | High precision | Loses context | FAQ, definitions
Medium (300-500 tokens) | Balanced | Jack of all trades | General knowledge bases
Large (500-1000 tokens) | Rich context | Lower precision, expensive | Technical documentation

Chunking Approaches We Actually Use

1. Recursive Character Splitting (Baseline)

The simplest approach: split on paragraphs, then sentences, then characters if needed. Works surprisingly well for homogeneous documents.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)
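Calling it is a one-liner; `document_text` here is a placeholder for whatever your extraction step produced.

chunks = splitter.split_text(document_text)  # returns a list of overlapping text chunks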

2. Semantic Chunking (Better for Diverse Content)

Instead of fixed sizes, detect topic shifts using embeddings. When the semantic similarity between consecutive sentences drops significantly, start a new chunk.

import numpy as np

def semantic_chunking(sentences, embedding_model, threshold=0.5):
    if not sentences:
        return []

    # Embed every sentence once up front instead of re-encoding pairs inside the loop
    embeddings = embedding_model.encode(sentences)

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    chunks = []
    current_chunk = [sentences[0]]

    for i in range(1, len(sentences)):
        similarity = cosine_similarity(embeddings[i - 1], embeddings[i])

        if similarity < threshold:
            # Topic shift detected: close the current chunk and start a new one
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])

    chunks.append(" ".join(current_chunk))  # don't drop the final chunk
    return chunks

3. Document-Structure-Aware Chunking (Best for Technical Docs)

Use document structure: headers, sections, code blocks. A function definition should stay together. A section with its subsections forms a natural unit.

Document Element | Chunking Strategy
Headers (H1, H2) | Use as chunk boundaries
Code blocks | Keep intact, include surrounding context
Tables | Extract as structured data + text description
Lists | Keep with preceding context
Paragraphs | Respect as minimum units
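For markdown-style sources, a header-driven splitter covers most of this table. The sketch below treats H1/H2 headings as boundaries and avoids splitting inside fenced code blocks; oversized sections would still need a secondary split, which is omitted here.

# Sketch: structure-aware chunking for markdown-style documents
import re

def chunk_by_headers(text):
    chunks, current = [], []
    in_code_block = False

    for line in text.splitlines():
        if line.strip().startswith("```"):
            in_code_block = not in_code_block        # never split inside a code fence
        is_header = bool(re.match(r"^#{1,2} ", line)) and not in_code_block

        if is_header and current:
            chunks.append("\n".join(current))        # close the previous section
            current = []
        current.append(line)

    if current:
        chunks.append("\n".join(current))
    return chunks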

The Overlap Strategy

Overlap between chunks helps preserve context across boundaries. We typically use 10-20% overlap:

Chunk 1: [-------- content --------][overlap]
Chunk 2:                     [overlap][-------- content --------]

But overlap isn't free - it increases storage and can cause duplicate retrievals. For large corpora, we use sliding window with deduplication at query time.
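A sliding-window splitter with roughly 15% overlap is only a few lines. This sketch counts whitespace-separated words as a stand-in for tokens; `document_text` is again a placeholder for extracted text.

# Sketch: sliding-window chunking with overlap (words as a proxy for tokens)
def sliding_window_chunks(words, chunk_size=400, overlap=60):
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break                                    # last window reached the end
    return chunks

chunks = sliding_window_chunks(document_text.split())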

Embedding Models: Converting Text to Vectors

Your embedding model determines how well semantic similarity maps to actual relevance. Choose wrong, and queries won't find matching documents even when they exist.

Model Comparison

Model | Dimensions | Strengths | Weaknesses | Cost
OpenAI text-embedding-3-large | 3072 | Excellent quality, multilingual | API dependency, cost at scale | ~$0.13/1M tokens
OpenAI text-embedding-3-small | 1536 | Good quality, faster | Slightly lower quality | ~$0.02/1M tokens
Cohere embed-v3 | 1024 | Strong multilingual | API dependency | ~$0.10/1M tokens
BGE-large-en-v1.5 | 1024 | Self-hosted, fast | English-focused | Self-hosted
E5-mistral-7b-instruct | 4096 | State-of-the-art quality | Heavy, slow | Self-hosted
GTE-Qwen2-7B-instruct | 3584 | Excellent quality | Resource intensive | Self-hosted

When to Fine-tune Your Embedding Model

Off-the-shelf models work well for general content. But for domain-specific vocabularies - legal, medical, technical - fine-tuning can improve retrieval by 15-30%.

Signs you need fine-tuning:

  • Industry-specific terminology isn't matching well
  • Acronyms in your domain have different meanings than common usage
  • Your documents have unique structural patterns
# Fine-tuning with sentence-transformers
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('BAAI/bge-base-en-v1.5')

# Prepare training pairs from your domain: (query, relevant passage)
train_examples = [
    InputExample(texts=["user query", "relevant document"]),
    # ... more examples
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=3)

Embedding Best Practices

Batch Processing: Never embed one document at a time in production. Batch for throughput.

# Bad: O(n) API calls
for doc in documents:
    embedding = model.encode(doc)

# Good: O(1) API call
embeddings = model.encode(documents, batch_size=32)

Normalize Vectors: Most similarity searches assume normalized vectors. Ensure your embeddings are L2-normalized.

Cache Aggressively: Embedding the same query twice is pure waste. Use a query cache with TTL.
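Both practices take only a few lines. Here's a sketch using numpy for L2 normalization and a simple in-process TTL cache for query embeddings; swap in Redis or similar for multi-instance deployments.

# Sketch: L2 normalization plus a TTL cache for query embeddings
import time
import numpy as np

def l2_normalize(vectors):
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)     # guard against division by zero

_query_cache = {}  # query -> (embedding, expiry timestamp)

def cached_query_embedding(query, model, ttl_seconds=3600):
    entry = _query_cache.get(query)
    if entry and entry[1] > time.time():
        return entry[0]                              # cache hit, skip the embedding call
    embedding = l2_normalize(model.encode([query]))[0]
    _query_cache[query] = (embedding, time.time() + ttl_seconds)
    return embedding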

Vector Databases: Storing and Searching at Scale

Your vector database handles the heavy lifting of similarity search. The choice matters enormously at scale.

Comparison Matrix

Database | Type | Max Scale | Filtering | Strengths
Pinecone | Managed | 1B+ vectors | Excellent | Easy to start, auto-scaling
Weaviate | Self-hosted/Cloud | 100M+ | Good | GraphQL API, hybrid search
Qdrant | Self-hosted/Cloud | 100M+ | Excellent | Performance, Rust-based
Milvus | Self-hosted | 1B+ | Good | Scale, GPU support
pgvector | PostgreSQL extension | 10M | Basic | Simplicity, existing infra
Chroma | Embedded | 1M | Basic | Development, prototyping

Indexing Strategies

The index type dramatically affects query performance and recall:

Index Type | Build Time | Query Time | Recall | Memory
Flat (brute force) | O(1) | O(n) | 100% | Low
IVF | Medium | Fast | 95-99% | Medium
HNSW | Slow | Very fast | 98-99% | High
PQ (Product Quantization) | Fast | Fast | 90-95% | Very low

For most production systems, HNSW provides the best balance. But at billions of vectors, you'll likely need IVF-PQ with careful tuning.

# Qdrant HNSW configuration example
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, HnswConfigDiff

client = QdrantClient("localhost", port=6333)

client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=1536,               # must match your embedding model's output dimension
        distance=Distance.COSINE
    ),
    hnsw_config=HnswConfigDiff(
        m=16,                    # connections per node: higher = better recall, more memory
        ef_construct=100         # construction-time accuracy/speed trade-off
    )
)
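Once the collection exists, upserts and filtered searches go through the same client. This is a sketch: `chunk_embedding` and `query_embedding` are placeholders for vectors from your embedding model, and the payload fields are just examples of the metadata discussed earlier.

# Sketch: upsert and metadata-filtered search against the same collection
from qdrant_client.models import PointStruct, Filter, FieldCondition, MatchValue

client.upsert(
    collection_name="documents",
    points=[PointStruct(
        id=1,
        vector=chunk_embedding,
        payload={"source": "sharepoint", "team": "finance"}
    )]
)

hits = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=10,
    query_filter=Filter(must=[
        FieldCondition(key="team", match=MatchValue(value="finance"))
    ])
)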

Retrieval Optimization: Finding the Right Documents

Query Transformation

Users don't ask questions the way documents are written. Query transformation bridges this gap.

Technique | How It Works | When to Use
Query expansion | Add synonyms and related terms | Technical domains with varied terminology
HyDE (Hypothetical Document Embeddings) | Generate a hypothetical answer, embed that | When queries are very different from documents
Query decomposition | Break complex queries into sub-queries | Multi-part questions
Query rewriting | LLM rewrites query for better retrieval | Conversational/ambiguous queries

# HyDE implementation
def hyde_retrieval(query, llm, retriever):
    # Generate hypothetical answer
    hypothetical = llm.generate(
        f"Write a short passage that would answer: {query}"
    )

    # Search using the hypothetical document
    results = retriever.search(hypothetical)
    return results
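Query decomposition follows the same pattern with the same generic `llm` and `retriever` interfaces: ask the model to split the question, retrieve per sub-query, then merge and deduplicate.

# Sketch: query decomposition for multi-part questions
def decompose_and_retrieve(query, llm, retriever, k_per_subquery=5):
    subqueries = llm.generate(
        f"Break this question into independent sub-questions, one per line: {query}"
    ).splitlines()

    seen, merged = set(), []
    for sub in (s.strip() for s in subqueries if s.strip()):
        for doc in retriever.search(sub, k=k_per_subquery):
            if doc.id not in seen:                   # deduplicate across sub-queries
                seen.add(doc.id)
                merged.append(doc)
    return merged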

Hybrid Search: Combining Vector + Keyword

Pure vector search misses exact matches. Pure keyword search misses semantic similarity. Hybrid combines both.

Approach | Vector Weight | Keyword Weight | Best For
Vector-first | 0.8 | 0.2 | General knowledge
Balanced | 0.5 | 0.5 | Mixed content
Keyword-first | 0.2 | 0.8 | Technical with exact terms
Reciprocal Rank Fusion | Dynamic | Dynamic | Unknown query distribution

def hybrid_search(query, vector_store, keyword_index, alpha=0.7):
    # Vector search
    vector_results = vector_store.search(query, k=20)

    # BM25 keyword search
    keyword_results = keyword_index.search(query, k=20)

    # Weighted Reciprocal Rank Fusion: alpha sets the vector/keyword balance
    scores = {}
    k = 60  # RRF constant

    for rank, doc in enumerate(vector_results, start=1):
        scores[doc.id] = scores.get(doc.id, 0) + alpha / (k + rank)

    for rank, doc in enumerate(keyword_results, start=1):
        scores[doc.id] = scores.get(doc.id, 0) + (1 - alpha) / (k + rank)

    # Highest fused score first
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

Reranking: Precision When It Matters

Initial retrieval casts a wide net. Reranking uses a more expensive model to precisely order the top candidates.

Reranking Models

Model | Approach | Latency | Quality
Cohere Rerank | Cross-encoder API | ~100ms | Excellent
BGE-reranker-large | Self-hosted cross-encoder | ~50ms | Very good
ColBERT | Late interaction | ~30ms | Good
LLM-based reranking | Prompt-based scoring | ~500ms | Excellent but slow
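For the self-hosted route, a cross-encoder reranker is a few lines with sentence-transformers. This sketch assumes each retrieved candidate exposes a `.text` attribute; adjust for your document schema.

# Sketch: cross-encoder reranking with a self-hosted BGE reranker
from sentence_transformers import CrossEncoder

reranker_model = CrossEncoder("BAAI/bge-reranker-large")

def rerank(query, candidates, top_k=5):
    # Score each (query, passage) pair jointly - more accurate than bi-encoder similarity
    pairs = [(query, doc.text) for doc in candidates]
    scores = reranker_model.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]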

When to Rerank

Reranking adds latency. Use it strategically:

def smart_retrieval(query, top_k=5):
    # Fast initial retrieval
    candidates = vector_search(query, k=100)

    # Rerank only if needed
    if needs_precision(query):
        candidates = reranker.rerank(query, candidates)

    return candidates[:top_k]

def needs_precision(query):
    # Rerank for specific, fact-seeking queries
    # Skip for broad, exploratory queries
    return query_classifier.predict(query) == "factual"

Production Considerations

Monitoring and Observability

You can't improve what you don't measure. Track these metrics:

Metric | What It Tells You | Target
Retrieval latency (p50, p99) | User experience | <200ms p99
Recall@k | Are relevant docs in results? | >95%
MRR (Mean Reciprocal Rank) | Is the right doc near the top? | >0.7
LLM attribution rate | Is LLM using retrieved context? | >80%
User feedback (thumbs up/down) | End-to-end quality | >90% positive
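Recall@k and MRR are straightforward to compute offline against a labeled test set of (query, relevant document id) pairs; this sketch assumes a single relevant document per query.

# Sketch: computing Recall@k and MRR over a labeled test set
def evaluate_retrieval(test_set, retriever, k=10):
    hits = 0
    reciprocal_ranks = []

    for query, relevant_id in test_set:
        result_ids = [doc.id for doc in retriever.search(query, k=k)]
        if relevant_id in result_ids:
            hits += 1
            reciprocal_ranks.append(1.0 / (result_ids.index(relevant_id) + 1))
        else:
            reciprocal_ranks.append(0.0)

    recall_at_k = hits / len(test_set)
    mrr = sum(reciprocal_ranks) / len(test_set)
    return recall_at_k, mrr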

Caching Strategies

RAG involves expensive operations. Cache aggressively:

Component | Cache Strategy | TTL
Query embeddings | LRU with semantic dedup | 1 hour
Search results | Query hash → results | 15 min
Document chunks | Permanent until doc changes | -
LLM responses | Query + context hash | 5 min

Handling Updates

Your knowledge base isn't static. Handle updates without rebuilding everything:

  1. Incremental indexing: Update only changed documents
  2. Version control: Track document versions, support rollback
  3. Cache invalidation: Bust caches when source documents change
  4. Consistency checks: Periodically verify vector store matches source of truth

Common Pitfalls and How to Avoid Them

Pitfall | Symptom | Solution
Chunking too small | Retrieved chunks lack context | Increase size, add overlap
Chunking too large | Irrelevant content retrieved | Decrease size, use structure
Ignoring metadata | Can't filter by date/source | Store and index metadata
Single retrieval strategy | Works for some queries, fails for others | Implement hybrid search
No reranking | Top result often wrong | Add cross-encoder reranker
Embedding model mismatch | Technical terms don't match | Fine-tune or use domain model
Ignoring document structure | Tables, code blocks garbled | Structure-aware processing

Real-World Performance Numbers

From our production deployments:

Metric | Before Optimization | After Optimization
Query latency (p50) | 850ms | 180ms
Query latency (p99) | 2.5s | 450ms
Retrieval accuracy | 72% | 94%
User satisfaction | 68% | 91%
Cost per query | $0.08 | $0.03

The biggest wins came from:

  1. Proper chunking strategy (not too small, not too large)
  2. Hybrid search with tuned weights
  3. Aggressive caching at multiple layers
  4. Reranking for precision-critical queries

Getting Started

If you're building your first RAG system:

  1. Start simple: Use a managed vector database, off-the-shelf embedding model, basic chunking
  2. Measure everything: Set up monitoring from day one
  3. Build a test set: Create query-document pairs to measure retrieval quality
  4. Iterate based on data: Don't over-engineer; optimize what measurements show is broken

If you're scaling an existing RAG system:

  1. Profile your pipeline: Find the actual bottlenecks
  2. Consider hybrid search: Pure vector often isn't enough
  3. Add reranking: It's often the highest-ROI optimization
  4. Invest in chunking: This is where most quality issues originate

RAG is not a solved problem. It's a set of trade-offs between latency, accuracy, and cost. The best systems are the ones that make these trade-offs consciously and measure the results.

We've helped dozens of organizations build RAG systems that actually work in production. If you're struggling with retrieval quality or scaling challenges, we'd be happy to share what we've learned.

Topics covered

RAG, retrieval augmented generation, vector databases, embedding models, document chunking, hybrid search, reranking, enterprise AI, semantic search, knowledge retrieval
