Enterprise RAG Systems: A Technical Deep Dive
A comprehensive technical guide to building production-ready Retrieval-Augmented Generation systems at scale. Learn document ingestion pipelines, chunking strategies, embedding models, retrieval optimization, reranking, and hybrid search from engineers who've deployed RAG in production.
Why RAG? The Problem We're Actually Solving
Let me be direct: LLMs are powerful but they have a fundamental problem. They only know what they were trained on, and that knowledge has a cutoff date. Ask GPT-4 about your company's Q3 earnings or your internal API documentation, and you'll get a polite "I don't have information about that" or worse, a confident hallucination.
RAG solves this by giving the model access to your data at inference time. Instead of hoping the model memorized the right information, you retrieve relevant documents and feed them directly into the prompt. Simple concept, but the devil is in the implementation details.
We've built RAG systems that handle millions of documents across dozens of enterprise deployments. Here's what we've learned about making them work at scale.
RAG isn't just about adding documents to a prompt. It's about building a retrieval system that consistently finds the right information, even when users ask questions in unexpected ways.
The RAG Pipeline: End-to-End Architecture
Before diving into components, let's understand how everything fits together. A production RAG system has two main phases:
Ingestion Phase (Offline)
Documents → Preprocessing → Chunking → Embedding → Vector Storage
Query Phase (Online)
User Query → Query Processing → Retrieval → Reranking → LLM Generation
| Phase | When It Runs | Latency Requirements | Primary Goal |
|---|---|---|---|
| Ingestion | Batch/Scheduled | Minutes to hours acceptable | Maximize recall potential |
| Query | Real-time | Sub-second | Precision + Speed |
The ingestion phase is where you prepare your knowledge base. The query phase is where you actually answer questions. Both need to be optimized, but they have very different constraints.
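In code terms, the split looks roughly like this. This is a minimal sketch only: process_document, embedding_model, vector_store, reranker, and llm are placeholders for the components covered in the rest of this guide.
def ingest(documents, process_document, embedding_model, vector_store):
    # Offline phase: preprocess, chunk, embed, and store
    for doc in documents:
        chunks = process_document(doc)
        vectors = embedding_model.encode(chunks)
        vector_store.upsert(list(zip(chunks, vectors)))

def answer(query, embedding_model, vector_store, reranker, llm, top_k=5):
    # Online phase: retrieve wide, rerank narrow, then generate
    query_vector = embedding_model.encode(query)
    candidates = vector_store.search(query_vector, k=50)
    context = reranker.rerank(query, candidates)[:top_k]
    return llm.generate(query, context=context)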
Document Ingestion: Getting Your Data RAG-Ready
Source Connectors: Where Your Data Lives
Enterprise data is scattered everywhere. We've built connectors for:
| Source Type | Examples | Challenges |
|---|---|---|
| Document Storage | SharePoint, Google Drive, S3 | Access control, incremental sync |
| Databases | PostgreSQL, MongoDB, Snowflake | Schema mapping, query complexity |
| SaaS Platforms | Salesforce, Zendesk, Confluence | API rate limits, pagination |
| Communication | Slack, Teams, Email | Privacy, threading context |
| Code Repositories | GitHub, GitLab | File relationships, version history |
The key insight: don't just dump everything into your vector store. Build smart connectors that:
- Respect access controls - If a user can't access a document in SharePoint, they shouldn't retrieve it via RAG
- Handle incremental updates - Re-processing millions of documents because one changed is wasteful
- Preserve metadata - Document creation date, author, and source are crucial for filtering and attribution
// Example: Smart document sync with change detection
const syncDocuments = async (source) => {
const lastSync = await db.getLastSyncTime(source.id);
const changes = await source.getChangesSince(lastSync);
for (const doc of changes.modified) {
const chunks = await processDocument(doc);
await vectorStore.upsert(chunks, {
sourceId: source.id,
documentId: doc.id,
permissions: doc.accessControl
});
}
for (const docId of changes.deleted) {
await vectorStore.deleteByDocumentId(docId);
}
};
Document Processing: Handling Real-World Formats
PDFs are the bane of every RAG engineer's existence. They look simple but contain nightmares: multi-column layouts, embedded tables, scanned images, headers and footers that repeat on every page.
Here's our processing hierarchy:
| Document Type | Processing Approach | Quality Notes |
|---|---|---|
| Markdown/Plain Text | Direct extraction | Excellent quality |
| HTML/Web Pages | DOM parsing + cleaning | Good, watch for boilerplate |
| Word Documents | python-docx or similar | Good, preserve structure |
| PDFs (digital) | PyMuPDF + layout analysis | Varies wildly |
| PDFs (scanned) | OCR + layout analysis | Lower quality, verify accuracy |
| Spreadsheets | Cell-aware extraction | Requires semantic understanding |
| Images/Diagrams | Vision models + OCR | Emerging capability |
For PDFs specifically, we've found that layout-aware extraction makes a huge difference:
# Bad: Simple text extraction loses structure
text = pdf_page.get_text() # "Revenue Q1 Q2 Q3 1000 1200 1500"
# Better: Layout-aware extraction preserves tables
blocks = pdf_page.get_text("dict")["blocks"]
tables = identify_tables(blocks)  # identify_tables: your own layout heuristic or a table-detection library
# Results in structured data you can actually use
Chunking Strategies: The Heart of Good Retrieval
This is where most RAG implementations fail. Bad chunking leads to poor retrieval, and no amount of fancy reranking can fix fundamentally broken chunks.
Why Chunk Size Matters
Chunks that are too small lack context. Chunks that are too large dilute relevance and waste precious context window space.
| Chunk Size | Pros | Cons | Best For |
|---|---|---|---|
| Small (100-200 tokens) | High precision | Loses context | FAQ, definitions |
| Medium (300-500 tokens) | Balanced | Jack of all trades, master of none | General knowledge bases |
| Large (500-1000 tokens) | Rich context | Lower precision, expensive | Technical documentation |
Chunking Approaches We Actually Use
1. Recursive Character Splitting (Baseline)
The simplest approach: split on paragraphs, then sentences, then characters if needed. Works surprisingly well for homogeneous documents.
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=500,
chunk_overlap=50,
separators=["\n\n", "\n", ". ", " ", ""]
)
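Usage is a single call; document_text here is assumed to hold the extracted text from the previous step.
chunks = splitter.split_text(document_text)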
2. Semantic Chunking (Better for Diverse Content)
Instead of fixed sizes, detect topic shifts using embeddings. When the semantic similarity between consecutive sentences drops significantly, start a new chunk.
from sentence_transformers import util

def semantic_chunking(sentences, embedding_model, threshold=0.5):
    # Encode every sentence once, then start a new chunk wherever the
    # similarity between consecutive sentences drops below the threshold
    embeddings = embedding_model.encode(sentences)
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = util.cos_sim(embeddings[i - 1], embeddings[i]).item()
        if similarity < threshold:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    chunks.append(" ".join(current_chunk))
    return chunks
3. Document-Structure-Aware Chunking (Best for Technical Docs)
Use document structure: headers, sections, code blocks. A function definition should stay together. A section with its subsections forms a natural unit.
| Document Element | Chunking Strategy |
|---|---|
| Headers (H1, H2) | Use as chunk boundaries |
| Code blocks | Keep intact, include surrounding context |
| Tables | Extract as structured data + text description |
| Lists | Keep with preceding context |
| Paragraphs | Respect as minimum units |
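A minimal sketch of the header-based variant for markdown-like sources. The regex, the word-count proxy for token counts, and the paragraph fallback are all simplifications; real documents also need table and code-block handling.
import re

def structure_aware_chunks(markdown_text, max_words=400):
    # Split before every H1/H2 header so each section stays intact
    sections = re.split(r"\n(?=#{1,2} )", markdown_text)
    chunks = []
    for section in sections:
        lines = section.splitlines()
        header = lines[0] if lines else ""
        if len(section.split()) <= max_words:
            chunks.append(section.strip())
        else:
            # Oversized section: fall back to paragraphs, but prefix the
            # header so every chunk keeps its structural context
            for para in section.split("\n\n"):
                para = para.strip()
                if para and para != header:
                    chunks.append(f"{header}\n{para}")
    return chunks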
The Overlap Strategy
Overlap between chunks helps preserve context across boundaries. We typically use 10-20% overlap:
Chunk 1: [-------- content --------][overlap]
Chunk 2: [overlap][-------- content --------]
But overlap isn't free - it increases storage and can cause duplicate retrievals. For large corpora, we use sliding window with deduplication at query time.
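For the sliding-window case, a minimal sketch over a pre-tokenized document; tokens is assumed to be a list of token strings, and 75 of 500 tokens gives 15% overlap.
def sliding_window_chunks(tokens, chunk_size=500, overlap=75):
    # Each window starts chunk_size - overlap tokens after the previous one,
    # so content near a boundary appears in two consecutive chunks
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]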
Embedding Models: Converting Text to Vectors
Your embedding model determines how well semantic similarity maps to actual relevance. Choose wrong, and queries won't find matching documents even when they exist.
Model Comparison
| Model | Dimensions | Strengths | Weaknesses | Cost |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | Excellent quality, multilingual | API dependency, cost at scale | ~$0.13/1M tokens |
| OpenAI text-embedding-3-small | 1536 | Good quality, faster | Slightly lower quality | ~$0.02/1M tokens |
| Cohere embed-v3 | 1024 | Strong multilingual | API dependency | ~$0.10/1M tokens |
| BGE-large-en-v1.5 | 1024 | Self-hosted, fast | English-focused | Self-hosted |
| E5-mistral-7b-instruct | 4096 | State-of-the-art quality | Heavy, slow | Self-hosted |
| GTE-Qwen2-7B-instruct | 3584 | Excellent quality | Resource intensive | Self-hosted |
When to Fine-tune Your Embedding Model
Off-the-shelf models work well for general content. But for domain-specific vocabularies - legal, medical, technical - fine-tuning can improve retrieval by 15-30%.
Signs you need fine-tuning:
- Industry-specific terminology isn't matching well
- Acronyms in your domain have different meanings than common usage
- Your documents have unique structural patterns
# Fine-tuning with sentence-transformers
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('BAAI/bge-base-en-v1.5')
# Prepare training pairs from your domain
train_examples = [
    InputExample(texts=["user query", "relevant document"]),
    # ... more examples
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=3)
Embedding Best Practices
Batch Processing: Never embed one document at a time in production. Batch for throughput.
# Bad: one encode call per document
for doc in documents:
    embedding = model.encode(doc)
# Good: a single batched call
embeddings = model.encode(documents, batch_size=32)
Normalize Vectors: Most similarity searches assume normalized vectors. Ensure your embeddings are L2-normalized.
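A minimal normalization sketch with NumPy (sentence-transformers can also do this at encode time via its normalize_embeddings flag):
import numpy as np

def l2_normalize(embeddings):
    # After this, cosine similarity reduces to a plain dot product
    vectors = np.asarray(embeddings, dtype=np.float32)
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)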
Cache Aggressively: Embedding the same query twice is pure waste. Use a query cache with TTL.
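A minimal TTL cache sketch; embed_fn stands in for whatever embedding call you use, and a production system would back this with Redis or an LRU with a size bound.
import hashlib
import time

class QueryEmbeddingCache:
    def __init__(self, embed_fn, ttl_seconds=3600):
        self.embed_fn = embed_fn
        self.ttl = ttl_seconds
        self.store = {}

    def get(self, query):
        # Normalize the query text so trivially different queries share a key
        key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
        hit = self.store.get(key)
        if hit and time.time() - hit[0] < self.ttl:
            return hit[1]
        vector = self.embed_fn(query)
        self.store[key] = (time.time(), vector)
        return vector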
Vector Databases: Storing and Searching at Scale
Your vector database handles the heavy lifting of similarity search. The choice matters enormously at scale.
Comparison Matrix
| Database | Type | Max Scale | Filtering | Strengths |
|---|---|---|---|---|
| Pinecone | Managed | 1B+ vectors | Excellent | Easy to start, auto-scaling |
| Weaviate | Self-hosted/Cloud | 100M+ | Good | GraphQL API, hybrid search |
| Qdrant | Self-hosted/Cloud | 100M+ | Excellent | Performance, Rust-based |
| Milvus | Self-hosted | 1B+ | Good | Scale, GPU support |
| pgvector | PostgreSQL extension | 10M | Basic | Simplicity, existing infra |
| Chroma | Embedded | 1M | Basic | Development, prototyping |
Indexing Strategies
The index type dramatically affects query performance and recall:
| Index Type | Build Time | Query Time | Recall | Memory |
|---|---|---|---|---|
| Flat (brute force) | None (no index to build) | O(n) scan | 100% | Low |
| IVF | Medium | Fast | 95-99% | Medium |
| HNSW | Slow | Very fast | 98-99% | High |
| PQ (Product Quantization) | Fast | Fast | 90-95% | Very low |
For most production systems, HNSW provides the best balance. But at billions of vectors, you'll likely need IVF-PQ with careful tuning.
# Qdrant HNSW configuration example
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, HnswConfigDiff

client = QdrantClient("localhost", port=6333)
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE
    ),
    hnsw_config=HnswConfigDiff(
        m=16,             # Connections per node
        ef_construct=100  # Construction-time accuracy/speed trade-off
    )
)
Retrieval Optimization: Finding the Right Documents
Query Transformation
Users don't ask questions the way documents are written. Query transformation bridges this gap.
| Technique | How It Works | When to Use |
|---|---|---|
| Query expansion | Add synonyms and related terms | Technical domains with varied terminology |
| HyDE (Hypothetical Document Embeddings) | Generate a hypothetical answer, embed that | When queries are very different from documents |
| Query decomposition | Break complex queries into sub-queries | Multi-part questions |
| Query rewriting | LLM rewrites query for better retrieval | Conversational/ambiguous queries |
# HyDE implementation
def hyde_retrieval(query, llm, retriever):
# Generate hypothetical answer
hypothetical = llm.generate(
f"Write a short passage that would answer: {query}"
)
# Search using the hypothetical document
results = retriever.search(hypothetical)
return results
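Query rewriting from the table above follows the same pattern; a minimal sketch, reusing the hypothetical llm interface from the HyDE example:
def rewrite_query(query, chat_history, llm):
    # Turn a conversational follow-up ("what about the premium tier?")
    # into a standalone query before retrieval
    prompt = (
        "Rewrite the final user message as a standalone search query.\n"
        f"Conversation so far:\n{chat_history}\n"
        f"Final message: {query}\n"
        "Standalone query:"
    )
    return llm.generate(prompt)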
Hybrid Search: Combining Vector + Keyword
Pure vector search misses exact matches. Pure keyword search misses semantic similarity. Hybrid combines both.
| Approach | Vector Weight | Keyword Weight | Best For |
|---|---|---|---|
| Vector-first | 0.8 | 0.2 | General knowledge |
| Balanced | 0.5 | 0.5 | Mixed content |
| Keyword-first | 0.2 | 0.8 | Technical with exact terms |
| Reciprocal Rank Fusion | Dynamic | Dynamic | Unknown query distribution |
def hybrid_search(query, vector_store, keyword_index, alpha=0.7):
    # Vector search
    vector_results = vector_store.search(query, k=20)
    # BM25 keyword search
    keyword_results = keyword_index.search(query, k=20)
    # Weighted Reciprocal Rank Fusion over the two ranked lists
    scores = {}
    k = 60  # RRF constant
    for rank, doc in enumerate(vector_results, start=1):
        scores[doc.id] = scores.get(doc.id, 0) + alpha / (k + rank)
    for rank, doc in enumerate(keyword_results, start=1):
        scores[doc.id] = scores.get(doc.id, 0) + (1 - alpha) / (k + rank)
    # Highest fused score first; returns (doc_id, score) pairs
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
Reranking: Precision When It Matters
Initial retrieval casts a wide net. Reranking uses a more expensive model to precisely order the top candidates.
Reranking Models
| Model | Approach | Latency | Quality |
|---|---|---|---|
| Cohere Rerank | Cross-encoder API | ~100ms | Excellent |
| BGE-reranker-large | Self-hosted cross-encoder | ~50ms | Very good |
| ColBERT | Late interaction | ~30ms | Good |
| LLM-based reranking | Prompt-based scoring | ~500ms | Excellent but slow |
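A minimal self-hosted setup with the BGE cross-encoder via sentence-transformers; candidates having a .text attribute is an assumption about how your retrieval results are shaped.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-large")

def rerank(query, candidates, top_k=5):
    # Score each (query, passage) pair with the cross-encoder,
    # then keep the top_k highest-scoring candidates
    scores = reranker.predict([(query, c.text) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]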
When to Rerank
Reranking adds latency. Use it strategically:
# vector_search, reranker, and query_classifier below are stand-ins for
# your own retrieval, reranking, and query-classification components.
def smart_retrieval(query, top_k=5):
    # Fast initial retrieval over a wide candidate set
    candidates = vector_search(query, k=100)
    # Rerank only when the query benefits from the extra latency
    if needs_precision(query):
        candidates = reranker.rerank(query, candidates)
    return candidates[:top_k]

def needs_precision(query):
    # Rerank for specific, fact-seeking queries;
    # skip for broad, exploratory queries
    return query_classifier.predict(query) == "factual"
Production Considerations
Monitoring and Observability
You can't improve what you don't measure. Track these metrics:
| Metric | What It Tells You | Target |
|---|---|---|
| Retrieval latency (p50, p99) | User experience | <200ms p99 |
| Recall@k | Are relevant docs in results? | >95% |
| MRR (Mean Reciprocal Rank) | Is the right doc near the top? | >0.7 |
| LLM attribution rate | Is LLM using retrieved context? | >80% |
| User feedback (thumbs up/down) | End-to-end quality | >90% positive |
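Recall@k and MRR are straightforward to compute offline against a labeled test set; a minimal sketch, assuming retriever.search returns documents with an .id and each query has one known relevant document.
def evaluate_retrieval(test_set, retriever, k=5):
    # test_set: list of (query, relevant_doc_id) pairs you've labeled
    hits = 0
    reciprocal_ranks = []
    for query, relevant_id in test_set:
        retrieved_ids = [doc.id for doc in retriever.search(query, k=k)]
        if relevant_id in retrieved_ids:
            hits += 1
            reciprocal_ranks.append(1 / (retrieved_ids.index(relevant_id) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return {
        "recall_at_k": hits / len(test_set),
        "mrr": sum(reciprocal_ranks) / len(test_set),
    }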
Caching Strategies
RAG involves expensive operations. Cache aggressively:
| Component | Cache Strategy | TTL |
|---|---|---|
| Query embeddings | LRU with semantic dedup | 1 hour |
| Search results | Query hash → results | 15 min |
| Document chunks | Permanent until doc changes | - |
| LLM responses | Query + context hash | 5 min |
Handling Updates
Your knowledge base isn't static. Handle updates without rebuilding everything:
- Incremental indexing: Update only changed documents
- Version control: Track document versions, support rollback
- Cache invalidation: Bust caches when source documents change
- Consistency checks: Periodically verify vector store matches source of truth
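The consistency check can be as simple as a set comparison; a sketch, with list_document_ids as a hypothetical method on both the source connector and the vector store.
def check_consistency(source, vector_store):
    source_ids = set(source.list_document_ids())
    indexed_ids = set(vector_store.list_document_ids())
    return {
        "missing_from_index": source_ids - indexed_ids,  # needs re-ingestion
        "stale_in_index": indexed_ids - source_ids,      # needs deletion
    }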
Common Pitfalls and How to Avoid Them
| Pitfall | Symptom | Solution |
|---|---|---|
| Chunking too small | Retrieved chunks lack context | Increase size, add overlap |
| Chunking too large | Irrelevant content retrieved | Decrease size, use structure |
| Ignoring metadata | Can't filter by date/source | Store and index metadata |
| Single retrieval strategy | Works for some queries, fails for others | Implement hybrid search |
| No reranking | Top result often wrong | Add cross-encoder reranker |
| Embedding model mismatch | Technical terms don't match | Fine-tune or use domain model |
| Ignoring document structure | Tables, code blocks garbled | Structure-aware processing |
Real-World Performance Numbers
From our production deployments:
| Metric | Before Optimization | After Optimization |
|---|---|---|
| Query latency (p50) | 850ms | 180ms |
| Query latency (p99) | 2.5s | 450ms |
| Retrieval accuracy | 72% | 94% |
| User satisfaction | 68% | 91% |
| Cost per query | $0.08 | $0.03 |
The biggest wins came from:
- Proper chunking strategy (not too small, not too large)
- Hybrid search with tuned weights
- Aggressive caching at multiple layers
- Reranking for precision-critical queries
Getting Started
If you're building your first RAG system:
- Start simple: Use a managed vector database, off-the-shelf embedding model, basic chunking
- Measure everything: Set up monitoring from day one
- Build a test set: Create query-document pairs to measure retrieval quality
- Iterate based on data: Don't over-engineer; optimize what measurements show is broken
If you're scaling an existing RAG system:
- Profile your pipeline: Find the actual bottlenecks
- Consider hybrid search: Pure vector often isn't enough
- Add reranking: It's often the highest-ROI optimization
- Invest in chunking: This is where most quality issues originate
RAG is not a solved problem. It's a set of trade-offs between latency, accuracy, and cost. The best systems are the ones that make these trade-offs consciously and measure the results.
We've helped dozens of organizations build RAG systems that actually work in production. If you're struggling with retrieval quality or scaling challenges, we'd be happy to share what we've learned.