Enterprise RAG Systems: A Technical Deep Dive
A comprehensive technical guide to building production-ready Retrieval-Augmented Generation systems at scale. Learn document ingestion pipelines, chunking strategies, embedding models, retrieval optimization, reranking, and hybrid search from engineers who've deployed RAG in production.
Why RAG? The Problem We're Actually Solving
Let me be direct: LLMs are powerful but they have a fundamental problem. They only know what they were trained on, and that knowledge has a cutoff date. Ask GPT-4 about your company's Q3 earnings or your internal API documentation, and you'll get a polite "I don't have information about that" or worse, a confident hallucination.
RAG solves this by giving the model access to your data at inference time. Instead of hoping the model memorized the right information, you retrieve relevant documents and feed them directly into the prompt. Simple concept, but the devil is in the implementation details.
We've built RAG systems that handle millions of documents across dozens of enterprise deployments. Here's what we've learned about making them work at scale.
RAG isn't just about adding documents to a prompt. It's about building a retrieval system that consistently finds the right information, even when users ask questions in unexpected ways.
The RAG Pipeline: End-to-End Architecture
Before diving into components, let's understand how everything fits together. A production RAG system has two main phases:
Ingestion Phase (Offline)
Documents → Preprocessing → Chunking → Embedding → Vector Storage
Query Phase (Online)
User Query → Query Processing → Retrieval → Reranking → LLM Generation
| Phase | When It Runs | Latency Requirements | Primary Goal |
|---|---|---|---|
| Ingestion | Batch/Scheduled | Minutes to hours acceptable | Maximize recall potential |
| Query | Real-time | Sub-second | Precision + Speed |
The ingestion phase is where you prepare your knowledge base. The query phase is where you actually answer questions. Both need to be optimized, but they have very different constraints.
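In code terms, the split looks roughly like this. This is a minimal sketch only: process_document, embedding_model, vector_store, reranker, and llm are placeholders for the components covered in the rest of this guide.
def ingest(documents, process_document, embedding_model, vector_store):
    # Offline phase: preprocess, chunk, embed, and store
    for doc in documents:
        chunks = process_document(doc)
        vectors = embedding_model.encode(chunks)
        vector_store.upsert(list(zip(chunks, vectors)))

def answer(query, embedding_model, vector_store, reranker, llm, top_k=5):
    # Online phase: retrieve wide, rerank narrow, then generate
    query_vector = embedding_model.encode(query)
    candidates = vector_store.search(query_vector, k=50)
    context = reranker.rerank(query, candidates)[:top_k]
    return llm.generate(query, context=context)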
Document Ingestion: Getting Your Data RAG-Ready
Source Connectors: Where Your Data Lives
Enterprise data is scattered everywhere. We've built connectors for:
| Source Type | Examples | Challenges |
|---|---|---|
| Document Storage | SharePoint, Google Drive, S3 | Access control, incremental sync |
| Databases | PostgreSQL, MongoDB, Snowflake | Schema mapping, query complexity |
| SaaS Platforms | Salesforce, Zendesk, Confluence | API rate limits, pagination |
| Communication | Slack, Teams, Email | Privacy, threading context |
| Code Repositories | GitHub, GitLab | File relationships, version history |
The key insight: don't just dump everything into your vector store. Build smart connectors that:
- Respect access controls - If a user can't access a document in SharePoint, they shouldn't retrieve it via RAG
- Handle incremental updates - Re-processing millions of documents because one changed is wasteful
- Preserve metadata - Document creation date, author, and source are crucial for filtering and attribution
// Example: Smart document sync with change detection
const syncDocuments = async (source) => {
const lastSync = await db.getLastSyncTime(source.id);
const changes = await source.getChangesSince(lastSync);
for (const doc of changes.modified) {
const chunks = await processDocument(doc);
await vectorStore.upsert(chunks, {
sourceId: source.id,
documentId: doc.id,
permissions: doc.accessControl
});
}
for (const docId of changes.deleted) {
await vectorStore.deleteByDocumentId(docId);
}
};
Document Processing: Handling Real-World Formats
PDFs are the bane of every RAG engineer's existence. They look simple but contain nightmares: multi-column layouts, embedded tables, scanned images, headers and footers that repeat on every page.
Here's our processing hierarchy:
| Document Type | Processing Approach | Quality Notes |
|---|---|---|
| Markdown/Plain Text | Direct extraction | Excellent quality |
| HTML/Web Pages | DOM parsing + cleaning | Good, watch for boilerplate |
| Word Documents | python-docx or similar | Good, preserve structure |
| PDFs (digital) | PyMuPDF + layout analysis | Varies wildly |
| PDFs (scanned) | OCR + layout analysis | Lower quality, verify accuracy |
| Spreadsheets | Cell-aware extraction | Requires semantic understanding |
| Images/Diagrams | Vision models + OCR | Emerging capability |
For PDFs specifically, we've found that layout-aware extraction makes a huge difference:
# Bad: Simple text extraction loses structure
text = pdf_page.get_text() # "Revenue Q1 Q2 Q3 1000 1200 1500"
# Better: Layout-aware extraction preserves tables
blocks = pdf_page.get_text("dict")["blocks"]
tables = identify_tables(blocks)  # identify_tables: your own layout heuristic or a table-detection library
# Results in structured data you can actually use
Chunking Strategies: The Heart of Good Retrieval
This is where most RAG implementations fail. Bad chunking leads to poor retrieval, and no amount of fancy reranking can fix fundamentally broken chunks.
Why Chunk Size Matters
Chunks that are too small lack context. Chunks that are too large dilute relevance and waste precious context window space.
| Chunk Size | Pros | Cons | Best For |
|---|---|---|---|
| Small (100-200 tokens) | High precision | Loses context | FAQ, definitions |
| Medium (300-500 tokens) | Balanced | Jack of all trades, master of none | General knowledge bases |
| Large (500-1000 tokens) | Rich context | Lower precision, expensive | Technical documentation |
Chunking Approaches We Actually Use
1. Recursive Character Splitting (Baseline)
The simplest approach: split on paragraphs, then sentences, then characters if needed. Works surprisingly well for homogeneous documents.
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=500,
chunk_overlap=50,
separators=["\n\n", "\n", ". ", " ", ""]
)
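Usage is a single call; document_text here is assumed to hold the extracted text from the previous step.
chunks = splitter.split_text(document_text)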
2. Semantic Chunking (Better for Diverse Content)
Instead of fixed sizes, detect topic shifts using embeddings. When the semantic similarity between consecutive sentences drops significantly, start a new chunk.
from sentence_transformers import util

def semantic_chunking(sentences, embedding_model, threshold=0.5):
    # Encode every sentence once, then start a new chunk wherever the
    # similarity between consecutive sentences drops below the threshold
    embeddings = embedding_model.encode(sentences)
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = util.cos_sim(embeddings[i - 1], embeddings[i]).item()
        if similarity < threshold:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    chunks.append(" ".join(current_chunk))
    return chunks
3. Document-Structure-Aware Chunking (Best for Technical Docs)
Use document structure: headers, sections, code blocks. A function definition should stay together. A section with its subsections forms a natural unit.
| Document Element | Chunking Strategy |
|---|---|
| Headers (H1, H2) | Use as chunk boundaries |
| Code blocks | Keep intact, include surrounding context |
| Tables | Extract as structured data + text description |
| Lists | Keep with preceding context |
| Paragraphs | Respect as minimum units |
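A minimal sketch of the header-based variant for markdown-like sources. The regex, the word-count proxy for token counts, and the paragraph fallback are all simplifications; real documents also need table and code-block handling.
import re

def structure_aware_chunks(markdown_text, max_words=400):
    # Split before every H1/H2 header so each section stays intact
    sections = re.split(r"\n(?=#{1,2} )", markdown_text)
    chunks = []
    for section in sections:
        lines = section.splitlines()
        header = lines[0] if lines else ""
        if len(section.split()) <= max_words:
            chunks.append(section.strip())
        else:
            # Oversized section: fall back to paragraphs, but prefix the
            # header so every chunk keeps its structural context
            for para in section.split("\n\n"):
                para = para.strip()
                if para and para != header:
                    chunks.append(f"{header}\n{para}")
    return chunks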
The Overlap Strategy
Overlap between chunks helps preserve context across boundaries. We typically use 10-20% overlap:
Chunk 1: [-------- content --------][overlap]
Chunk 2: [overlap][-------- content --------]
But overlap isn't free - it increases storage and can cause duplicate retrievals. For large corpora, we use sliding window with deduplication at query time.
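For the sliding-window case, a minimal sketch over a pre-tokenized document; tokens is assumed to be a list of token strings, and 75 of 500 tokens gives 15% overlap.
def sliding_window_chunks(tokens, chunk_size=500, overlap=75):
    # Each window starts chunk_size - overlap tokens after the previous one,
    # so content near a boundary appears in two consecutive chunks
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]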
Embedding Models: Converting Text to Vectors
Your embedding model determines how well semantic similarity maps to actual relevance. Choose wrong, and queries won't find matching documents even when they exist.
Model Comparison
| Model | Dimensions | Strengths | Weaknesses | Cost |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | Excellent quality, multilingual | API dependency, cost at scale | ~$0.13/1M tokens |
| OpenAI text-embedding-3-small | 1536 | Good quality, faster | Slightly lower quality | ~$0.02/1M tokens |
| Cohere embed-v3 | 1024 | Strong multilingual | API dependency | ~$0.10/1M tokens |
| BGE-large-en-v1.5 | 1024 | Self-hosted, fast | English-focused | Self-hosted |
| E5-mistral-7b-instruct | 4096 | State-of-the-art quality | Heavy, slow | Self-hosted |
| GTE-Qwen2-7B-instruct | 3584 | Excellent quality | Resource intensive | Self-hosted |
When to Fine-tune Your Embedding Model
Off-the-shelf models work well for general content. But for domain-specific vocabularies - legal, medical, technical - fine-tuning can improve retrieval by 15-30%.
Signs you need fine-tuning:
- Industry-specific terminology isn't matching well
- Acronyms in your domain have different meanings than common usage
- Your documents have unique structural patterns
# Fine-tuning with sentence-transformers
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('BAAI/bge-base-en-v1.5')
# Prepare training pairs from your domain
train_examples = [
    InputExample(texts=["user query", "relevant document"]),
    # ... more examples
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=3)
Embedding Best Practices
Batch Processing: Never embed one document at a time in production. Batch for throughput.
# Bad: one encode call per document
for doc in documents:
    embedding = model.encode(doc)
# Good: a single batched call
embeddings = model.encode(documents, batch_size=32)
Normalize Vectors: Most similarity searches assume normalized vectors. Ensure your embeddings are L2-normalized.
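A minimal normalization sketch with NumPy (sentence-transformers can also do this at encode time via its normalize_embeddings flag):
import numpy as np

def l2_normalize(embeddings):
    # After this, cosine similarity reduces to a plain dot product
    vectors = np.asarray(embeddings, dtype=np.float32)
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)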
Cache Aggressively: Embedding the same query twice is pure waste. Use a query cache with TTL.
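A minimal TTL cache sketch; embed_fn stands in for whatever embedding call you use, and a production system would back this with Redis or an LRU with a size bound.
import hashlib
import time

class QueryEmbeddingCache:
    def __init__(self, embed_fn, ttl_seconds=3600):
        self.embed_fn = embed_fn
        self.ttl = ttl_seconds
        self.store = {}

    def get(self, query):
        # Normalize the query text so trivially different queries share a key
        key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
        hit = self.store.get(key)
        if hit and time.time() - hit[0] < self.ttl:
            return hit[1]
        vector = self.embed_fn(query)
        self.store[key] = (time.time(), vector)
        return vector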
Vector Databases: Storing and Searching at Scale
Your vector database handles the heavy lifting of similarity search. The choice matters enormously at scale.
Comparison Matrix
| Database | Type | Max Scale | Filtering | Strengths |
|---|---|---|---|---|
| Pinecone | Managed | 1B+ vectors | Excellent | Easy to start, auto-scaling |
| Weaviate | Self-hosted/Cloud | 100M+ | Good | GraphQL API, hybrid search |
| Qdrant | Self-hosted/Cloud | 100M+ | Excellent | Performance, Rust-based |
| Milvus | Self-hosted | 1B+ | Good | Scale, GPU support |
| pgvector | PostgreSQL extension | 10M | Basic | Simplicity, existing infra |
| Chroma | Embedded | 1M | Basic | Development, prototyping |
Indexing Strategies
The index type dramatically affects query performance and recall:
| Index Type | Build Time | Query Time | Recall | Memory |
|---|---|---|---|---|
| Flat (brute force) | None (no index to build) | O(n) scan | 100% | Low |
| IVF | Medium | Fast | 95-99% | Medium |
| HNSW | Slow | Very fast | 98-99% | High |
| PQ (Product Quantization) | Fast | Fast | 90-95% | Very low |
For most production systems, HNSW provides the best balance. But at billions of vectors, you'll likely need IVF-PQ with careful tuning.
# Qdrant HNSW configuration example
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, HnswConfigDiff

client = QdrantClient("localhost", port=6333)
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE
    ),
    hnsw_config=HnswConfigDiff(
        m=16,             # Connections per node
        ef_construct=100  # Construction-time accuracy/speed trade-off
    )
)
Retrieval Optimization: Finding the Right Documents
Query Transformation
Users don't ask questions the way documents are written. Query transformation bridges this gap.
| Technique | How It Works | When to Use |
|---|---|---|
| Query expansion | Add synonyms and related terms | Technical domains with varied terminology |
| HyDE (Hypothetical Document Embeddings) | Generate a hypothetical answer, embed that | When queries are very different from documents |
| Query decomposition | Break complex queries into sub-queries | Multi-part questions |
| Query rewriting | LLM rewrites query for better retrieval | Conversational/ambiguous queries |
# HyDE implementation
def hyde_retrieval(query, llm, retriever):
# Generate hypothetical answer
hypothetical = llm.generate(
f"Write a short passage that would answer: {query}"
)
# Search using the hypothetical document
results = retriever.search(hypothetical)
return results
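Query rewriting from the table above follows the same pattern; a minimal sketch, reusing the hypothetical llm interface from the HyDE example:
def rewrite_query(query, chat_history, llm):
    # Turn a conversational follow-up ("what about the premium tier?")
    # into a standalone query before retrieval
    prompt = (
        "Rewrite the final user message as a standalone search query.\n"
        f"Conversation so far:\n{chat_history}\n"
        f"Final message: {query}\n"
        "Standalone query:"
    )
    return llm.generate(prompt)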
Hybrid Search: Combining Vector + Keyword
Pure vector search misses exact matches. Pure keyword search misses semantic similarity. Hybrid combines both.
| Approach | Vector Weight | Keyword Weight | Best For |
|---|---|---|---|
| Vector-first | 0.8 | 0.2 | General knowledge |
| Balanced | 0.5 | 0.5 | Mixed content |
| Keyword-first | 0.2 | 0.8 | Technical with exact terms |
| Reciprocal Rank Fusion | Dynamic | Dynamic | Unknown query distribution |
def hybrid_search(query, vector_store, keyword_index, alpha=0.7):
    # Vector search
    vector_results = vector_store.search(query, k=20)
    # BM25 keyword search
    keyword_results = keyword_index.search(query, k=20)
    # Weighted Reciprocal Rank Fusion over the two ranked lists
    scores = {}
    k = 60  # RRF constant
    for rank, doc in enumerate(vector_results, start=1):
        scores[doc.id] = scores.get(doc.id, 0) + alpha / (k + rank)
    for rank, doc in enumerate(keyword_results, start=1):
        scores[doc.id] = scores.get(doc.id, 0) + (1 - alpha) / (k + rank)
    # Highest fused score first; returns (doc_id, score) pairs
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
Reranking: Precision When It Matters
Initial retrieval casts a wide net. Reranking uses a more expensive model to precisely order the top candidates.
Reranking Models
| Model | Approach | Latency | Quality |
|---|---|---|---|
| Cohere Rerank | Cross-encoder API | ~100ms | Excellent |
| BGE-reranker-large | Self-hosted cross-encoder | ~50ms | Very good |
| ColBERT | Late interaction | ~30ms | Good |
| LLM-based reranking | Prompt-based scoring | ~500ms | Excellent but slow |
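A minimal self-hosted setup with the BGE cross-encoder via sentence-transformers; candidates having a .text attribute is an assumption about how your retrieval results are shaped.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-large")

def rerank(query, candidates, top_k=5):
    # Score each (query, passage) pair with the cross-encoder,
    # then keep the top_k highest-scoring candidates
    scores = reranker.predict([(query, c.text) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]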
When to Rerank
Reranking adds latency. Use it strategically:
# vector_search, reranker, and query_classifier below are stand-ins for
# your own retrieval, reranking, and query-classification components.
def smart_retrieval(query, top_k=5):
    # Fast initial retrieval over a wide candidate set
    candidates = vector_search(query, k=100)
    # Rerank only when the query benefits from the extra latency
    if needs_precision(query):
        candidates = reranker.rerank(query, candidates)
    return candidates[:top_k]

def needs_precision(query):
    # Rerank for specific, fact-seeking queries;
    # skip for broad, exploratory queries
    return query_classifier.predict(query) == "factual"
Production Considerations
Monitoring and Observability
You can't improve what you don't measure. Track these metrics:
| Metric | What It Tells You | Target |
|---|---|---|
| Retrieval latency (p50, p99) | User experience | <200ms p99 |
| Recall@k | Are relevant docs in results? | >95% |
| MRR (Mean Reciprocal Rank) | Is the right doc near the top? | >0.7 |
| LLM attribution rate | Is LLM using retrieved context? | >80% |
| User feedback (thumbs up/down) | End-to-end quality | >90% positive |
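Recall@k and MRR are straightforward to compute offline against a labeled test set; a minimal sketch, assuming retriever.search returns documents with an .id and each query has one known relevant document.
def evaluate_retrieval(test_set, retriever, k=5):
    # test_set: list of (query, relevant_doc_id) pairs you've labeled
    hits = 0
    reciprocal_ranks = []
    for query, relevant_id in test_set:
        retrieved_ids = [doc.id for doc in retriever.search(query, k=k)]
        if relevant_id in retrieved_ids:
            hits += 1
            reciprocal_ranks.append(1 / (retrieved_ids.index(relevant_id) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return {
        "recall_at_k": hits / len(test_set),
        "mrr": sum(reciprocal_ranks) / len(test_set),
    }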
Caching Strategies
RAG involves expensive operations. Cache aggressively:
| Component | Cache Strategy | TTL |
|---|---|---|
| Query embeddings | LRU with semantic dedup | 1 hour |
| Search results | Query hash → results | 15 min |
| Document chunks | Permanent until doc changes | - |
| LLM responses | Query + context hash | 5 min |
Handling Updates
Your knowledge base isn't static. Handle updates without rebuilding everything:
- Incremental indexing: Update only changed documents
- Version control: Track document versions, support rollback
- Cache invalidation: Bust caches when source documents change
- Consistency checks: Periodically verify vector store matches source of truth
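The consistency check can be as simple as a set comparison; a sketch, with list_document_ids as a hypothetical method on both the source connector and the vector store.
def check_consistency(source, vector_store):
    source_ids = set(source.list_document_ids())
    indexed_ids = set(vector_store.list_document_ids())
    return {
        "missing_from_index": source_ids - indexed_ids,  # needs re-ingestion
        "stale_in_index": indexed_ids - source_ids,      # needs deletion
    }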
Common Pitfalls and How to Avoid Them
| Pitfall | Symptom | Solution |
|---|---|---|
| Chunking too small | Retrieved chunks lack context | Increase size, add overlap |
| Chunking too large | Irrelevant content retrieved | Decrease size, use structure |
| Ignoring metadata | Can't filter by date/source | Store and index metadata |
| Single retrieval strategy | Works for some queries, fails for others | Implement hybrid search |
| No reranking | Top result often wrong | Add cross-encoder reranker |
| Embedding model mismatch | Technical terms don't match | Fine-tune or use domain model |
| Ignoring document structure | Tables, code blocks garbled | Structure-aware processing |
Real-World Performance Numbers
From our production deployments:
| Metric | Before Optimization | After Optimization |
|---|---|---|
| Query latency (p50) | 850ms | 180ms |
| Query latency (p99) | 2.5s | 450ms |
| Retrieval accuracy | 72% | 94% |
| User satisfaction | 68% | 91% |
| Cost per query | $0.08 | $0.03 |
The biggest wins came from:
- Proper chunking strategy (not too small, not too large)
- Hybrid search with tuned weights
- Aggressive caching at multiple layers
- Reranking for precision-critical queries
Getting Started
If you're building your first RAG system:
- Start simple: Use a managed vector database, off-the-shelf embedding model, basic chunking
- Measure everything: Set up monitoring from day one
- Build a test set: Create query-document pairs to measure retrieval quality
- Iterate based on data: Don't over-engineer; optimize what measurements show is broken
If you're scaling an existing RAG system:
- Profile your pipeline: Find the actual bottlenecks
- Consider hybrid search: Pure vector often isn't enough
- Add reranking: It's often the highest-ROI optimization
- Invest in chunking: This is where most quality issues originate
RAG is not a solved problem. It's a set of trade-offs between latency, accuracy, and cost. The best systems are the ones that make these trade-offs consciously and measure the results.
We've helped dozens of organizations build RAG systems that actually work in production. If you're struggling with retrieval quality or scaling challenges, we'd be happy to share what we've learned.