Vector Search Architecture: Building Production-Ready Similarity Search Systems
Technical guide to vector search systems. Learn about vector databases, indexing algorithms (HNSW, IVF), similarity metrics, and scaling strategies.
Why Vector Search Matters Now
Here's the thing about traditional keyword search: it only finds lexical matches. You search for "running shoes" and miss all the results about "jogging sneakers" because the words don't match. Vector search fixes this fundamental limitation by understanding meaning, not just matching keywords.
We've deployed vector search systems for e-commerce product discovery, document retrieval in RAG pipelines, and recommendation engines handling millions of queries per day. The technology has matured significantly in the past two years, and the tooling is finally production-ready.
Vector search isn't just about finding similar items. It's about building systems that understand what users actually mean, not just what they type.
Let me walk you through how to build these systems properly, from choosing the right database to scaling for real traffic.
How Vector Search Actually Works
Before diving into databases and algorithms, let's understand what we're actually doing. Vector search works in three steps:
Step 1: Create Embeddings
You take your data (text, images, products, whatever) and convert it into vectors using an embedding model. These vectors are arrays of numbers that capture the semantic meaning of your content.
// Using OpenAI embeddings (assumes the openai npm package and OPENAI_API_KEY)
import OpenAI from 'openai';
const openai = new OpenAI();

const response = await openai.embeddings.create({
  model: "text-embedding-3-small",
  input: "comfortable running shoes for marathons"
});

const vector = response.data[0].embedding;
// Returns: [0.0023, -0.0142, 0.0089, ...] (1536 dimensions)
Step 2: Store and Index Vectors
You store these vectors in a specialized database that builds an index for fast retrieval. This is where the magic happens with algorithms like HNSW and IVF.
Step 3: Query by Similarity
When a user searches, you convert their query to a vector and find the nearest neighbors in your index. The database returns the most similar items ranked by distance.
| Component | Purpose | Example |
|---|---|---|
| Embedding Model | Converts data to vectors | OpenAI text-embedding-3-small, Cohere embed-v3 |
| Vector Database | Stores and indexes vectors | Pinecone, Weaviate, Qdrant |
| Distance Metric | Measures similarity | Cosine, Euclidean, Dot Product |
| ANN Algorithm | Fast approximate search | HNSW, IVF, PQ |
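Before any specialized database enters the picture, the whole pipeline can be sketched with a brute-force scan. The toy vectors and ids below are hypothetical, but this is the same cosine ranking an ANN index approximates, and it makes a useful correctness baseline when you later tune recall:

```javascript
// Brute-force nearest-neighbor search: the exact baseline that every
// ANN index (HNSW, IVF, ...) is an approximation of.
const cosineSimilarity = (a, b) => {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
};

// In production these would be 1536-dim embeddings; 3 dims for clarity
const items = [
  { id: 'doc-1', vector: [0.9, 0.1, 0.0] },
  { id: 'doc-2', vector: [0.1, 0.9, 0.1] },
  { id: 'doc-3', vector: [0.8, 0.2, 0.1] },
];

// Score every item against the query, sort, keep top k
const search = (query, k) =>
  items
    .map(item => ({ id: item.id, score: cosineSimilarity(query, item.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);

console.log(search([1, 0, 0], 2).map(r => r.id)); // most similar first
```

This is O(n) per query, which is exactly why the indexing algorithms later in this guide exist — but below a few hundred thousand vectors, a brute-force scan is often fast enough.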
Choosing a Vector Database
Let's compare the major options. I'll be honest about trade-offs because each database excels in different scenarios.
Pinecone: Managed Simplicity
Pinecone is the easiest to get started with. Fully managed, scales automatically, and just works. The downside? You're locked into their infrastructure and pricing can get steep at scale.
import { Pinecone } from '@pinecone-database/pinecone';

const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const index = pinecone.index('products');

// Upsert vectors
await index.upsert([
  {
    id: 'product-123',
    values: embedding,
    metadata: { category: 'shoes', price: 129.99 }
  }
]);

// Query with metadata filtering
const results = await index.query({
  vector: queryEmbedding,
  topK: 10,
  filter: { category: { $eq: 'shoes' } },
  includeMetadata: true
});
Best for: Teams that want to ship fast without managing infrastructure. Startups with funding. Use cases under 10M vectors where cost isn't the primary concern.
Weaviate: Schema-First with GraphQL
Weaviate takes a different approach with its schema-based design and GraphQL API. You define object classes with properties, and Weaviate handles vectorization automatically if you want.
import weaviate from 'weaviate-ts-client';

const client = weaviate.client({
  scheme: 'https',
  host: 'your-cluster.weaviate.network'
});

// Define schema
await client.schema.classCreator().withClass({
  class: 'Product',
  vectorizer: 'text2vec-openai',
  properties: [
    { name: 'name', dataType: ['text'] },
    { name: 'description', dataType: ['text'] },
    { name: 'price', dataType: ['number'] }
  ]
}).do();

// Query with GraphQL
const result = await client.graphql
  .get()
  .withClassName('Product')
  .withNearText({ concepts: ['marathon running shoes'] })
  .withLimit(10)
  .withFields('name description price _additional { distance }')
  .do();
Best for: Teams building knowledge graphs, applications needing hybrid search (keyword + vector), projects where schema enforcement matters.
Qdrant: Performance-Focused and Open Source
Qdrant is written in Rust and optimized for performance. It's open source, can be self-hosted, and has an excellent filtering system. We've seen it handle 50M+ vectors with sub-100ms latencies on modest hardware.
import { QdrantClient } from '@qdrant/js-client-rest';

const client = new QdrantClient({ url: 'http://localhost:6333' });

// Create collection with optimized settings
await client.createCollection('products', {
  vectors: {
    size: 1536,
    distance: 'Cosine'
  },
  optimizers_config: {
    indexing_threshold: 20000
  },
  hnsw_config: {
    m: 16,
    ef_construct: 100
  }
});

// Upsert with payload (Qdrant point IDs must be unsigned integers or UUIDs)
await client.upsert('products', {
  points: [
    {
      id: 123,
      vector: embedding,
      payload: { category: 'shoes', price: 129.99, in_stock: true }
    }
  ]
});

// Query with complex filters
const results = await client.search('products', {
  vector: queryEmbedding,
  limit: 10,
  filter: {
    must: [
      { key: 'category', match: { value: 'shoes' } },
      { key: 'price', range: { lte: 150 } },
      { key: 'in_stock', match: { value: true } }
    ]
  }
});
Best for: Self-hosted deployments, cost-sensitive projects at scale, applications needing complex filtering, teams that want control over infrastructure.
pgvector: Vector Search in PostgreSQL
If you're already running PostgreSQL, pgvector lets you add vector search without another database. It's not the fastest option, but it's often fast enough and dramatically simplifies your architecture.
-- Enable extension
CREATE EXTENSION vector;

-- Create table with vector column
CREATE TABLE products (
  id SERIAL PRIMARY KEY,
  name TEXT,
  description TEXT,
  category TEXT,
  price NUMERIC,
  embedding VECTOR(1536)
);

-- Create HNSW index for faster queries
CREATE INDEX ON products
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Query nearest neighbors (query_embedding is bound by the application;
-- <=> is cosine distance, so 1 - distance gives similarity)
SELECT id, name, price,
       1 - (embedding <=> query_embedding) AS similarity
FROM products
WHERE category = 'shoes' AND price < 150
ORDER BY embedding <=> query_embedding
LIMIT 10;
Best for: Teams already on PostgreSQL, applications under 5M vectors, use cases where you need ACID transactions with vector search, reducing operational complexity.
Vector Database Comparison
| Feature | Pinecone | Weaviate | Qdrant | pgvector |
|---|---|---|---|---|
| Deployment | Managed only | Managed + Self-hosted | Managed + Self-hosted | Self-hosted |
| Max Scale | Billions | Hundreds of millions | Hundreds of millions | ~10 million |
| Query Latency | <50ms | <100ms | <50ms | <200ms |
| Filtering | Good | Excellent | Excellent | Native SQL |
| Learning Curve | Easy | Moderate | Easy | Minimal if you know SQL |
| Cost at Scale | High | Moderate | Low (self-hosted) | Low |
| Hybrid Search | Limited | Excellent | Good | Via full-text search |
Understanding Indexing Algorithms
The index algorithm determines how quickly you can search millions of vectors. Here's what you need to know about the two most common approaches.
HNSW: Hierarchical Navigable Small World
HNSW is the dominant algorithm for vector search. It builds a multi-layer graph where higher layers have fewer nodes and larger jumps, while lower layers have more nodes and shorter connections. Search starts at the top and drills down.
| Parameter | What it controls | Trade-off |
|---|---|---|
| m | Number of connections per node | Higher = better recall, more memory |
| ef_construct | Search width during index building | Higher = better quality, slower indexing |
| ef_search | Search width during queries | Higher = better recall, slower queries |
// Qdrant HNSW configuration
const hnswConfig = {
  m: 16,                      // 16 connections per node (default)
  ef_construct: 100,          // Construction quality
  full_scan_threshold: 10000  // Switch to brute force below this
};

// For high-recall requirements
const highRecallConfig = {
  m: 32,
  ef_construct: 200
};

// For memory-constrained environments
const lowMemoryConfig = {
  m: 8,
  ef_construct: 50
};
Pros: Fast queries, good recall, works well for most use cases.
Cons: Memory-intensive, slow index updates, not great for streaming data.
IVF: Inverted File Index
IVF clusters your vectors and searches only relevant clusters. It's faster to build than HNSW and uses less memory, but typically has lower recall.
| Parameter | What it controls | Trade-off |
|---|---|---|
| nlist | Number of clusters | Higher = more precision, slower queries |
| nprobe | Clusters to search | Higher = better recall, slower queries |
# Faiss IVF example
import faiss
dimension = 1536
nlist = 100 # Number of clusters
# Create IVF index with flat quantizer
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, nlist)
# Train on sample data
index.train(training_vectors)
index.add(all_vectors)
# Search with nprobe (Faiss expects 2D float32 arrays of shape (n, dimension))
index.nprobe = 10  # Search 10 of the 100 clusters
distances, indices = index.search(query_vector, 10)
Pros: Memory-efficient, fast index building, good for streaming updates.
Cons: Lower recall than HNSW, requires a training step, more tuning needed.
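The clustering idea is easy to see in miniature. Here's a toy IVF sketch in plain JavaScript with hand-picked centroids (a real IVF learns them with k-means during training): vectors land in the inverted list of their nearest centroid, and a query scans only the nprobe closest clusters:

```javascript
// Toy IVF: partition vectors by nearest centroid, search only nprobe
// clusters. Centroids are fixed here purely for illustration.
const l2 = (a, b) =>
  Math.sqrt(a.reduce((s, v, i) => s + (v - b[i]) ** 2, 0));

const centroids = [[0, 0], [10, 0], [0, 10]];
const clusters = centroids.map(() => []); // one inverted list per centroid

const add = (id, vector) => {
  // Assign each vector to the inverted list of its nearest centroid
  let best = 0;
  centroids.forEach((c, i) => {
    if (l2(c, vector) < l2(centroids[best], vector)) best = i;
  });
  clusters[best].push({ id, vector });
};

const search = (query, k, nprobe) => {
  // Rank clusters by centroid distance, scan only the top nprobe
  const probed = centroids
    .map((c, i) => ({ i, d: l2(c, query) }))
    .sort((a, b) => a.d - b.d)
    .slice(0, nprobe);
  return probed
    .flatMap(({ i }) => clusters[i])
    .map(p => ({ id: p.id, d: l2(p.vector, query) }))
    .sort((a, b) => a.d - b.d)
    .slice(0, k);
};

add('a', [1, 1]);
add('b', [9, 1]);
add('c', [1, 9]);
add('d', [0.5, 0.5]);

console.log(search([0, 0], 2, 1).map(r => r.id)); // only cluster 0 is scanned
```

The recall trade-off is visible here: if a true neighbor sits just across a cluster boundary, nprobe = 1 misses it, which is exactly why raising nprobe improves recall at the cost of scanning more lists.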
Product Quantization: Compressing Vectors
When memory is tight, Product Quantization (PQ) compresses vectors by splitting them into subvectors and using codebooks. You trade recall for dramatic memory savings.
# IVF + PQ for massive scale
m = 8 # Number of subquantizers
bits = 8 # Bits per subvector
index = faiss.IndexIVFPQ(quantizer, dimension, nlist, m, bits)
# Reduces memory ~32x compared to flat index
Use PQ when: You have 100M+ vectors and can't afford full-precision storage. Expect 5-10% recall drop.
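The mechanics are simple to demonstrate. A minimal PQ sketch with hand-picked codebooks (real PQ trains them with k-means on your data): each vector is split into m subvectors, and each subvector is stored as a small codebook index instead of full-precision floats:

```javascript
// Product Quantization sketch: a 4-dim vector split into m = 2 subvectors,
// each encoded as an index into a 4-entry codebook. Codebooks here are
// hand-picked for illustration, not trained.
const l2sq = (a, b) => a.reduce((s, v, i) => s + (v - b[i]) ** 2, 0);

// One codebook per subvector position (m = 2, 4 centroids each)
const codebooks = [
  [[0, 0], [1, 0], [0, 1], [1, 1]],
  [[0, 0], [1, 0], [0, 1], [1, 1]],
];

const encode = (vector) =>
  codebooks.map((book, i) => {
    const sub = vector.slice(i * 2, i * 2 + 2);
    let best = 0;
    book.forEach((c, j) => { if (l2sq(c, sub) < l2sq(book[best], sub)) best = j; });
    return best; // one small integer replaces 2 full-precision floats
  });

const decode = (codes) =>
  codes.flatMap((code, i) => codebooks[i][code]); // lossy reconstruction

const original = [0.9, 0.1, 0.1, 0.8];
const codes = encode(original);
console.log(codes, decode(codes));
```

The compression ratio follows directly: storing m codebook indices at a few bits each instead of `dimension` 32-bit floats is where the ~32x savings comes from, and the lossy decode is where the recall drop comes from.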
Similarity Metrics: Choosing the Right Distance
The distance metric determines how similarity is calculated. The choice matters more than you'd think.
| Metric | Formula | Best For |
|---|---|---|
| Cosine | 1 - (A·B / (‖A‖ ‖B‖)) | Text embeddings, normalized vectors |
| Euclidean (L2) | sqrt(sum((A-B)^2)) | Image embeddings, dense features |
| Dot Product | A.B | When magnitude matters, recommendations |
Rule of thumb: If your embeddings come from a text model (OpenAI, Cohere, etc.), use cosine similarity. The vectors are already normalized, and cosine handles them correctly. For image embeddings or custom models, Euclidean often works better.
// Most text embedding APIs return normalized vectors
// Cosine similarity = dot product for normalized vectors
const cosineSimilarity = (a, b) => {
  return a.reduce((sum, val, i) => sum + val * b[i], 0);
};

// Euclidean distance for non-normalized vectors
const euclideanDistance = (a, b) => {
  return Math.sqrt(
    a.reduce((sum, val, i) => sum + Math.pow(val - b[i], 2), 0)
  );
};
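One fact worth knowing when choosing between the two: for unit-length vectors the metrics produce identical rankings, because squared Euclidean distance equals 2 × (1 − cosine similarity). A quick numerical check of that identity:

```javascript
// For normalized vectors: ||a - b||^2 = 2 * (1 - cos(a, b)),
// so cosine and Euclidean give the same nearest-neighbor ordering.
const normalize = (v) => {
  const mag = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return v.map(x => x / mag);
};
const dot = (a, b) => a.reduce((s, v, i) => s + v * b[i], 0);
const l2sq = (a, b) => a.reduce((s, v, i) => s + (v - b[i]) ** 2, 0);

const a = normalize([3, 4, 0]);
const b = normalize([0, 5, 12]);

const viaCosine = 2 * (1 - dot(a, b)); // dot = cosine for unit vectors
const viaEuclid = l2sq(a, b);
console.log(Math.abs(viaCosine - viaEuclid) < 1e-9); // true
```

This is also why the metric choice only really bites for non-normalized embeddings: once vectors are unit-length, cosine, dot product, and Euclidean all agree on the ranking.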
Scaling Vector Search for Production
Here's where theory meets reality. Let me share what we've learned scaling vector search to handle millions of queries.
Sharding Strategies
When one node can't hold all your vectors, you need to shard. There are three main approaches:
Geographic Sharding: Split by region if queries are localized
shard_us: US products (5M vectors)
shard_eu: EU products (3M vectors)
shard_apac: APAC products (2M vectors)
Hash-Based Sharding: Distribute evenly across nodes
const shardId = hash(productId) % numShards;
Metadata-Based Sharding: Split by category or attribute
shard_electronics: Electronics products
shard_clothing: Clothing products
shard_home: Home & Garden products
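Hash-based sharding is the simplest to implement. A sketch using FNV-1a (any stable string hash works; the helper here is illustrative), plus a check that keys spread evenly. One caveat worth a comment: plain modulo remaps most keys when you add a shard, which is why growing clusters tend to use consistent or rendezvous hashing instead:

```javascript
// Hash-based sharding: a stable string hash (FNV-1a) modulo the shard
// count. Caveat: with plain modulo, changing numShards remaps most keys;
// consistent/rendezvous hashing avoids that mass migration.
const fnv1a = (str) => {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // keep it an unsigned 32-bit value
  }
  return h;
};

const numShards = 4;
const shardFor = (id) => fnv1a(id) % numShards;

// Same id always routes to the same shard
console.log(shardFor('product-123') === shardFor('product-123')); // true

// Distribution over many ids should be roughly even
const counts = new Array(numShards).fill(0);
for (let i = 0; i < 10000; i++) counts[shardFor(`product-${i}`)]++;
console.log(counts); // roughly 2500 per shard
```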
Replication for Availability
Run at least 3 replicas for production workloads. Vector search is read-heavy, so replicas also help with throughput.
# Qdrant cluster configuration
cluster:
  enabled: true
  replication_factor: 3
  shard_number: 6

# Pinecone (automatic)
# Just select "production" tier with multiple pods

# Weaviate
replicationConfig:
  factor: 3
Caching Hot Queries
Some queries are much more common than others. Cache them.
import { Redis } from 'ioredis';

const redis = new Redis();
const CACHE_TTL = 3600; // 1 hour

async function searchWithCache(query, filters) {
  // hash() is any stable string hash; serialize filters before hashing
  const cacheKey = `vsearch:${hash(query)}:${hash(JSON.stringify(filters))}`;

  // Check cache first
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  // Query vector database
  const embedding = await generateEmbedding(query);
  const results = await vectorDB.search({
    vector: embedding,
    filter: filters,
    limit: 20
  });

  // Cache results
  await redis.setex(cacheKey, CACHE_TTL, JSON.stringify(results));
  return results;
}
Batching and Async Processing
Don't process vectors one at a time. Batch everything.
// Bad: Sequential processing
for (const doc of documents) {
  const embedding = await generateEmbedding(doc.text);
  await vectorDB.upsert({ id: doc.id, vector: embedding });
}

// Good: Batch processing
const BATCH_SIZE = 100;

for (let i = 0; i < documents.length; i += BATCH_SIZE) {
  const batch = documents.slice(i, i + BATCH_SIZE);

  // Generate embeddings in parallel
  const embeddings = await Promise.all(
    batch.map(doc => generateEmbedding(doc.text))
  );

  // Upsert in batch
  await vectorDB.upsert(
    batch.map((doc, j) => ({
      id: doc.id,
      vector: embeddings[j],
      metadata: doc.metadata
    }))
  );
}
Production Architecture Example
Here's a real architecture we use for an e-commerce search system handling 10M+ products and 1000+ queries per second:
+---------------------------------------------------------------+
|                         Load Balancer                         |
+------------------------------+--------------------------------+
                               |
+------------------------------v--------------------------------+
|                          API Gateway                          |
|                 (Rate limiting, auth, routing)                |
+------------------------------+--------------------------------+
                               |
          +--------------------+-------------------+
          |                    |                   |
    +-----v------+       +-----v-----+       +-----v-----+
    |   Cache    |       |  Search   |       |  Search   |
    |  (Redis)   |       |  Node 1   |       |  Node 2   |
    +------------+       +-----+-----+       +-----+-----+
                               |                   |
                 +-------------v-------------------v------+
                 |            Qdrant Cluster              |
                 |        (3 shards, 3 replicas)          |
                 +-------------------+--------------------+
                                     |
                 +-------------------v--------------------+
                 |           Embedding Service            |
                 |           (GPU-accelerated)            |
                 +----------------------------------------+
Key Components
Search Nodes: Stateless services that handle query processing, embedding generation, and result ranking. We run 2-4 instances behind a load balancer.
Qdrant Cluster: 3 shards for data distribution, 3 replicas for availability. Each shard handles ~3.5M vectors. Total memory: ~50GB across the cluster.
Embedding Service: Dedicated GPU service for generating embeddings. We use ONNX-optimized models for 10x faster inference than vanilla transformers.
Redis Cache: Caches common queries and hot embeddings. Reduces Qdrant load by ~60%.
Common Pitfalls and How to Avoid Them
Pitfall 1: Not Normalizing Vectors
If you're using cosine similarity but your vectors aren't normalized, you'll get wrong results. Most embedding APIs return normalized vectors, but if you're using a custom model, normalize them yourself.
function normalize(vector) {
  const magnitude = Math.sqrt(
    vector.reduce((sum, val) => sum + val * val, 0)
  );
  return vector.map(val => val / magnitude);
}
Pitfall 2: Ignoring the Embedding Model
The embedding model matters more than the database. A good model with pgvector will beat a bad model with Pinecone. Spend time evaluating embedding quality before optimizing infrastructure.
Pitfall 3: Not Planning for Updates
Most vector databases are optimized for reads, not writes. If you need frequent updates, design for it:
- Use write-ahead logs
- Batch updates during low-traffic periods
- Consider a staging index for new data
Pitfall 4: Over-Filtering Before Vector Search
Naive pre-filtering forces a brute-force scan over whatever survives the filter, which defeats the purpose of the index. For loosely selective filters, let the vector index do its job, over-fetch, and filter the results; for highly selective filters, prefer your database's native filtered search rather than filtering by hand.
// Less efficient: filter first, then brute-force the surviving vectors
// More efficient: search all, then filter the top results
const results = await vectorDB.search({
  vector: queryEmbedding,
  limit: 100 // Over-fetch, since filtering will drop some results
});

const filtered = results.filter(r =>
  r.metadata.price < maxPrice &&
  r.metadata.in_stock
).slice(0, 10);
Monitoring and Observability
You can't improve what you don't measure. Track these metrics:
| Metric | Target | Action if exceeded |
|---|---|---|
| p50 latency | <50ms | Check index configuration |
| p99 latency | <200ms | Add replicas or shards |
| Recall@10 | >95% | Increase ef_search or m |
| QPS per node | <1000 | Add more nodes |
| Memory usage | <80% | Shard or use PQ compression |
// Track search latency
const startTime = Date.now();
const results = await vectorDB.search(query);
const latency = Date.now() - startTime;

metrics.histogram('vector_search_latency_ms', latency, {
  collection: 'products',
  filter_count: Object.keys(query.filter || {}).length
});

// Track recall (requires ground truth)
const recall = calculateRecall(results, groundTruth);
metrics.gauge('vector_search_recall', recall);
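The calculateRecall call above is a placeholder; a minimal recall@k helper might look like this, assuming results carry ids and the ground-truth ids come from an offline exact (brute-force) search over a sample of queries:

```javascript
// Recall@k: what fraction of the true top-k neighbors did the ANN
// index actually return? Ground truth is computed offline with an
// exact search over sampled queries.
const recallAtK = (retrievedIds, groundTruthIds, k) => {
  const truth = new Set(groundTruthIds.slice(0, k));
  const hits = retrievedIds.slice(0, k).filter(id => truth.has(id)).length;
  return hits / Math.min(k, truth.size);
};

// 9 of the true top-10 retrieved -> recall@10 = 0.9
const retrieved = ['a','b','c','d','e','f','g','h','i','x'];
const truth     = ['a','b','c','d','e','f','g','h','i','j'];
console.log(recallAtK(retrieved, truth, 10)); // 0.9
```

Run this on a fixed query sample after every index configuration change; a recall regression after tuning m or ef_search shows up immediately.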
Getting Started: Practical Recommendations
If you're starting fresh, here's what I'd recommend:
- Under 1M vectors: Start with pgvector. It's simple, it's probably fast enough, and you already know SQL.
- 1M-10M vectors: Qdrant self-hosted gives you the best performance per dollar. Pinecone if you don't want to manage infrastructure.
- 10M-100M vectors: Qdrant or Weaviate with proper sharding. Consider Pinecone's enterprise tier if budget allows.
- 100M+ vectors: You need specialized architecture. Consider Milvus, multi-cluster Qdrant, or custom solutions with Faiss.
Start with HNSW indexes (the default in most databases). Only optimize when you hit actual performance issues. Premature optimization wastes time on problems you might not have.
Conclusion
Vector search has moved from research curiosity to production necessity. Whether you're building semantic search, recommendation systems, or RAG pipelines, understanding these fundamentals will help you make better architectural decisions.
The technology is mature, the tooling is good, and the community has solved most common problems. What remains is choosing the right tools for your specific use case and implementing them thoughtfully.
If you're building vector search systems and want to discuss architecture, we're always happy to share what we've learned from production deployments.