Technical Guide

Vector Search Architecture: Building Production-Ready Similarity Search Systems

Technical guide to vector search systems. Learn about vector databases, indexing algorithms (HNSW, IVF), similarity metrics, and scaling strategies.

February 10, 2026 · 18 min read · Oronts Engineering Team

Why Vector Search Matters Now

Here's the thing about traditional search: it only finds exact matches. You search for "running shoes" and miss all the results about "jogging sneakers" because the words don't match. Vector search fixes this fundamental limitation by understanding meaning, not just matching keywords.

We've deployed vector search systems for e-commerce product discovery, document retrieval in RAG pipelines, and recommendation engines handling millions of queries per day. The technology has matured significantly in the past two years, and the tooling is finally production-ready.

Vector search isn't just about finding similar items. It's about building systems that understand what users actually mean, not just what they type.

Let me walk you through how to build these systems properly, from choosing the right database to scaling for real traffic.

How Vector Search Actually Works

Before diving into databases and algorithms, let's understand what we're actually doing. Vector search works in three steps:

Step 1: Create Embeddings

You take your data (text, images, products, whatever) and convert it into vectors using an embedding model. These vectors are arrays of numbers that capture the semantic meaning of your content.

// Using OpenAI embeddings
const response = await openai.embeddings.create({
  model: "text-embedding-3-small",
  input: "comfortable running shoes for marathons"
});

const vector = response.data[0].embedding;
// Returns: [0.0023, -0.0142, 0.0089, ...] (1536 dimensions)

Step 2: Store and Index Vectors

You store these vectors in a specialized database that builds an index for fast retrieval. This is where the magic happens with algorithms like HNSW and IVF.

Step 3: Query by Similarity

When a user searches, you convert their query to a vector and find the nearest neighbors in your index. The database returns the most similar items ranked by distance.
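At its core, "nearest neighbors" just means ranking every stored vector by its distance to the query. A brute-force sketch (assuming normalized vectors, so cosine similarity reduces to a dot product) makes the idea concrete — too slow beyond a few hundred thousand vectors, but a useful correctness baseline for any ANN index:

```javascript
// Brute-force k-nearest-neighbor search by cosine similarity.
// Assumes all vectors are normalized, so cosine similarity = dot product.
const dot = (a, b) => a.reduce((sum, v, i) => sum + v * b[i], 0);

function knnSearch(queryVector, items, k = 10) {
  // items: [{ id, vector }, ...]
  return items
    .map(item => ({ id: item.id, score: dot(queryVector, item.vector) }))
    .sort((a, b) => b.score - a.score)  // higher similarity first
    .slice(0, k);
}

// Toy example with 3-dimensional normalized vectors
const items = [
  { id: 'a', vector: [1, 0, 0] },
  { id: 'b', vector: [0, 1, 0] },
  { id: 'c', vector: [0.6, 0.8, 0] }
];
const top = knnSearch([0, 1, 0], items, 2);
// top[0] is 'b' (exact match), top[1] is 'c'
```

Every ANN index approximates exactly this ranking; comparing ANN output against a brute-force pass like this is also the standard way to measure recall.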

Component | Purpose | Example
Embedding Model | Converts data to vectors | OpenAI ada-002, Cohere embed-v3
Vector Database | Stores and indexes vectors | Pinecone, Weaviate, Qdrant
Distance Metric | Measures similarity | Cosine, Euclidean, Dot Product
ANN Algorithm | Fast approximate search | HNSW, IVF, PQ

Choosing a Vector Database

Let's compare the major options. I'll be honest about trade-offs because each database excels in different scenarios.

Pinecone: Managed Simplicity

Pinecone is the easiest to get started with. Fully managed, scales automatically, and just works. The downside? You're locked into their infrastructure and pricing can get steep at scale.

import { Pinecone } from '@pinecone-database/pinecone';

const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const index = pinecone.index('products');

// Upsert vectors
await index.upsert([
  {
    id: 'product-123',
    values: embedding,
    metadata: { category: 'shoes', price: 129.99 }
  }
]);

// Query with metadata filtering
const results = await index.query({
  vector: queryEmbedding,
  topK: 10,
  filter: { category: { $eq: 'shoes' } },
  includeMetadata: true
});

Best for: Teams that want to ship fast without managing infrastructure. Startups with funding. Use cases under 10M vectors where cost isn't the primary concern.

Weaviate: Schema-First with GraphQL

Weaviate takes a different approach with its schema-based design and GraphQL API. You define object classes with properties, and Weaviate handles vectorization automatically if you want.

import weaviate from 'weaviate-ts-client';

const client = weaviate.client({
  scheme: 'https',
  host: 'your-cluster.weaviate.network'
});

// Define schema
await client.schema.classCreator().withClass({
  class: 'Product',
  vectorizer: 'text2vec-openai',
  properties: [
    { name: 'name', dataType: ['text'] },
    { name: 'description', dataType: ['text'] },
    { name: 'price', dataType: ['number'] }
  ]
}).do();

// Query with GraphQL
const result = await client.graphql
  .get()
  .withClassName('Product')
  .withNearText({ concepts: ['marathon running shoes'] })
  .withLimit(10)
  .withFields('name description price _additional { distance }')
  .do();

Best for: Teams building knowledge graphs, applications needing hybrid search (keyword + vector), projects where schema enforcement matters.

Qdrant: Performance-Focused and Open Source

Qdrant is written in Rust and optimized for performance. It's open source, can be self-hosted, and has an excellent filtering system. We've seen it handle 50M+ vectors with sub-100ms latencies on modest hardware.

import { QdrantClient } from '@qdrant/js-client-rest';

const client = new QdrantClient({ url: 'http://localhost:6333' });

// Create collection with optimized settings
await client.createCollection('products', {
  vectors: {
    size: 1536,
    distance: 'Cosine'
  },
  optimizers_config: {
    indexing_threshold: 20000
  },
  hnsw_config: {
    m: 16,
    ef_construct: 100
  }
});

// Upsert with payload
// Note: Qdrant point IDs must be unsigned integers or UUIDs, not arbitrary strings
await client.upsert('products', {
  points: [
    {
      id: 123,
      vector: embedding,
      payload: { category: 'shoes', price: 129.99, in_stock: true }
    }
  ]
});

// Query with complex filters
const results = await client.search('products', {
  vector: queryEmbedding,
  limit: 10,
  filter: {
    must: [
      { key: 'category', match: { value: 'shoes' } },
      { key: 'price', range: { lte: 150 } },
      { key: 'in_stock', match: { value: true } }
    ]
  }
});

Best for: Self-hosted deployments, cost-sensitive projects at scale, applications needing complex filtering, teams that want control over infrastructure.

pgvector: Vector Search in PostgreSQL

If you're already running PostgreSQL, pgvector lets you add vector search without another database. It's not the fastest option, but it's often fast enough and dramatically simplifies your architecture.

-- Enable extension
CREATE EXTENSION vector;

-- Create table with vector column
CREATE TABLE products (
  id SERIAL PRIMARY KEY,
  name TEXT,
  description TEXT,
  category TEXT,
  price NUMERIC,
  embedding VECTOR(1536)
);

-- Create HNSW index for faster queries
CREATE INDEX ON products
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Query nearest neighbors
SELECT id, name, price,
       1 - (embedding <=> query_embedding) AS similarity
FROM products
WHERE category = 'shoes' AND price < 150
ORDER BY embedding <=> query_embedding
LIMIT 10;

Best for: Teams already on PostgreSQL, applications under 5M vectors, use cases where you need ACID transactions with vector search, reducing operational complexity.

Vector Database Comparison

Feature | Pinecone | Weaviate | Qdrant | pgvector
Deployment | Managed only | Managed + Self-hosted | Managed + Self-hosted | Self-hosted
Max Scale | Billions | Hundreds of millions | Hundreds of millions | ~10 million
Query Latency | <50ms | <100ms | <50ms | <200ms
Filtering | Good | Excellent | Excellent | Native SQL
Learning Curve | Easy | Moderate | Easy | Minimal if you know SQL
Cost at Scale | High | Moderate | Low (self-hosted) | Low
Hybrid Search | Limited | Excellent | Good | Via full-text search

Understanding Indexing Algorithms

The index algorithm determines how quickly you can search millions of vectors. Here's what you need to know about the two most common approaches.

HNSW: Hierarchical Navigable Small World

HNSW is the dominant algorithm for vector search. It builds a multi-layer graph where higher layers have fewer nodes and larger jumps, while lower layers have more nodes and shorter connections. Search starts at the top and drills down.

Parameter | What it controls | Trade-off
m | Number of connections per node | Higher = better recall, more memory
ef_construct | Search width during index building | Higher = better quality, slower indexing
ef_search | Search width during queries | Higher = better recall, slower queries

// Qdrant HNSW configuration
const hnswConfig = {
  m: 16,              // 16 connections per node (default)
  ef_construct: 100,  // Construction quality
  full_scan_threshold: 10000  // Switch to brute force below this
};

// For high-recall requirements
const highRecallConfig = {
  m: 32,
  ef_construct: 200
};

// For memory-constrained environments
const lowMemoryConfig = {
  m: 8,
  ef_construct: 50
};

Pros: Fast queries, good recall, works well for most use cases.
Cons: Memory-intensive, slow index updates, not great for streaming data.
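Note that unlike m and ef_construct, ef_search is set per query rather than at index build time. In Qdrant, for instance, it surfaces as the hnsw_ef search parameter (request shape follows Qdrant's search API; the values here are illustrative):

```javascript
// Query-time HNSW tuning in Qdrant: hnsw_ef widens the candidate list during
// graph traversal. Higher values improve recall at the cost of latency.
const queryEmbedding = new Array(1536).fill(0.01);  // placeholder vector

const searchRequest = {
  vector: queryEmbedding,
  limit: 10,
  params: {
    hnsw_ef: 256,  // raise for recall-sensitive queries, lower for latency
    exact: false   // true forces a brute-force scan (useful for recall audits)
  }
};
// Pass to the Qdrant JS client as: client.search('products', searchRequest)
```

This lets you serve latency-sensitive traffic with a low ef_search and run recall-sensitive jobs (evaluation, offline ranking) with a higher one against the same index.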

IVF: Inverted File Index

IVF clusters your vectors and searches only relevant clusters. It's faster to build than HNSW and uses less memory, but typically has lower recall.

Parameter | What it controls | Trade-off
nlist | Number of clusters | Higher = more precision, slower queries
nprobe | Clusters to search | Higher = better recall, slower queries

# Faiss IVF example
import faiss

dimension = 1536
nlist = 100  # Number of clusters

# Create IVF index with flat quantizer
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, nlist)

# Train on sample data
index.train(training_vectors)
index.add(all_vectors)

# Search with nprobe
index.nprobe = 10  # Search 10 clusters
distances, indices = index.search(query_vector, k=10)

Pros: Memory-efficient, fast index building, good for streaming updates.
Cons: Lower recall than HNSW, requires training step, more tuning needed.

Product Quantization: Compressing Vectors

When memory is tight, Product Quantization (PQ) compresses vectors by splitting them into subvectors and using codebooks. You trade recall for dramatic memory savings.

# IVF + PQ for massive scale (reuses quantizer, dimension, nlist from the IVF example)
m = 8     # Number of subquantizers (subvectors per vector)
bits = 8  # Bits per code, so each subvector compresses to one byte

index = faiss.IndexIVFPQ(quantizer, dimension, nlist, m, bits)
# Each 1536-dim float32 vector (6,144 bytes) compresses to m = 8 code bytes

Use PQ when: You have 100M+ vectors and can't afford full-precision storage. Expect 5-10% recall drop.

Similarity Metrics: Choosing the Right Distance

The distance metric determines how similarity is calculated. The choice matters more than you'd think.

Metric | Formula | Best For
Cosine | 1 - (A · B / ||A|| ||B||) | Text embeddings, normalized vectors
Euclidean (L2) | sqrt(sum((A - B)^2)) | Image embeddings, dense features
Dot Product | A · B | When magnitude matters, recommendations

Rule of thumb: If your embeddings come from a text model (OpenAI, Cohere, etc.), use cosine similarity. The vectors are already normalized, and cosine handles them correctly. For image embeddings or custom models, Euclidean often works better.

// Most text embedding APIs return normalized vectors
// Cosine similarity = dot product for normalized vectors
const cosineSimilarity = (a, b) => {
  return a.reduce((sum, val, i) => sum + val * b[i], 0);
};

// Euclidean distance for non-normalized vectors
const euclideanDistance = (a, b) => {
  return Math.sqrt(
    a.reduce((sum, val, i) => sum + Math.pow(val - b[i], 2), 0)
  );
};

Scaling Vector Search for Production

Here's where theory meets reality. Let me share what we've learned scaling vector search to handle millions of queries.

Sharding Strategies

When one node can't hold all your vectors, you need to shard. There are two main approaches:

Geographic Sharding: Split by region if queries are localized

shard_us: US products (5M vectors)
shard_eu: EU products (3M vectors)
shard_apac: APAC products (2M vectors)

Hash-Based Sharding: Distribute evenly across nodes

const shardId = hash(productId) % numShards;

Metadata-Based Sharding: Split by category or attribute

shard_electronics: Electronics products
shard_clothing: Clothing products
shard_home: Home & Garden products
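With hash-based sharding, each write lands on exactly one shard, but similarity queries generally have to fan out to every shard and merge by score, since the nearest neighbor can live anywhere. A sketch of both halves (fnv1a is a stand-in for whatever hash you use; the shard objects here are hypothetical search clients):

```javascript
// FNV-1a hash for deterministic shard routing
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

const NUM_SHARDS = 3;
const shardFor = id => fnv1a(id) % NUM_SHARDS;  // writes route to one shard

// Reads fan out: query every shard, then merge by score.
// Over-fetch k from each shard so the global top-k is always covered.
async function fanOutSearch(shards, queryVector, k) {
  const perShard = await Promise.all(
    shards.map(shard => shard.search(queryVector, k))
  );
  return perShard
    .flat()
    .sort((a, b) => b.score - a.score)  // higher similarity first
    .slice(0, k);
}
```

Most distributed vector databases (Qdrant clusters, Pinecone pods) do this scatter-gather internally; you only need to build it yourself when sharding across independent instances.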

Replication for Availability

Run at least 3 replicas for production workloads. Vector search is read-heavy, so replicas also help with throughput.

# Qdrant cluster configuration
cluster:
  enabled: true
  replication_factor: 3
  shard_number: 6

# Pinecone (automatic)
# Just select "production" tier with multiple pods

# Weaviate
replicationConfig:
  factor: 3

Caching Hot Queries

Some queries are much more common than others. Cache them.

import { Redis } from 'ioredis';

const redis = new Redis();
const CACHE_TTL = 3600; // 1 hour

async function searchWithCache(query, filters) {
  const cacheKey = `vsearch:${hash(query)}:${hash(filters)}`;

  // Check cache first
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  // Query vector database
  const embedding = await generateEmbedding(query);
  const results = await vectorDB.search({
    vector: embedding,
    filter: filters,
    limit: 20
  });

  // Cache results
  await redis.setex(cacheKey, CACHE_TTL, JSON.stringify(results));

  return results;
}

Batching and Async Processing

Don't process vectors one at a time. Batch everything.

// Bad: Sequential processing
for (const doc of documents) {
  const embedding = await generateEmbedding(doc.text);
  await vectorDB.upsert({ id: doc.id, vector: embedding });
}

// Good: Batch processing
const BATCH_SIZE = 100;
for (let i = 0; i < documents.length; i += BATCH_SIZE) {
  const batch = documents.slice(i, i + BATCH_SIZE);

  // Generate embeddings in parallel
  const embeddings = await Promise.all(
    batch.map(doc => generateEmbedding(doc.text))
  );

  // Upsert in batch
  await vectorDB.upsert(
    batch.map((doc, j) => ({
      id: doc.id,
      vector: embeddings[j],
      metadata: doc.metadata
    }))
  );
}

Production Architecture Example

Here's a real architecture we use for an e-commerce search system handling 10M+ products and 1000+ queries per second:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                         Load Balancer                        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                      β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      API Gateway                             β”‚
β”‚              (Rate limiting, auth, routing)                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                      β”‚
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚            β”‚            β”‚
   β”Œβ”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”
   β”‚   Cache   β”‚ β”‚ Search  β”‚ β”‚ Search  β”‚
   β”‚  (Redis)  β”‚ β”‚ Node 1  β”‚ β”‚ Node 2  β”‚
   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜
                      β”‚           β”‚
              β”Œβ”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”
              β”‚     Qdrant Cluster        β”‚
              β”‚   (3 shards, 3 replicas)  β”‚
              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                          β”‚
              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
              β”‚   Embedding Service       β”‚
              β”‚   (GPU-accelerated)       β”‚
              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Key Components

Search Nodes: Stateless services that handle query processing, embedding generation, and result ranking. We run 2-4 instances behind a load balancer.

Qdrant Cluster: 3 shards for data distribution, 3 replicas for availability. Each shard handles ~3.5M vectors. Total memory: ~50GB across the cluster.

Embedding Service: Dedicated GPU service for generating embeddings. We use ONNX-optimized models for 10x faster inference than vanilla transformers.

Redis Cache: Caches common queries and hot embeddings. Reduces Qdrant load by ~60%.

Common Pitfalls and How to Avoid Them

Pitfall 1: Not Normalizing Vectors

If you're using cosine similarity but your vectors aren't normalized, you'll get wrong results. Most embedding APIs return normalized vectors, but if you're using a custom model, normalize them yourself.

function normalize(vector) {
  const magnitude = Math.sqrt(
    vector.reduce((sum, val) => sum + val * val, 0)
  );
  return vector.map(val => val / magnitude);
}

Pitfall 2: Ignoring the Embedding Model

The embedding model matters more than the database. A good model with pgvector will beat a bad model with Pinecone. Spend time evaluating embedding quality before optimizing infrastructure.

Pitfall 3: Not Planning for Updates

Most vector databases are optimized for reads, not writes. If you need frequent updates, design for it:

  • Use write-ahead logs
  • Batch updates during low-traffic periods
  • Consider a staging index for new data
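One way to implement the batching advice is a small write buffer that absorbs single-document upserts and flushes them in bulk, either when full or on a timer. A sketch (the vectorDB interface, batch size, and flush interval are all illustrative):

```javascript
// Buffered writer: collects individual upserts and flushes them as batches.
class BufferedUpserts {
  constructor(vectorDB, { maxBatch = 100, flushMs = 2000 } = {}) {
    this.db = vectorDB;
    this.maxBatch = maxBatch;
    this.flushMs = flushMs;
    this.buffer = [];
    this.timer = null;
  }

  add(point) {
    this.buffer.push(point);
    if (this.buffer.length >= this.maxBatch) return this.flush();
    if (!this.timer) {
      // Flush partial batches after flushMs so updates aren't stuck waiting
      this.timer = setTimeout(() => this.flush(), this.flushMs);
    }
  }

  async flush() {
    clearTimeout(this.timer);
    this.timer = null;
    if (this.buffer.length === 0) return;
    const batch = this.buffer.splice(0);  // drain the buffer
    await this.db.upsert(batch);          // one bulk write instead of many
  }
}
```

The same shape works as a staging layer: point `db` at a staging collection during bulk reindexing, then swap collections atomically once the new index is warm.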

Pitfall 4: Over-Filtering Before Vector Search

Heavy pre-filtering can shrink the candidate set so much that the ANN index degenerates into a slow scan, so filtering after vector search is usually more efficient: let the vector index do its job, then filter the top results. The trade-off is that a selective filter can leave you with fewer than k hits, so over-fetch accordingly. (Engines like Qdrant apply filters during graph traversal, which avoids the problem natively.)

// Less efficient: Filter first, then search remaining vectors
// More efficient: Search all, then filter top results

const results = await vectorDB.search({
  vector: queryEmbedding,
  limit: 100  // Get more results than needed
});

const filtered = results.filter(r =>
  r.metadata.price < maxPrice &&
  r.metadata.in_stock
).slice(0, 10);

Monitoring and Observability

You can't improve what you don't measure. Track these metrics:

Metric | Target | Action if exceeded
p50 latency | <50ms | Check index configuration
p99 latency | <200ms | Add replicas or shards
Recall@10 | >95% | Increase ef_search or m
QPS per node | <1000 | Add more nodes
Memory usage | <80% | Shard or use PQ compression

// Track search latency
const startTime = Date.now();
const results = await vectorDB.search(query);
const latency = Date.now() - startTime;

metrics.histogram('vector_search_latency_ms', latency, {
  collection: 'products',
  filter_count: Object.keys(query.filter || {}).length
});

// Track recall (requires ground truth)
const recall = calculateRecall(results, groundTruth);
metrics.gauge('vector_search_recall', recall);
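The calculateRecall helper above is left undefined; a minimal Recall@k implementation, where the ground truth is the exact top-k from a brute-force scan over the same query, could look like:

```javascript
// Recall@k: fraction of the true top-k neighbors that the ANN search returned.
// groundTruth comes from an exact (brute-force) search over the same query.
function calculateRecall(annResults, groundTruth) {
  const truth = new Set(groundTruth.map(r => r.id));
  const hits = annResults.filter(r => truth.has(r.id)).length;
  return hits / truth.size;
}

// Example: if ANN returns 9 of the 10 true neighbors, recall is 0.9
```

Ground truth is expensive to compute, so in practice you sample: run exact search for a few hundred representative queries offline and track recall on that fixed set as you tune index parameters.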

Getting Started: Practical Recommendations

If you're starting fresh, here's what I'd recommend:

  1. Under 1M vectors: Start with pgvector. It's simple, it's probably fast enough, and you already know SQL.

  2. 1M-10M vectors: Qdrant self-hosted gives you the best performance per dollar. Pinecone if you don't want to manage infrastructure.

  3. 10M-100M vectors: Qdrant or Weaviate with proper sharding. Consider Pinecone's enterprise tier if budget allows.

  4. 100M+ vectors: You need specialized architecture. Consider Milvus, multi-cluster Qdrant, or custom solutions with Faiss.

Start with HNSW indexes (the default in most databases). Only optimize when you hit actual performance issues. Premature optimization wastes time on problems you might not have.

Conclusion

Vector search has moved from research curiosity to production necessity. Whether you're building semantic search, recommendation systems, or RAG pipelines, understanding these fundamentals will help you make better architectural decisions.

The technology is mature, the tooling is good, and the community has solved most common problems. What remains is choosing the right tools for your specific use case and implementing them thoughtfully.

If you're building vector search systems and want to discuss architecture, we're always happy to share what we've learned from production deployments.

Topics covered

vector search, vector database, Pinecone, Weaviate, Qdrant, pgvector, HNSW, IVF, similarity search, embeddings, semantic search, ANN, approximate nearest neighbor
