System Architecture & Scalability
A comprehensive guide to designing systems that last. Learn about architectural patterns, API design, authentication systems, real-time infrastructure, and building for scale without over-engineering.
Designing Systems That Last
The hardest part of software architecture isn't building systems that work. It's building systems that keep working—through growth, changing requirements, team turnover, and years of maintenance.
We've seen enough architectural disasters to know what doesn't work. The over-engineered microservices that should have been a monolith. The monolith that should have been split years ago. The "scalable" system that can't handle 100 concurrent users.
Good architecture is about making the right trade-offs for your actual situation—not following patterns blindly or preparing for scale you'll never reach.
The goal isn't the most sophisticated architecture. It's the simplest architecture that solves your problem and can evolve with your business.
Architectural Principles
Before diving into patterns and technologies, here are the principles that guide our decisions.
1. Start Simple, Scale When Needed
Don't build for 10 million users when you have 1,000. That's not planning ahead—it's wasting resources on problems you don't have.
| Stage | Architecture | When It Fits |
|---|---|---|
| MVP | Monolith, single DB | Still validating the business |
| Growth | Monolith, read replicas, caching | Hitting performance limits |
| Scale | Service extraction, async processing | Clear bottlenecks identified |
| Enterprise | Event-driven, distributed | Org/domain boundaries clear |
2. Boring Technology Wins
We use proven, boring technology for critical systems. PostgreSQL over the latest NewSQL database. Node.js over experimental runtimes. Kubernetes over custom orchestration.
Boring but reliable:
├── PostgreSQL (not yet another NoSQL)
├── Redis (not experimental caches)
├── Node.js/TypeScript (not experimental languages)
├── React (not framework-of-the-week)
└── Kubernetes (not custom orchestration)
Innovation is for the edges where it matters. Core infrastructure should be battle-tested.
3. Design for Change
Requirements will change. The architecture should accommodate change without rewrites.
// Bad: Hardcoded assumptions
function processOrder(order) {
  charge(order.total);   // What about invoicing? Split payments?
  ship(order.items);     // What about digital goods? Subscriptions?
  email(order.customer); // What about SMS? Push notifications?
}

// Good: Extensible through events
async function processOrder(order) {
  const result = await chargeOrder(order);
  await eventBus.publish('order.paid', { order, result });
  // Listeners handle shipping, notifications, inventory, analytics...
}
4. Make It Observable
You can't fix what you can't see. Build observability in from the start.
// Every service call includes context
async function createOrder(data, context: RequestContext) {
  const span = tracer.startSpan('createOrder', { parent: context.span });
  try {
    const order = await orderService.create(data);
    metrics.increment('orders.created', { channel: data.channel });
    logger.info('Order created', {
      orderId: order.id,
      total: order.total,
      traceId: context.traceId
    });
    return order;
  } catch (error) {
    span.recordException(error);
    throw error;
  } finally {
    span.end();
  }
}
API Design
APIs are the contracts between systems. Once published, they're hard to change. Design them carefully.
REST Done Right
REST works well for most use cases. The key is consistency.
// Consistent patterns across all endpoints
GET /api/v1/orders // List (paginated)
GET /api/v1/orders/:id // Get one
POST /api/v1/orders // Create
PATCH /api/v1/orders/:id // Partial update
DELETE /api/v1/orders/:id // Delete
// Query parameters for filtering/pagination
GET /api/v1/orders?status=pending&page=1&limit=20&sort=-createdAt
// Nested resources when relationship is strong
GET /api/v1/orders/:id/items
POST /api/v1/orders/:id/refunds
// Response format: consistent structure
{
  "data": { ... },
  "meta": {
    "page": 1,
    "limit": 20,
    "total": 156
  }
}
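The `page`, `limit`, and `sort` parameters and the `meta` envelope above could be produced by a small helper along these lines. This is a minimal sketch: `parseSort`, `paginate`, and the 100-item page cap are illustrative assumptions, not part of any particular framework.

```typescript
// Hypothetical helpers: names and shapes are illustrative only.
interface PageMeta {
  page: number;
  limit: number;
  total: number;
}

interface PagedResponse<T> {
  data: T[];
  meta: PageMeta;
}

// Parse a sort parameter such as "-createdAt" into a field and a direction.
function parseSort(sort: string): { field: string; descending: boolean } {
  const descending = sort.startsWith("-");
  return { field: descending ? sort.slice(1) : sort, descending };
}

// Slice an already filtered and sorted list into one page (pages are 1-based).
function paginate<T>(items: T[], page: number, limit: number): PagedResponse<T> {
  // Clamp inputs; the cap of 100 is an arbitrary choice for this sketch.
  const safeLimit = Math.min(Math.max(Math.trunc(limit) || 1, 1), 100);
  const safePage = Math.max(Math.trunc(page) || 1, 1);
  const start = (safePage - 1) * safeLimit;
  return {
    data: items.slice(start, start + safeLimit),
    meta: { page: safePage, limit: safeLimit, total: items.length },
  };
}
```

In a real service the slicing would normally happen in the database query (`LIMIT`/`OFFSET` or a cursor) rather than in memory; the point here is only that every list endpoint returns the same envelope.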
API Versioning
APIs evolve. Plan for it.
| Strategy | When to Use | Trade-offs |
|---|---|---|
| URL versioning (/v1/) | Major breaking changes | Clear, but multiple codebases |
| Header versioning | More granular control | Less discoverable |
| No versioning | Internal APIs only | Simple, but inflexible |
Our default: URL versioning for external APIs, additive changes without versioning when possible.
// Additive changes: don't require version bump
// Old response
{ "name": "John" }
// New response (backwards compatible)
{ "name": "John", "email": "john@example.com" }
// Breaking changes: require new version
// v1: { "name": "John Doe" }
// v2: { "firstName": "John", "lastName": "Doe" }
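One way to soften the "multiple codebases" cost of URL versioning is to keep a single internal model and adapt it for older versions at the edge. The sketch below assumes the v1 and v2 shapes from the example above; `UserV1`, `UserV2`, and `toV1` are hypothetical names.

```typescript
// Shapes taken from the v1/v2 example above; the adapter is a hypothetical sketch.
interface UserV2 {
  firstName: string;
  lastName: string;
  email?: string;
}

interface UserV1 {
  name: string;
  email?: string;
}

// Serve /v1 from the v2 model so only one implementation has to be maintained.
function toV1(user: UserV2): UserV1 {
  const v1: UserV1 = { name: `${user.firstName} ${user.lastName}`.trim() };
  // Only include optional fields that are actually present.
  if (user.email !== undefined) v1.email = user.email;
  return v1;
}
```

With this approach the v1 route handler becomes a thin wrapper: load the current model, call `toV1`, return the result.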
GraphQL When Appropriate
GraphQL excels when clients have varying data needs.
# Client specifies exactly what it needs
query OrderSummary($id: ID!) {
  order(id: $id) {
    id
    status
    total
  }
}

query OrderDetails($id: ID!) {
  order(id: $id) {
    id
    status
    total
    items {
      product { name, image }
      quantity
      price
    }
    shipments {
      carrier
      trackingNumber
      status
    }
    history {
      timestamp
      event
      actor
    }
  }
}
API Security
Every API needs proper security. No exceptions.
| Layer | Implementation | Purpose |
|---|---|---|
| Transport | TLS 1.3 | Encrypt in transit |
| Authentication | OAuth 2.0 / JWT | Verify identity |
| Authorization | RBAC / ABAC | Check permissions |
| Rate Limiting | Token bucket | Prevent abuse |
| Input Validation | Schema validation | Prevent injection |
// API security middleware, applied in this order on every request
app.use(helmet()); // Security headers
app.use(cors(corsOptions)); // CORS policy
app.use(rateLimiter); // Rate limiting
app.use(authenticate); // JWT validation
app.use(validateInput(schema)); // Input validation
app.use(authorize(permissions)); // Permission check
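The table above names token bucket as the rate-limiting algorithm. A minimal in-memory sketch of that algorithm follows; the `TokenBucket` class, its parameters, and the injectable clock are assumptions for illustration, not the `rateLimiter` middleware used above.

```typescript
// Minimal in-memory token bucket; a sketch, not tied to any rate-limiting library.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;
  private capacity: number;        // maximum burst size
  private refillPerSecond: number; // sustained request rate
  private now: () => number;       // clock in milliseconds, injectable for tests

  constructor(capacity: number, refillPerSecond: number, now: () => number = Date.now) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now;
    this.tokens = capacity; // start with a full bucket
    this.lastRefill = now();
  }

  // Returns true if the request may proceed, false if it should be rejected (e.g. HTTP 429).
  tryConsume(cost: number = 1): boolean {
    const current = this.now();
    const elapsedSeconds = (current - this.lastRefill) / 1000;
    // Add tokens for the time that has passed, never exceeding capacity.
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefill = current;
    if (this.tokens < cost) return false;
    this.tokens -= cost;
    return true;
  }
}
```

A per-process bucket like this only works for a single instance. Once the API runs on several servers, the counters usually need to live in a shared store such as Redis.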
Authentication & Authorization
Auth is critical infrastructure. Get it wrong, and everything is compromised.
Authentication: Who Are You?
| Method | Use Case | Security Level |
|---|---|---|
| Session cookies | Web apps | High (with proper config) |
| JWT tokens | APIs, SPAs | High (short expiry + refresh) |
| API keys | Server-to-server | Medium (rotate regularly) |
| OAuth 2.0 | Third-party access | High (proper flow selection) |
// JWT with refresh token flow
interface TokenPair {
  accessToken: string;  // Short-lived: 15 min
  refreshToken: string; // Longer-lived: 7 days, rotating
}

// Shared by login and refresh, so refresh never has to re-validate credentials
async function issueTokens(user): Promise<TokenPair> {
  const accessToken = jwt.sign(
    { sub: user.id, roles: user.roles },
    ACCESS_SECRET,
    { expiresIn: '15m' }
  );
  const refreshToken = await createRefreshToken(user.id);
  return { accessToken, refreshToken };
}

async function login(credentials): Promise<TokenPair> {
  const user = await validateCredentials(credentials);
  return issueTokens(user);
}

async function refresh(refreshToken): Promise<TokenPair> {
  const valid = await validateRefreshToken(refreshToken);
  if (!valid) throw new UnauthorizedError();
  // Rotate refresh token (one-time use)
  await revokeRefreshToken(refreshToken);
  return issueTokens(valid.user);
}
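The one-time-use rule for refresh tokens can be sketched with an in-memory store. Everything here is hypothetical (`RefreshTokenStore`, the `familyId` field, the injected token generator). The reuse detection, where presenting an already rotated token revokes every token from the same login, is a common extension and not something the code above does.

```typescript
// In-memory stand-in for a refresh-token table; all names here are hypothetical.
interface RefreshRecord {
  userId: string;
  familyId: string; // all tokens descended from one login share a family
  used: boolean;
}

class RefreshTokenStore {
  private records = new Map<string, RefreshRecord>();
  private generateToken: () => string;

  // generateToken must return unguessable values in real use (for example from a CSPRNG).
  constructor(generateToken: () => string) {
    this.generateToken = generateToken;
  }

  // Issue a token; without a familyId the token starts a new family.
  issue(userId: string, familyId?: string): string {
    const token = this.generateToken();
    this.records.set(token, { userId, familyId: familyId ?? token, used: false });
    return token;
  }

  // Exchange a token for a new one. Unknown tokens return null. Reuse of an
  // already rotated token revokes the whole family and also returns null.
  rotate(token: string): { userId: string; refreshToken: string } | null {
    const record = this.records.get(token);
    if (!record) return null;
    if (record.used) {
      this.revokeFamily(record.familyId);
      return null;
    }
    record.used = true;
    return { userId: record.userId, refreshToken: this.issue(record.userId, record.familyId) };
  }

  private revokeFamily(familyId: string): void {
    this.records.forEach((record, token) => {
      if (record.familyId === familyId) this.records.delete(token);
    });
  }
}
```

A production version would also store expiry times and persist records in a database, so that revocation survives restarts.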
Authorization: What Can You Do?
RBAC (Role-Based Access Control) works for most cases. ABAC (Attribute-Based Access Control) adds flexibility when decisions depend on attributes of the user, the resource, or the request.
// Role-based access control
const roles = {
  admin: ['*'], // Everything
  manager: ['orders:*', 'products:read', 'customers:read'],
  support: ['orders:read', 'orders:update', 'customers:read'],
  viewer: ['orders:read', 'products:read']
};

function authorize(permission: string) {
  return (req, res, next) => {
    const userPermissions = expandRolePermissions(req.user.roles);
    if (!hasPermission(userPermissions, permission)) {
      return res.status(403).json({ error: 'Forbidden' });
    }
    next();
  };
}

// Usage
app.delete('/orders/:id', authorize('orders:delete'), deleteOrder);
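The middleware above calls `expandRolePermissions` and `hasPermission` without defining them. One plausible implementation is sketched below, assuming permissions follow a `resource:action` convention with `*` as a wildcard. The matching rules are an assumption, not the original authors' code.

```typescript
// Hypothetical implementations of the helpers used by the authorize middleware.
// Permission strings follow "resource:action", with "*" as a wildcard.
const roles: Record<string, string[]> = {
  admin: ["*"],
  manager: ["orders:*", "products:read", "customers:read"],
  support: ["orders:read", "orders:update", "customers:read"],
  viewer: ["orders:read", "products:read"],
};

// Union of all permissions granted by the given roles; unknown roles grant nothing.
function expandRolePermissions(userRoles: string[]): Set<string> {
  const granted = new Set<string>();
  for (const role of userRoles) {
    for (const permission of roles[role] ?? []) granted.add(permission);
  }
  return granted;
}

// True if a granted permission is "*", matches exactly, or is "<resource>:*".
function hasPermission(granted: Set<string>, required: string): boolean {
  const [resource] = required.split(":");
  return granted.has("*") || granted.has(required) || granted.has(`${resource}:*`);
}
```

Under these rules, a `support` user can read and update orders but gets a 403 on `DELETE /orders/:id`, while `manager` passes through the `orders:*` wildcard.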
Real-Time Systems
Many modern applications need real-time capabilities: live updates, collaborative features, instant notifications.
Technology Selection
| Technology | Best For | Trade-offs |
|---|---|---|
| WebSockets | Bidirectional, high-frequency | Connection management complexity |
| Server-Sent Events | Server→Client updates | Simpler, but one-directional |
| Long Polling | Fallback, simple use cases | Higher latency, more requests |
| WebRTC | Peer-to-peer, media | Complex, specific use cases |
WebSocket Architecture
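// Scalable WebSocket setup with Redis pub/sub
// (assuming @socket.io/redis-adapter, which needs separate publish and subscribe connections)
const pubClient = redisClient;
const subClient = redisClient.duplicate();
const io = new Server(server, {
  adapter: createAdapter(pubClient, subClient) // Scale across multiple servers
});

// Room-based subscriptions
io.on('connection', (socket) => {
  // Join user's personal room
  socket.join(`user:${socket.user.id}`);
  // Join organization room if B2B
  if (socket.user.orgId) {
    socket.join(`org:${socket.user.orgId}`);
  }
  // Join one room per role so role-wide broadcasts (e.g. 'role:support') have members
  for (const role of socket.user.roles ?? []) {
    socket.join(`role:${role}`);
  }
});

// Publish updates from anywhere
async function notifyOrderUpdate(order) {
  // Notify the customer
  io.to(`user:${order.customerId}`).emit('order:updated', order);
  // Notify support team
  io.to('role:support').emit('order:updated', order);
}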
Event-Driven Architecture
For complex systems, events decouple components and enable async processing.
// Event bus interface
interface EventBus {
  publish(event: string, payload: any): Promise<void>;
  subscribe(event: string, handler: EventHandler): void;
}

// Domain events
type DomainEvent =
  | { type: 'order.created', order: Order }
  | { type: 'order.paid', order: Order, payment: Payment }
  | { type: 'order.shipped', order: Order, shipment: Shipment }
  | { type: 'order.delivered', order: Order };

// Loosely coupled handlers
eventBus.subscribe('order.paid', async (event) => {
  await inventoryService.reserve(event.order.items);
});
eventBus.subscribe('order.paid', async (event) => {
  await notificationService.sendConfirmation(event.order);
});
eventBus.subscribe('order.paid', async (event) => {
  await analyticsService.trackPurchase(event.order);
});
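The `EventBus` interface above can be backed by something very small while the system is still a single process. The following is a sketch under that assumption: `InMemoryEventBus` and its `failures` list are hypothetical, and the payload is typed `unknown` rather than `any` so that handlers must narrow it themselves.

```typescript
type EventHandler = (payload: unknown) => Promise<void>;

interface EventBus {
  publish(event: string, payload: unknown): Promise<void>;
  subscribe(event: string, handler: EventHandler): void;
}

// In-process implementation; a distributed system would more likely sit on a message broker.
class InMemoryEventBus implements EventBus {
  private handlers = new Map<string, EventHandler[]>();
  // Handler errors are collected here instead of failing the publisher.
  failures: { event: string; error: unknown }[] = [];

  subscribe(event: string, handler: EventHandler): void {
    const existing = this.handlers.get(event) ?? [];
    this.handlers.set(event, [...existing, handler]);
  }

  // Starts every handler; a failing handler does not prevent the others from running.
  async publish(event: string, payload: unknown): Promise<void> {
    const handlers = this.handlers.get(event) ?? [];
    await Promise.all(
      handlers.map((handler) =>
        handler(payload).catch((error: unknown) => {
          this.failures.push({ event, error });
        })
      )
    );
  }
}
```

Isolating failures this way keeps a broken analytics handler from blocking inventory or notifications. The trade-off is that the publisher no longer learns about handler errors unless something inspects `failures` or logs them.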
Scaling Strategies
Scaling is about handling growth without rewriting everything.
Database Scaling
| Strategy | When | Complexity |
|---|---|---|
| Vertical scaling | First step, up to ~64 cores | Low |
| Read replicas | Read-heavy workloads | Low |
| Connection pooling | Many app instances | Low |
| Partitioning | Time-series, multi-tenant | Medium |
| Sharding | Extreme scale | High |
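// Read replica routing
class Database {
  private primary: Pool;
  private replicas: Pool[];

  async query(sql: string, params: any[], options?: QueryOptions) {
    const pool = options?.readOnly
      ? this.getRandomReplica()
      : this.primary;
    return pool.query(sql, params);
  }

  private getRandomReplica(): Pool {
    // Fall back to the primary when no replicas are configured
    if (this.replicas.length === 0) return this.primary;
    return this.replicas[Math.floor(Math.random() * this.replicas.length)];
  }
}

// Usage
const order = await db.query(
  'SELECT * FROM orders WHERE id = $1',
  [orderId],
  { readOnly: true } // Can use replica (which may lag slightly behind the primary)
);
await db.query(
  'UPDATE orders SET status = $1 WHERE id = $2',
  ['shipped', orderId]
  // No readOnly: uses primary
);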
Application Scaling
| Pattern | Purpose | Implementation |
|---|---|---|
| Horizontal scaling | Handle more requests | Kubernetes HPA |
| Caching | Reduce database load | Redis, CDN |
| Async processing | Offload heavy work | Job queues |
| Circuit breakers | Handle failures gracefully | Resilience patterns |
// Caching strategy
async function getProduct(id: string): Promise<Product> {
  // Check cache first
  const cached = await cache.get(`product:${id}`);
  if (cached) return cached;
  // Cache miss: fetch from DB (assuming db.query resolves to a result object with a rows array)
  const { rows } = await db.query('SELECT * FROM products WHERE id = $1', [id]);
  const product = rows[0];
  // Cache for future requests
  await cache.set(`product:${id}`, product, { ttl: 3600 });
  return product;
}

// Cache invalidation on update
async function updateProduct(id: string, data: Partial<Product>) {
  await db.query('UPDATE products SET ...', [data, id]);
  await cache.delete(`product:${id}`);
  // Plain key deletion (e.g. Redis DEL) does not expand wildcards, so list caches
  // need an explicit helper. invalidateListCaches is hypothetical.
  await invalidateListCaches('product-list:');
}
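The table above also lists circuit breakers. The sketch below shows the usual closed / open / half-open state machine. The `CircuitBreaker` class, its thresholds, and the injectable clock are illustrative assumptions rather than the API of any specific resilience library.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Minimal circuit breaker: the caller asks allowRequest() before the protected
// call and reports the outcome afterwards.
class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private failureThreshold: number; // consecutive failures before opening
  private resetAfterMs: number;     // how long to stay open before a trial
  private now: () => number;        // clock in milliseconds, injectable for tests

  constructor(failureThreshold: number, resetAfterMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now;
  }

  // False means: fail fast without calling the downstream dependency.
  allowRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.resetAfterMs) {
      this.state = "half-open"; // cool-down elapsed: let trial requests through
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed trial reopens immediately; otherwise open once the threshold is hit.
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}
```

Typical use is to wrap each call to a flaky dependency: check `allowRequest()`, return a fallback or an error if it is false, and otherwise call the dependency and report `recordSuccess()` or `recordFailure()`. This simplified version lets every request through while half-open; stricter implementations admit only one trial at a time.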
Infrastructure Patterns
How you run your software matters as much as how you write it.
Container Orchestration
Kubernetes has become the standard. Here's a reasonable baseline for a production Deployment:
# Deployment with best practices
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: api:v1.2.3
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5
Multi-Region Deployment
For global availability and performance:
                 ┌─────────────┐
                 │   Global    │
                 │   DNS/CDN   │
                 └──────┬──────┘
                        │
     ┌──────────────────┼──────────────────┐
     │                  │                  │
┌────▼────┐        ┌────▼────┐        ┌────▼────┐
│ EU-WEST │        │ US-EAST │        │ AP-SOUTH│
│ Region  │        │ Region  │        │ Region  │
└────┬────┘        └────┬────┘        └────┬────┘
     │                  │                  │
┌────▼────┐        ┌────▼────┐        ┌────▼────┐
│   DB    │◄──────►│   DB    │◄──────►│   DB    │
│(Primary)│        │(Replica)│        │(Replica)│
└─────────┘        └─────────┘        └─────────┘
Monolith vs. Microservices
The eternal debate. Here's our take:
| Start with Monolith When | Consider Microservices When |
|---|---|
| New product, uncertain requirements | Clear domain boundaries exist |
| Small team (< 10 engineers) | Multiple teams need independence |
| Need to move fast | Different scaling requirements per service |
| Don't know your domains yet | Different technology needs per service |
The Modular Monolith
Best of both worlds: monolith deployment simplicity with service-like boundaries.
src/
├── modules/
│ ├── orders/ # Order domain
│ │ ├── api/ # HTTP handlers
│ │ ├── services/ # Business logic
│ │ ├── repository/ # Data access
│ │ └── events/ # Domain events
│ │
│ ├── inventory/ # Inventory domain
│ │ ├── api/
│ │ ├── services/
│ │ └── ...
│ │
│ └── customers/ # Customer domain
│ └── ...
│
├── shared/ # Cross-cutting concerns
│ ├── database/
│ ├── auth/
│ └── events/
│
└── main.ts # Single deployment
Modules communicate through well-defined interfaces. When you need to extract a service, the boundaries are already there.
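As a rough illustration of such an interface, the orders module below depends only on a narrow inventory contract and never on inventory internals. `InventoryApi`, `InMemoryInventory`, and `OrderService` are hypothetical names invented for this sketch.

```typescript
// modules/inventory: the only surface other modules may depend on (hypothetical names).
interface InventoryApi {
  // Returns true and decrements stock if enough units are available.
  reserve(sku: string, quantity: number): boolean;
}

class InMemoryInventory implements InventoryApi {
  private stock = new Map<string, number>();

  constructor(initial: Record<string, number>) {
    Object.keys(initial).forEach((sku) => this.stock.set(sku, initial[sku]));
  }

  reserve(sku: string, quantity: number): boolean {
    const available = this.stock.get(sku) ?? 0;
    if (quantity <= 0 || available < quantity) return false;
    this.stock.set(sku, available - quantity);
    return true;
  }
}

// modules/orders: receives the interface through its constructor.
class OrderService {
  private inventory: InventoryApi;

  constructor(inventory: InventoryApi) {
    this.inventory = inventory;
  }

  placeOrder(sku: string, quantity: number): "confirmed" | "out_of_stock" {
    return this.inventory.reserve(sku, quantity) ? "confirmed" : "out_of_stock";
  }
}
```

If inventory is later extracted into its own service, `OrderService` can be handed a client that implements the same contract over the network. In practice that would mean making the interface asynchronous, which is easier to do early than late.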
Disaster Recovery
Systems fail. Plan for it.
| Level | RTO | RPO | Implementation |
|---|---|---|---|
| Basic | Hours | Hours | Daily backups, manual restore |
| Standard | 30 min | 15 min | Automated backups, standby DB |
| High Availability | Minutes | Minutes | Multi-AZ, automated failover |
| Mission Critical | Seconds | Near-zero | Multi-region, active-active |
# Backup configuration example
backups:
  database:
    frequency: "every 6 hours"
    retention: "30 days"
    type: "point-in-time recovery"
    destination: "s3://backups/postgres"
  application:
    frequency: "every deployment"
    retention: "90 days"
    type: "container images"
    destination: "ecr://app-images"
  documents:
    frequency: "real-time"
    retention: "indefinite"
    type: "object versioning"
    destination: "s3://documents"
Conclusion
Good architecture isn't about following trends or implementing every pattern you've read about. It's about understanding your actual needs and making deliberate trade-offs.
The best architectures are the simplest ones that solve the problem. Complexity is a cost, not a feature.
We've designed systems handling millions of requests, processing real-time data, and serving global users. The common thread isn't sophisticated technology—it's thoughtful design that matches the actual requirements.
If you're facing architectural decisions or struggling with systems that don't scale, we'd be happy to discuss your specific situation.