Kubernetes for SaaS: When It's Right, When ECS Wins, and What We Chose
Kubernetes vs ECS vs Lambda for SaaS platforms. Multi-tenant isolation, deployment strategies, networking, cost optimization, and the honest decision framework from running all three in production.
The Kubernetes Decision
Every engineering team eventually asks: should we use Kubernetes? The honest answer is: it depends on what you're running, how many services you have, and whether you can afford the operational overhead.
We run three different compute strategies in production. One platform runs on Kubernetes (7+ services, Pimcore, OpenSearch, workers). Another runs on ECS Fargate + Lambda (serverless-first, event-driven). A third uses a mix of both. Each was the right choice for its context.
This article covers the decision framework and the implementation patterns for each approach. For how we manage the infrastructure as code behind these deployments, see our IaC guide. For the application architectures that run on top, see our system architecture guide.
The Honest Comparison
| Criteria | Kubernetes (EKS/AKS/GKE) | ECS Fargate | Lambda |
|---|---|---|---|
| Operational complexity | High (cluster upgrades, networking, RBAC) | Medium (task definitions, service mesh) | Low (just deploy functions) |
| Cold start | None (pods are always running) | None (tasks are always running) | 100ms-5s (depends on runtime/package) |
| Scaling speed | Minutes (pod scheduling + node scaling) | Seconds (task launch) | Milliseconds (concurrent invocations) |
| Cost at idle | High (minimum 2-3 nodes running always) | Medium (pay per running task) | Zero (pay per invocation) |
| Cost at scale | Low (efficient packing, spot instances) | Medium (less efficient packing) | Can be high (per-invocation pricing) |
| Stateful workloads | Good (PVCs, StatefulSets) | Limited (EFS only) | Not supported |
| Long-running processes | Unlimited | Unlimited | 15 min max |
| Ecosystem | Enormous (Helm, operators, service mesh) | AWS-native | AWS-native |
| Multi-cloud | Yes (same manifests, different providers) | AWS only | AWS only |
| Team skill requirement | High (K8s expertise needed) | Medium (AWS knowledge) | Low (just write functions) |
| Best for | Complex multi-service systems, stateful workloads | Simple microservices, containers without K8s overhead | Event-driven, API endpoints, scheduled tasks |
The Real Cost Breakdown
For a typical SaaS platform with 5 services:
| Component | Kubernetes (EKS) | ECS Fargate | Lambda + API Gateway |
|---|---|---|---|
| Compute (monthly) | ~$600 (3 nodes t3.large + pods) | ~$450 (5 services, 0.5 vCPU each) | ~$50-500 (depends on traffic) |
| Control plane | $73/month (EKS fee) | Free | Free |
| Load balancer | $25/month (ALB) | $25/month (ALB) | Included in API GW |
| Networking (NAT) | $45/month | $45/month | $45/month |
| Monitoring | $50-200/month | $50-200/month | $50-200/month |
| Total (low traffic) | ~$800-1,000/month | ~$570-720/month | ~$200-800/month |
| Total (high traffic) | ~$1,500-3,000/month | ~$2,000-4,000/month | ~$3,000-10,000/month |
Kubernetes is cheapest at scale (efficient bin-packing, spot instances, reserved capacity). Lambda is cheapest at low traffic (pay nothing at idle). ECS Fargate is the middle ground.
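The crossover between these pricing models can be sketched with a toy break-even model. Every number below is an illustrative assumption (a fixed platform base cost, a marginal cost per million requests, a 512 MB / 1 s average Lambda invocation), not a vendor quote:

```python
# Toy break-even model for the cost table above.
# All numbers are illustrative assumptions, not vendor quotes.

def k8s_monthly_cost(requests: int) -> float:
    """Fixed base (control plane, minimum nodes, ALB, NAT) plus a small
    marginal cost per million requests thanks to efficient bin-packing."""
    base = 800.0                  # assumed fixed monthly floor
    marginal_per_million = 2.0    # assumed packed-compute cost per 1M requests
    return base + marginal_per_million * requests / 1_000_000

def lambda_monthly_cost(requests: int) -> float:
    """Near-zero base, but a comparatively high per-invocation cost.
    Assumes 512 MB and 1 s average duration => ~0.5 GB-s per request."""
    request_fee_per_million = 0.20
    compute_fee_per_million = 0.5 * 0.0000166667 * 1_000_000  # GB-s pricing
    per_million = request_fee_per_million + compute_fee_per_million
    return per_million * requests / 1_000_000

# Lambda wins at low traffic; Kubernetes wins once the fixed base
# is amortized over enough requests.
for monthly_requests in (1_000_000, 50_000_000, 500_000_000):
    k = k8s_monthly_cost(monthly_requests)
    l = lambda_monthly_cost(monthly_requests)
    print(f"{monthly_requests:>11,} req/mo  K8s ${k:,.0f}  Lambda ${l:,.0f}")
```

Under these particular assumptions the break-even lands somewhere above 100M requests/month; your own memory settings and average duration move it dramatically, so run the numbers with your real workload profile.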
When to Choose Kubernetes
Choose Kubernetes when you have:
Complex multi-service systems. If you're running 7+ services with interdependencies, shared configuration, service discovery, and coordinated deployments, Kubernetes orchestrates this well. Individual Docker containers on ECS become hard to manage at this scale.
Stateful workloads. Databases, search engines (OpenSearch, MeiliSearch), message brokers (RabbitMQ), and cache clusters (Redis) all benefit from Kubernetes StatefulSets, PersistentVolumeClaims, and operators. Running these on ECS requires external managed services for every stateful component.
Multi-cloud requirements. Kubernetes manifests work on any cloud provider. ECS and Lambda are AWS-only. If you need to run on AWS and Azure (or might need to in the future), Kubernetes is the portable choice.
A platform team. Kubernetes requires ongoing maintenance: cluster upgrades (every 3-4 months for security patches), node group management, networking configuration (ingress controllers, network policies), and RBAC management. Without a dedicated person or team handling this, the operational overhead will slow the entire engineering organization.
Kubernetes Architecture for a PIM/Commerce Platform
```
┌─────────────────────────────────────────────────────────────┐
│                     Kubernetes Cluster                      │
│                                                             │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │     Ingress     │ │  Cert-Manager   │ │  External DNS   │ │
│ │ (Nginx/Traefik) │ │ (Let's Encrypt) │ │ (Route53 sync)  │ │
│ └────────┬────────┘ └─────────────────┘ └─────────────────┘ │
│          │                                                  │
│ ┌────────▼──────────────────────────────────────────────┐   │
│ │                      Namespaces                       │   │
│ │                                                       │   │
│ │ ┌───────────────────────────────────────────────────┐ │   │
│ │ │ production namespace                              │ │   │
│ │ │                                                   │ │   │
│ │ │  pimcore-web     (2-4 replicas)                   │ │   │
│ │ │  pimcore-worker  (1-3 replicas)                   │ │   │
│ │ │  pimcore-ops     (1 replica, maintenance)         │ │   │
│ │ │  frontend        (2-3 replicas)                   │ │   │
│ │ └───────────────────────────────────────────────────┘ │   │
│ │                                                       │   │
│ │ ┌───────────────────────────────────────────────────┐ │   │
│ │ │ data namespace                                    │ │   │
│ │ │                                                   │ │   │
│ │ │  mysql      (StatefulSet, 1 replica or managed)   │ │   │
│ │ │  redis      (StatefulSet, 1 replica or managed)   │ │   │
│ │ │  opensearch (StatefulSet, 2-3 replicas)           │ │   │
│ │ │  rabbitmq   (StatefulSet, 1-3 replicas)           │ │   │
│ │ └───────────────────────────────────────────────────┘ │   │
│ │                                                       │   │
│ │ ┌───────────────────────────────────────────────────┐ │   │
│ │ │ flux-system namespace (GitOps controller)         │ │   │
│ │ └───────────────────────────────────────────────────┘ │   │
│ └───────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
```
Deployment Strategy: GitOps with Flux
We use Flux for GitOps-based deployments. The Git repository is the single source of truth. Flux reconciles the cluster state with the repository every minute.
```yaml
# flux-system/kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: platform
  namespace: flux-system
spec:
  interval: 1m
  sourceRef:
    kind: GitRepository
    name: infrastructure
  path: ./kubernetes/resources/overlay/prod
  prune: true # Remove resources deleted from Git
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: pimcore
      namespace: production
```
Benefits over `kubectl apply` or CI-driven deployments:
- Drift detection and correction. If someone changes a resource manually, Flux reverts it within 1 minute.
- Git as audit trail. Every change is a Git commit with author, timestamp, and diff.
- No cluster credentials in CI. Flux pulls from Git. CI pushes to Git. The CI pipeline never needs kubectl access.
- Rollback is git revert. Revert the commit, Flux reconciles, rollback complete.
Kustomize for Environment Overlays
```
kubernetes/resources/
├── base/
│   ├── deployments/
│   │   ├── pimcore.yaml
│   │   ├── frontend.yaml
│   │   └── worker.yaml
│   ├── services/
│   ├── configmaps/
│   └── kustomization.yaml
└── overlay/
    ├── prod/
    │   ├── patches/
    │   │   ├── pimcore-replicas.yaml   # 4 replicas
    │   │   ├── resource-limits.yaml    # Higher CPU/memory
    │   │   └── env-secrets.yaml        # Production secrets
    │   └── kustomization.yaml
    ├── staging/
    │   ├── patches/
    │   │   ├── pimcore-replicas.yaml   # 1 replica
    │   │   └── resource-limits.yaml    # Lower limits
    │   └── kustomization.yaml
    └── dev/
        └── kustomization.yaml
```
Base manifests define the common structure. Overlays patch for environment-specific differences (replicas, resource limits, secrets, domains). Same application, different configuration per environment.
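As a minimal sketch, the prod overlay's kustomization.yaml pulls in the base and applies its patches. The patch file names follow the tree above; the comments describing what each patch does are illustrative assumptions:

```yaml
# overlay/prod/kustomization.yaml (sketch; file names follow the tree above)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - path: patches/pimcore-replicas.yaml   # bumps the Deployment to 4 replicas
  - path: patches/resource-limits.yaml    # raises CPU/memory requests and limits
  - path: patches/env-secrets.yaml        # production secret references
```

Running `kustomize build overlay/prod` (or letting Flux do it) renders the base manifests with the production patches applied.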
When ECS Fargate Wins
We chose ECS Fargate + Lambda for a commerce platform instead of Kubernetes. The reasons:
Simpler operations. No cluster upgrades, no node management, no RBAC configuration. ECS handles scheduling, scaling, and health checks. The team focuses on application code, not infrastructure.
Faster scaling. ECS Fargate launches new tasks in seconds. Kubernetes needs to schedule pods, potentially wait for node scaling (minutes), and pass health checks. For traffic spikes, Fargate responds faster.
Better cost for variable workloads. Pay per running task, not per node. If traffic drops to zero at night, costs drop proportionally. Kubernetes nodes keep running (and charging) regardless of load.
```typescript
// ECS service definition (via CDK)
const service = new ecs.FargateService(this, 'ApiService', {
  cluster,
  taskDefinition,
  desiredCount: 2,
  assignPublicIp: false,
  vpcSubnets: { subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
  circuitBreaker: { rollback: true }, // Auto-rollback on deployment failure
  capacityProviderStrategies: [
    { capacityProvider: 'FARGATE_SPOT', weight: 2 }, // 66% spot
    { capacityProvider: 'FARGATE', weight: 1 },      // 33% on-demand
  ],
});

// Auto-scaling
const scaling = service.autoScaleTaskCount({ minCapacity: 2, maxCapacity: 10 });
scaling.scaleOnCpuUtilization('CpuScaling', { targetUtilizationPercent: 70 });
scaling.scaleOnRequestCount('RequestScaling', {
  targetGroup,
  requestsPerTarget: 1000,
});
```
Lambda for Event-Driven Workloads
Lambda functions handle event-driven workloads that don't justify a persistent service:
```typescript
// Lambda for webhook processing
const webhookHandler = new lambda.Function(this, 'WebhookHandler', {
  runtime: lambda.Runtime.NODEJS_20_X,
  handler: 'webhook.handler',
  code: lambda.Code.fromAsset('lambda'), // directory containing webhook.js
  timeout: cdk.Duration.seconds(30),
  memorySize: 256,
  environment: {
    TABLE_NAME: table.tableName,
    QUEUE_URL: queue.queueUrl,
  },
});

// API Gateway triggers Lambda
const api = new apigateway.RestApi(this, 'WebhookApi');
api.root.addResource('webhook').addMethod('POST',
  new apigateway.LambdaIntegration(webhookHandler)
);
```
The Hybrid: ECS + Lambda
The architecture we use most often for commerce platforms:
| Component | Runs On | Why |
|---|---|---|
| Commerce API (Vendure) | ECS Fargate | Long-running, stateful sessions |
| Worker service | ECS Fargate | Persistent queue consumer |
| Webhook handlers | Lambda | Event-driven, sporadic traffic |
| Scheduled tasks | Lambda + EventBridge | Cron-like, no persistent process needed |
| Image processing | Lambda | CPU-intensive, parallelizable |
| Search indexing | Lambda + SQS | Event-driven, bursty |
| Admin dashboard | ECS Fargate or S3+CloudFront | Static assets or SSR |
The commerce API and workers run on Fargate (persistent, long-running). Everything event-driven runs on Lambda (pay-per-use, auto-scaling). The combination is cheaper than running everything on Fargate and simpler than running everything on Kubernetes.
Multi-Tenant Isolation on Kubernetes
If you run a multi-tenant SaaS on Kubernetes, tenant isolation needs explicit configuration:
Namespace Isolation
```yaml
# Network policy: pods in the tenant-a namespace can only talk to each other
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tenant-isolation
  namespace: tenant-a
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              tenant: tenant-a
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              tenant: tenant-a
    - to: # Allow DNS resolution
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - port: 53
          protocol: UDP
```
Resource Quotas
Prevent one tenant from consuming all cluster resources:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-a-quota
  namespace: tenant-a
spec:
  hard:
    requests.cpu: "4"        # Max 4 CPU cores
    requests.memory: "8Gi"   # Max 8GB RAM
    limits.cpu: "8"
    limits.memory: "16Gi"
    pods: "20"               # Max 20 pods
    services: "10"
    persistentvolumeclaims: "5"
```
The Noisy Neighbor Problem
Even with resource quotas, one tenant's I/O-heavy workload can affect others on the same node. Solutions:
| Strategy | Isolation Level | Cost Impact |
|---|---|---|
| Shared nodes, resource quotas | Soft (CPU/memory limited, I/O shared) | Lowest |
| Node affinity (dedicated node pools) | Medium (dedicated nodes per tenant) | Higher |
| Dedicated clusters | Full (completely separate infrastructure) | Highest |
For most SaaS applications, shared nodes with resource quotas is sufficient. Reserve dedicated node pools for enterprise tenants with strict isolation requirements. For the application-level isolation patterns (API middleware, query filters, policies), see our multi-tenant design guide.
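For the dedicated-node-pool tier, node affinity plus a matching taint keeps an enterprise tenant's workload on its own nodes. A minimal sketch, where the `tenant-pool` label/taint, namespace, and image names are illustrative assumptions:

```yaml
# Sketch: pin an enterprise tenant's pods to a dedicated node pool.
# Assumes the pool's nodes are labeled and tainted with tenant-pool=enterprise-a.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: tenant-enterprise-a
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: tenant-pool
                    operator: In
                    values: ["enterprise-a"]
      tolerations: # pairs with a NoSchedule taint on the pool's nodes
        - key: tenant-pool
          value: enterprise-a
          effect: NoSchedule
      containers:
        - name: api
          image: example/api:latest
```

The affinity pulls the tenant's pods onto the dedicated nodes; the taint/toleration pair keeps everyone else's pods off them. Both halves are needed for real isolation.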
Cost Optimization
Spot Instances (Kubernetes)
Spot instances are 60-90% cheaper than on-demand. Use them for stateless workloads that can tolerate interruption:
```yaml
# EKS managed node group with spot instances
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: production
  region: eu-central-1
managedNodeGroups:
  - name: spot-workers
    instanceTypes: ["t3.large", "t3.xlarge", "m5.large"]
    spot: true
    minSize: 2
    maxSize: 10
    desiredCapacity: 3
    labels:
      node-type: spot
  - name: on-demand-workers
    instanceTypes: ["t3.large"]
    minSize: 1
    maxSize: 3
    desiredCapacity: 1
    labels:
      node-type: on-demand
```
Run stateless services (web servers, workers) on spot. Run stateful services (databases, search engines) on on-demand. Use pod anti-affinity to spread replicas across nodes so a spot interruption doesn't take down all replicas.
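In pod-spec terms, that combination looks roughly like the fragment below (part of a Deployment's `spec.template`). The `app: web` label is an assumption; the `node-type: spot` label matches the node group config above:

```yaml
# Sketch: run a stateless deployment on spot nodes while spreading
# replicas across hosts so one interruption can't take them all down.
spec:
  template:
    spec:
      nodeSelector:
        node-type: spot # label from the spot node group above
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: web
                topologyKey: kubernetes.io/hostname
```

`preferred` (rather than `required`) anti-affinity lets the scheduler co-locate replicas if it has no other option, which is usually the right trade-off for web tiers.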
Right-Sizing
Most teams over-provision. A service requesting 1 CPU and 2GB RAM might actually use 0.2 CPU and 400MB. Over-provisioning wastes money. Under-provisioning causes OOM kills.
```bash
# Check actual resource usage vs requests
kubectl top pods -n production

# Compare with resource requests in deployment manifests
kubectl get pods -n production -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources.requests}{"\n"}{end}'
```
Use Vertical Pod Autoscaler (VPA) in recommendation mode to see what your pods actually need:
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: pimcore-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pimcore
  updatePolicy:
    updateMode: "Off" # Recommendation only, don't auto-apply
```
Autoscaling
Horizontal Pod Autoscaler (HPA) scales based on metrics:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: pimcore-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pimcore
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300 # Wait 5 min before scaling down
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60 # Remove max 1 pod per minute
```
The `stabilizationWindowSeconds` setting prevents flapping (scale up, scale down, scale up). The `scaleDown` policies prevent aggressive scale-down that might cause capacity issues during the next traffic spike.
The Networking Minefield
Kubernetes networking is where most teams get stuck.
Ingress Controllers
| Controller | Best For | Complexity |
|---|---|---|
| Nginx Ingress | General purpose, most common | Low |
| Traefik | Auto-discovery, Let's Encrypt built-in | Low |
| AWS ALB Ingress | AWS-native, WAF integration | Medium |
| Istio Gateway | Service mesh, mTLS, traffic management | High |
For most SaaS platforms, Nginx Ingress + cert-manager (Let's Encrypt) is sufficient. Add a service mesh (Istio, Linkerd) only if you need mTLS between services, advanced traffic routing (canary deployments, traffic splitting), or detailed service-to-service observability.
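A minimal sketch of that combination: an Ingress annotated for cert-manager, which issues and renews the Let's Encrypt certificate automatically. The issuer name, hostname, TLS secret name, and backend port are illustrative assumptions; `pimcore-web` matches the service named in the architecture diagram:

```yaml
# Sketch: Nginx Ingress with a cert-manager issued certificate.
# Assumes a ClusterIssuer named letsencrypt-prod already exists.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: pimcore
  namespace: production
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
    - hosts: ["app.example.com"]
      secretName: app-example-com-tls # cert-manager stores the cert here
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: pimcore-web
                port:
                  number: 80
```

With external-dns also running (as in the diagram), creating this Ingress is enough to get DNS, TLS, and routing for a new hostname.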
DNS Resolution Issues
A common production issue: pods can't resolve external hostnames because the DNS configuration is wrong.
```bash
# Find the correct DNS service IP in your cluster
kubectl get svc -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}'

# If nginx configs reference a resolver, use this IP
# Common mistake: using 10.0.0.10 when the actual DNS is at 10.2.0.10
```
If your nginx sidecar proxies requests to external services (cloud storage, external APIs), the `resolver` directive must point to the cluster's kube-dns IP, not a hardcoded value.
Common Pitfalls
- Choosing Kubernetes because "everyone uses it." If you have 3 services and a small team, ECS Fargate is simpler and cheaper. Kubernetes makes sense at 7+ services with a platform team.
- No GitOps. `kubectl apply` from a developer laptop is not a deployment strategy. Use Flux or ArgoCD for reconciliation-based deployments.
- Shared cluster without resource quotas. One tenant or one runaway pod consumes all resources. Every namespace needs resource quotas.
- All pods on on-demand instances. Spot instances are 60-90% cheaper for stateless workloads. Use them for web servers and workers.
- Over-provisioning resources. Pods requesting 2 CPU and using 0.2 CPU waste money. Use VPA recommendations to right-size.
- Aggressive autoscaling. Scaling down too fast causes capacity issues on the next spike. Use stabilization windows and gradual scale-down policies.
- No network policies. Without them, any pod can talk to any other pod in the cluster. In a multi-tenant setup, this is a security issue.
- Ignoring cluster upgrades. Kubernetes versions go end-of-life every 12-15 months. Plan quarterly upgrade windows. Falling behind creates security vulnerabilities and blocks new features.
- Mixing stateful and stateless workloads on the same nodes. An OpenSearch pod and a web server pod competing for I/O on the same node degrade both. Use node affinity to separate them.
- No sealed secrets. Committing plain secrets to Git is a security breach waiting to happen. Use Sealed Secrets, External Secrets Operator, or AWS Secrets Manager.
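With the External Secrets Operator, for example, only a reference lives in Git and the operator materializes the actual Kubernetes Secret at runtime. A minimal sketch, where the store name, secret path, and key names are illustrative assumptions:

```yaml
# Sketch: External Secrets Operator pulling from AWS Secrets Manager,
# so Git holds a reference to the secret, never its value.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: pimcore-db
  namespace: production
spec:
  refreshInterval: 1h # re-sync from Secrets Manager hourly
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: pimcore-db # Kubernetes Secret created by the operator
  data:
    - secretKey: DATABASE_PASSWORD
      remoteRef:
        key: prod/pimcore/db
        property: password
```

This manifest is safe to commit: rotating the secret in Secrets Manager propagates to the cluster on the next refresh, with no Git change required.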
Key Takeaways
- Kubernetes for complex multi-service platforms. 7+ services, stateful workloads, multi-cloud requirements, and a team that can handle the operational overhead.
- ECS Fargate for simpler container workloads. Same containers, less operational complexity. Better for teams without Kubernetes expertise.
- Lambda for event-driven workloads. Webhooks, scheduled tasks, image processing, and any workload that's bursty and short-lived. Zero cost at idle.
- The hybrid (ECS + Lambda) is often the best answer. Persistent services on Fargate, event-driven work on Lambda. Cheaper than all-Kubernetes, simpler than all-Lambda.
- GitOps with Flux gives real reconciliation, not just deployment automation. Drift detection, audit trail, and rollback via `git revert`.
- Spot instances save 60-90% on stateless workloads. Run web servers and workers on spot. Run databases and search engines on on-demand.
- Multi-tenant isolation needs network policies and resource quotas. Namespace isolation alone is not enough. Enforce network boundaries and resource limits per tenant.
We deploy and manage Kubernetes, ECS, and Lambda infrastructure as part of our cloud services. If you need help choosing a compute strategy or optimizing your existing deployment, talk to our team or request a quote. See also our Pimcore upgrade guide for Kubernetes-specific deployment patterns with Pimcore.