Infrastructure as Code with 200 Resources: What Terraform Tutorials Don't Tell You
Production IaC patterns for real systems. State management at scale, module design, CDK + Terraform hybrid, drift detection, GitOps with Flux, and managing 30+ AWS services.
IaC Is Not "terraform init"
Every Terraform tutorial starts the same way: write a .tf file, run terraform init, run terraform apply, and watch your EC2 instance appear. That gets you from zero to one resource. It does not prepare you for managing 200+ resources across 30 AWS services with a team of engineers who all need to make infrastructure changes safely.
We manage infrastructure for multiple production systems ranging from Kubernetes clusters with Pimcore and OpenSearch to serverless architectures with Lambda, DynamoDB, and API Gateway. The patterns in this article are what survived production. For how we deploy applications on this infrastructure, see our cloud services page.
State Management at Scale
Terraform state is the single most critical file in your infrastructure. It maps your .tf files to real resources. Lose it, and Terraform doesn't know what exists. Corrupt it, and Terraform might destroy production resources.
Remote State (Non-Negotiable)
```hcl
# backend.tf
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "prod/platform/terraform.tfstate"
    region         = "eu-central-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
```
| Rule | Why |
|---|---|
| Remote state in S3 (or equivalent) | Local state files get lost, can't be shared |
| Encryption at rest | State contains secrets (database passwords, API keys) |
| DynamoDB locking | Prevents two engineers from running apply simultaneously |
| Versioning on the S3 bucket | Recover from state corruption by rolling back |
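The state bucket and lock table themselves can be bootstrapped with Terraform in a small, separate config that is applied once (its own state can stay local or be migrated in afterwards). A minimal sketch, reusing the names from the backend above:

```hcl
# Bootstrap resources for remote state (apply once, from a separate config)
resource "aws_s3_bucket" "state" {
  bucket = "company-terraform-state"
}

# Versioning is what makes "roll back a corrupted state file" possible
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id
  versioning_configuration {
    status = "Enabled"
  }
}

# Lock table: Terraform only requires the LockID hash key
resource "aws_dynamodb_table" "locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```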
State Organization
One large state file for everything is a maintenance disaster. Split by environment and domain:
```
terraform/
├── environments/
│   ├── prod/
│   │   ├── platform/     # EKS, VPC, networking
│   │   ├── databases/    # RDS, ElastiCache, OpenSearch
│   │   ├── compute/      # Lambda, ECS, Fargate
│   │   ├── storage/      # S3 buckets, CloudFront
│   │   └── monitoring/   # CloudWatch, alerts
│   ├── staging/
│   │   └── (same structure)
│   └── dev/
│       └── (same structure)
├── modules/
│   ├── vpc/
│   ├── eks-cluster/
│   ├── rds-postgres/
│   ├── opensearch/
│   ├── redis/
│   └── lambda-function/
└── global/
    ├── iam/       # IAM roles, policies
    ├── route53/   # DNS zones
    └── ecr/       # Container registries
```
Each directory is a separate root configuration with its own state file (a "workspace" in the loose sense, not the `terraform workspace` CLI feature). Changes to networking don't risk breaking the database. Changes to monitoring don't require a plan that touches every resource.
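Concretely, each directory carries its own backend block pointing at a distinct state key in the same bucket. A sketch for the databases workspace:

```hcl
# environments/prod/databases/backend.tf
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "prod/databases/terraform.tfstate" # unique key per workspace
    region         = "eu-central-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
```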
Cross-State References
Workspaces need to reference each other. The VPC workspace outputs the VPC ID. The database workspace reads it:
```hcl
# In databases/main.tf
data "terraform_remote_state" "platform" {
  backend = "s3"
  config = {
    bucket = "company-terraform-state"
    key    = "prod/platform/terraform.tfstate"
    region = "eu-central-1"
  }
}

resource "aws_db_instance" "main" {
  # ...
  vpc_security_group_ids = [data.terraform_remote_state.platform.outputs.db_security_group_id]
  db_subnet_group_name   = data.terraform_remote_state.platform.outputs.db_subnet_group_name
}
```
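The producing side has to declare those values explicitly; anything not exported as an output is invisible to other workspaces. A sketch of the matching platform outputs (the internal resource names `aws_security_group.db` and `aws_db_subnet_group.main` are illustrative):

```hcl
# In platform/outputs.tf
output "db_security_group_id" {
  value = aws_security_group.db.id
}

output "db_subnet_group_name" {
  value = aws_db_subnet_group.main.name
}
```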
Module Design
When to Extract a Module
Not every resource needs a module. Extract when:
- The same pattern is used in 3+ places (DRY)
- The resource group has a clear boundary (VPC, database cluster)
- The configuration has sensible defaults that reduce duplication
Don't extract when:
- It's used once (premature abstraction)
- The module would have 20+ variables (too many knobs)
- The abstraction hides important details (networking, security)
Module Interface Design
A good module has few required variables, sensible defaults, and clear outputs:
```hcl
# modules/rds-postgres/variables.tf
variable "name" {
  description = "Database instance name"
  type        = string
}

variable "vpc_id" {
  description = "VPC to deploy into"
  type        = string
}

variable "subnet_ids" {
  description = "Subnets for the DB subnet group"
  type        = list(string)
}

variable "instance_class" {
  description = "RDS instance type"
  type        = string
  default     = "db.t3.medium"
}

variable "engine_version" {
  description = "PostgreSQL version"
  type        = string
  default     = "15.4"
}

variable "allocated_storage" {
  description = "Storage in GB"
  type        = number
  default     = 50
}

variable "multi_az" {
  description = "Enable multi-AZ deployment"
  type        = bool
  default     = false # true for prod, false for staging/dev
}
```
The module consumer writes:
```hcl
module "database" {
  source     = "../../modules/rds-postgres"
  name       = "pimcore-prod"
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnet_ids
  multi_az   = true
}
```
Five lines instead of fifty. The module handles security groups, parameter groups, subnet groups, encryption, backup retention, and monitoring.
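Outputs are the other half of a clean module interface. A sketch of what this module might expose (the internal resource name `aws_db_instance.this` is illustrative):

```hcl
# modules/rds-postgres/outputs.tf
output "endpoint" {
  description = "Connection endpoint (host:port)"
  value       = aws_db_instance.this.endpoint
}

output "security_group_id" {
  description = "Security group attached to the instance"
  value       = aws_security_group.this.id
}

output "db_name" {
  description = "Initial database name"
  value       = aws_db_instance.this.db_name
}
```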
CDK + Terraform: The Pragmatic Hybrid
Some teams go all-in on CDK. Others go all-in on Terraform. We use both, and it works.
| Use Case | Tool | Why |
|---|---|---|
| Networking, databases, clusters | Terraform | Declarative, plan-before-apply, state management |
| Lambda functions + API Gateway | CDK | Better Lambda bundling, API Gateway constructs |
| Complex IAM policies | CDK | TypeScript logic for conditional policies |
| Kubernetes resources | Kustomize + Flux | GitOps, reconciliation loops |
| Static infrastructure | Terraform | Simple, readable, well-understood |
The boundary is clear: Terraform manages infrastructure that changes rarely (VPC, RDS, EKS cluster). CDK manages infrastructure that changes with application deployments (Lambda functions, API routes). Kustomize + Flux manages Kubernetes workloads.
They coexist by using outputs. Terraform outputs the VPC ID, cluster endpoint, and database connection string. CDK reads them from SSM Parameter Store or Terraform remote state.
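The handoff can be as simple as Terraform writing its outputs into SSM under a well-known path; the parameter names here are illustrative:

```hcl
# Terraform publishes values that CDK will consume
resource "aws_ssm_parameter" "vpc_id" {
  name  = "/prod/platform/vpc-id"
  type  = "String"
  value = module.vpc.vpc_id
}

resource "aws_ssm_parameter" "cluster_endpoint" {
  name  = "/prod/platform/cluster-endpoint"
  type  = "String"
  value = module.eks.cluster_endpoint
}
```

On the CDK side, `ssm.StringParameter.valueForStringParameter(this, "/prod/platform/vpc-id")` resolves the value at synth time, so neither tool needs to know about the other's codebase.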
The Drift Problem
Drift happens when someone changes infrastructure through the console (ClickOps), through a CLI command, or through another tool. The real state diverges from the Terraform state.
Detecting Drift
```bash
# Run plan regularly (CI, scheduled job)
terraform plan -detailed-exitcode

# Exit codes:
#   0 = no changes (state matches reality)
#   1 = error
#   2 = changes detected (drift!)
```
Run drift detection in CI on a schedule (daily for production, weekly for staging). Alert when drift is detected. Don't auto-remediate. Investigate first.
Common Drift Causes
| Cause | Prevention |
|---|---|
| Console changes (ClickOps) | Enforce "no console changes" policy. Use SCPs to restrict. |
| Auto-scaling changes | Ignore auto-scaling attributes in Terraform (lifecycle { ignore_changes }) |
| AWS service updates | Pin provider versions. Update deliberately. |
| Another team's Terraform | Separate state files per team/domain. |
| Manual hotfix during incident | Document the change. Apply it in Terraform after the incident. |
```hcl
# Ignore auto-scaling changes (expected drift)
resource "aws_ecs_service" "app" {
  # ...
  desired_count = 2

  lifecycle {
    ignore_changes = [desired_count] # Auto-scaling changes this
  }
}
```
GitOps with Flux
For Kubernetes workloads, we use Flux for GitOps. The reconciliation loop replaces kubectl apply with a pull-based model.
```
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│  Git Repo   │─────▶│    Flux     │─────▶│ Kubernetes  │
│ (manifests) │      │ Controller  │      │   Cluster   │
└─────────────┘      └─────────────┘      └─────────────┘
                            │
                            ├─ Reconciles every 1 min
                            ├─ Detects drift
                            └─ Auto-applies
```
```yaml
# flux-system/kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: platform
  namespace: flux-system
spec:
  interval: 1m
  sourceRef:
    kind: GitRepository
    name: infrastructure
  path: ./kubernetes/resources/overlay/prod
  prune: true
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: pimcore
      namespace: production
```
Flux polls the Git repo every minute. If manifests changed, it applies them. If someone changed a resource manually (drift), Flux reverts it to match Git. This is genuine reconciliation, not just deployment automation.
Sealed Secrets
Secrets can't go in Git as plaintext. Use Bitnami Sealed Secrets:
```bash
# Encrypt the secret for the cluster
kubeseal --cert sealed-secrets.pem \
  -f secrets/database-secrets.yaml \
  -o yaml > secrets/database-secrets-sealed.yaml

# Commit the sealed version (safe in Git)
# Flux applies it; the controller decrypts it in-cluster
```
For how we handle secrets in Pimcore Kubernetes deployments specifically, see our Pimcore upgrade guide which covers the full deployment order.
Managing 30+ AWS Services
At enterprise scale, you're managing a lot of services. Organization matters.
Service Catalog
| Category | Services | Terraform Module? |
|---|---|---|
| Networking | VPC, subnets, NAT, ALB, Route53 | Yes (vpc module) |
| Compute | EKS, ECS Fargate, Lambda | Yes (per service type) |
| Database | RDS PostgreSQL, DynamoDB | Yes (rds-postgres module) |
| Cache | ElastiCache Redis | Yes (redis module) |
| Search | OpenSearch | Yes (opensearch module) |
| Storage | S3, EFS | Inline (simple enough) |
| CDN | CloudFront | Inline |
| Messaging | SQS, MSK (Kafka), RabbitMQ | Inline |
| Auth | Cognito | CDK (complex config) |
| Monitoring | CloudWatch, X-Ray | Inline |
| CI/CD | ECR, CodeBuild | Inline |
| Security | IAM, KMS, Secrets Manager | Global workspace |
Tagging Strategy
Every resource must be tagged for cost allocation, ownership, and lifecycle management:
```hcl
locals {
  common_tags = {
    Environment = var.environment  # prod, staging, dev
    Project     = var.project_name # pimcore, commerce, ai
    ManagedBy   = "terraform"
    Team        = var.team         # platform, backend, data
    CostCenter  = var.cost_center
  }
}

resource "aws_instance" "example" {
  # ...
  tags = merge(local.common_tags, {
    Name = "pimcore-web-1"
    Role = "web"
  })
}
```
Filter AWS Cost Explorer by Project tag to see exactly how much each system costs. Filter by ManagedBy to find resources created manually (not tagged as "terraform").
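The per-resource `merge` boilerplate can also be eliminated with the AWS provider's `default_tags` block, which injects the common tags into every resource the provider creates:

```hcl
provider "aws" {
  region = "eu-central-1"

  default_tags {
    tags = {
      Environment = var.environment
      Project     = var.project_name
      ManagedBy   = "terraform"
      Team        = var.team
      CostCenter  = var.cost_center
    }
  }
}

# Resources now only declare their own tags
resource "aws_instance" "example" {
  # ...
  tags = {
    Name = "pimcore-web-1"
    Role = "web"
  }
}
```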
Common Pitfalls
- **One state file for everything.** A `terraform plan` that touches 200 resources takes minutes, and one mistake affects everything. Split by domain and environment.
- **No state locking.** Two engineers run `terraform apply` simultaneously. One's changes are lost or the state is corrupted. Use DynamoDB locking.
- **Modules with 20+ variables.** If your module interface is as complex as the raw resources, the abstraction adds no value. Keep module interfaces small.
- **Auto-remediating drift.** Detecting drift is good. Automatically fixing it is dangerous. The "drift" might be a valid hotfix during an incident. Investigate before reverting.
- **Secrets in state files.** Terraform state contains every attribute of every resource, including database passwords. Encrypt state at rest and restrict access.
- **No provider version pinning.** A provider update changes resource behavior. Pin versions in `required_providers` and update deliberately.
- **ClickOps for "just this one thing."** Console changes create drift that's invisible until the next `terraform plan`. Enforce infrastructure-as-code for everything.
- **No tagging.** Without tags, you can't attribute costs, identify ownership, or find manually created resources.
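Provider pinning in particular is a one-block fix:

```hcl
terraform {
  required_version = ">= 1.5"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # allow 5.x minor/patch updates, block 6.0
    }
  }
}
```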
Key Takeaways
- **Split state by domain and environment.** Networking, databases, compute, and monitoring should be separate workspaces. Changes to one shouldn't risk another.
- **Modules are for patterns, not abstraction.** Extract when the same resource group appears 3+ times. Don't create modules for one-time resources.
- **CDK + Terraform is pragmatic.** Terraform for static infrastructure, CDK for Lambda/API Gateway, Kustomize + Flux for Kubernetes. Each tool where it's strongest.
- **Drift detection is a scheduled job.** Run `terraform plan` daily in CI. Alert on drift. Investigate before remediating.
- **GitOps with Flux gives real reconciliation.** Not just deployment automation. Flux detects and reverts manual changes. Sealed Secrets keep credentials safe in Git.
- **Tag everything.** Environment, project, team, cost center, managed-by. Without tags, cost attribution and resource auditing are impossible.
We manage infrastructure for cloud deployments, custom software platforms, and data engineering systems. If you need help with IaC at scale, talk to our team or request a quote.