Infrastructure as Code with 200 Resources: What Terraform Tutorials Don't Tell You
Production IaC patterns for real systems. State management at scale, module design, CDK + Terraform hybrid, drift detection, GitOps with Flux, and managing 30+ AWS services.
IaC Is Not "terraform init"
Every Terraform tutorial starts the same way: write a .tf file, run terraform init, run terraform apply, and watch your EC2 instance appear. That gets you from zero to one resource. It does not prepare you for managing 200+ resources across 30 AWS services with a team of engineers who all need to make infrastructure changes safely.
We manage infrastructure for multiple production systems ranging from Kubernetes clusters with Pimcore and OpenSearch to serverless architectures with Lambda, DynamoDB, and API Gateway. The patterns in this article are what survived production. For how we deploy applications on this infrastructure, see our cloud services page.
State Management at Scale
Terraform state is the single most critical file in your infrastructure. It maps your .tf files to real resources. Lose it, and Terraform doesn't know what exists. Corrupt it, and Terraform might destroy production resources.
Remote State (Non-Negotiable)
```hcl
# backend.tf
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "prod/platform/terraform.tfstate"
    region         = "eu-central-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
```
| Rule | Why |
|---|---|
| Remote state in S3 (or equivalent) | Local state files get lost, can't be shared |
| Encryption at rest | State contains secrets (database passwords, API keys) |
| DynamoDB locking | Prevents two engineers from running apply simultaneously |
| Versioning on the S3 bucket | Recover from state corruption by rolling back |
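The state bucket and lock table themselves can be bootstrapped with Terraform in a small, separate config that is applied once (its own state can stay local or be migrated in afterwards). A minimal sketch, reusing the names from the backend above:

```hcl
# Bootstrap resources for remote state (apply once, from a separate config)
resource "aws_s3_bucket" "state" {
  bucket = "company-terraform-state"
}

# Versioning is what makes "roll back a corrupted state file" possible
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id
  versioning_configuration {
    status = "Enabled"
  }
}

# Lock table: Terraform only requires the LockID hash key
resource "aws_dynamodb_table" "locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```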
State Organization
One large state file for everything is a maintenance disaster. Split by environment and domain:
```
terraform/
├── environments/
│   ├── prod/
│   │   ├── platform/     # EKS, VPC, networking
│   │   ├── databases/    # RDS, ElastiCache, OpenSearch
│   │   ├── compute/      # Lambda, ECS, Fargate
│   │   ├── storage/      # S3 buckets, CloudFront
│   │   └── monitoring/   # CloudWatch, alerts
│   ├── staging/
│   │   └── (same structure)
│   └── dev/
│       └── (same structure)
├── modules/
│   ├── vpc/
│   ├── eks-cluster/
│   ├── rds-postgres/
│   ├── opensearch/
│   ├── redis/
│   └── lambda-function/
└── global/
    ├── iam/       # IAM roles, policies
    ├── route53/   # DNS zones
    └── ecr/       # Container registries
```
Each directory is a separate root configuration with its own state file (a "workspace" in the loose sense, not the `terraform workspace` CLI feature). Changes to networking don't risk breaking the database. Changes to monitoring don't require a plan that touches every resource.
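Concretely, each directory carries its own backend block pointing at a distinct state key in the same bucket. A sketch for the databases workspace:

```hcl
# environments/prod/databases/backend.tf
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "prod/databases/terraform.tfstate" # unique key per workspace
    region         = "eu-central-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
```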
Cross-State References
Workspaces need to reference each other. The VPC workspace outputs the VPC ID. The database workspace reads it:
```hcl
# In databases/main.tf
data "terraform_remote_state" "platform" {
  backend = "s3"
  config = {
    bucket = "company-terraform-state"
    key    = "prod/platform/terraform.tfstate"
    region = "eu-central-1"
  }
}

resource "aws_db_instance" "main" {
  # ...
  vpc_security_group_ids = [data.terraform_remote_state.platform.outputs.db_security_group_id]
  db_subnet_group_name   = data.terraform_remote_state.platform.outputs.db_subnet_group_name
}
```
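The producing side has to declare those values explicitly; anything not exported as an output is invisible to other workspaces. A sketch of the matching platform outputs (the internal resource names `aws_security_group.db` and `aws_db_subnet_group.main` are illustrative):

```hcl
# In platform/outputs.tf
output "db_security_group_id" {
  value = aws_security_group.db.id
}

output "db_subnet_group_name" {
  value = aws_db_subnet_group.main.name
}
```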
Module Design
When to Extract a Module
Not every resource needs a module. Extract when:
- The same pattern is used in 3+ places (DRY)
- The resource group has a clear boundary (VPC, database cluster)
- The configuration has sensible defaults that reduce duplication
Don't extract when:
- It's used once (premature abstraction)
- The module would have 20+ variables (too many knobs)
- The abstraction hides important details (networking, security)
Module Interface Design
A good module has few required variables, sensible defaults, and clear outputs:
```hcl
# modules/rds-postgres/variables.tf
variable "name" {
  description = "Database instance name"
  type        = string
}

variable "vpc_id" {
  description = "VPC to deploy into"
  type        = string
}

variable "subnet_ids" {
  description = "Subnets for the DB subnet group"
  type        = list(string)
}

variable "instance_class" {
  description = "RDS instance type"
  type        = string
  default     = "db.t3.medium"
}

variable "engine_version" {
  description = "PostgreSQL version"
  type        = string
  default     = "15.4"
}

variable "allocated_storage" {
  description = "Storage in GB"
  type        = number
  default     = 50
}

variable "multi_az" {
  description = "Enable multi-AZ deployment"
  type        = bool
  default     = false # true for prod, false for staging/dev
}
```
The module consumer writes:
```hcl
module "database" {
  source     = "../../modules/rds-postgres"
  name       = "pimcore-prod"
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnet_ids
  multi_az   = true
}
```
Five lines instead of fifty. The module handles security groups, parameter groups, subnet groups, encryption, backup retention, and monitoring.
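Outputs are the other half of a clean module interface. A sketch of what this module might expose (the internal resource name `aws_db_instance.this` is illustrative):

```hcl
# modules/rds-postgres/outputs.tf
output "endpoint" {
  description = "Connection endpoint (host:port)"
  value       = aws_db_instance.this.endpoint
}

output "security_group_id" {
  description = "Security group attached to the instance"
  value       = aws_security_group.this.id
}

output "db_name" {
  description = "Initial database name"
  value       = aws_db_instance.this.db_name
}
```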
CDK + Terraform: The Pragmatic Hybrid
Some teams go all-in on CDK. Others go all-in on Terraform. We use both, and it works.
| Use Case | Tool | Why |
|---|---|---|
| Networking, databases, clusters | Terraform | Declarative, plan-before-apply, state management |
| Lambda functions + API Gateway | CDK | Better Lambda bundling, API Gateway constructs |
| Complex IAM policies | CDK | TypeScript logic for conditional policies |
| Kubernetes resources | Kustomize + Flux | GitOps, reconciliation loops |
| Static infrastructure | Terraform | Simple, readable, well-understood |
The boundary is clear: Terraform manages infrastructure that changes rarely (VPC, RDS, EKS cluster). CDK manages infrastructure that changes with application deployments (Lambda functions, API routes). Kustomize + Flux manages Kubernetes workloads.
They coexist by using outputs. Terraform outputs the VPC ID, cluster endpoint, and database connection string. CDK reads them from SSM Parameter Store or Terraform remote state.
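The handoff can be as simple as Terraform writing its outputs into SSM under a well-known path; the parameter names here are illustrative:

```hcl
# Terraform publishes values that CDK will consume
resource "aws_ssm_parameter" "vpc_id" {
  name  = "/prod/platform/vpc-id"
  type  = "String"
  value = module.vpc.vpc_id
}

resource "aws_ssm_parameter" "cluster_endpoint" {
  name  = "/prod/platform/cluster-endpoint"
  type  = "String"
  value = module.eks.cluster_endpoint
}
```

On the CDK side, `ssm.StringParameter.valueForStringParameter(this, "/prod/platform/vpc-id")` resolves the value at synth time, so neither tool needs to know about the other's codebase.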
The Drift Problem
Drift happens when someone changes infrastructure through the console (ClickOps), through a CLI command, or through another tool. The real state diverges from the Terraform state.
Detecting Drift
```bash
# Run plan regularly (CI, scheduled job)
terraform plan -detailed-exitcode

# Exit codes:
#   0 = no changes (state matches reality)
#   1 = error
#   2 = changes detected (drift!)
```
Run drift detection in CI on a schedule (daily for production, weekly for staging). Alert when drift is detected. Don't auto-remediate. Investigate first.
Common Drift Causes
| Cause | Prevention |
|---|---|
| Console changes (ClickOps) | Enforce "no console changes" policy. Use SCPs to restrict. |
| Auto-scaling changes | Ignore auto-scaling attributes in Terraform (lifecycle { ignore_changes }) |
| AWS service updates | Pin provider versions. Update deliberately. |
| Another team's Terraform | Separate state files per team/domain. |
| Manual hotfix during incident | Document the change. Apply it in Terraform after the incident. |
```hcl
# Ignore auto-scaling changes (expected drift)
resource "aws_ecs_service" "app" {
  # ...
  desired_count = 2

  lifecycle {
    ignore_changes = [desired_count] # Auto-scaling changes this
  }
}
```
GitOps with Flux
For Kubernetes workloads, we use Flux for GitOps. The reconciliation loop replaces kubectl apply with a pull-based model.
```
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│  Git Repo   │─────▶│    Flux     │─────▶│ Kubernetes  │
│ (manifests) │      │ Controller  │      │   Cluster   │
└─────────────┘      └─────────────┘      └─────────────┘
                            │
                            ├─ Reconciles every 1 min
                            ├─ Detects drift
                            └─ Auto-applies
```
```yaml
# flux-system/kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: platform
  namespace: flux-system
spec:
  interval: 1m
  sourceRef:
    kind: GitRepository
    name: infrastructure
  path: ./kubernetes/resources/overlay/prod
  prune: true
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: pimcore
      namespace: production
```
Flux polls the Git repo every minute. If manifests changed, it applies them. If someone changed a resource manually (drift), Flux reverts it to match Git. This is genuine reconciliation, not just deployment automation.
Sealed Secrets
Secrets can't go in Git as plaintext. Use Bitnami Sealed Secrets:
```bash
# Encrypt the secret for the cluster
kubeseal --cert sealed-secrets.pem \
  -f secrets/database-secrets.yaml \
  -o yaml > secrets/database-secrets-sealed.yaml

# Commit the sealed version (safe in Git)
# Flux applies it; the controller decrypts it in-cluster
```
For how we handle secrets in Pimcore Kubernetes deployments specifically, see our Pimcore upgrade guide which covers the full deployment order.
Managing 30+ AWS Services
At enterprise scale, you're managing a lot of services. Organization matters.
Service Catalog
| Category | Services | Terraform Module? |
|---|---|---|
| Networking | VPC, subnets, NAT, ALB, Route53 | Yes (vpc module) |
| Compute | EKS, ECS Fargate, Lambda | Yes (per service type) |
| Database | RDS PostgreSQL, DynamoDB | Yes (rds-postgres module) |
| Cache | ElastiCache Redis | Yes (redis module) |
| Search | OpenSearch | Yes (opensearch module) |
| Storage | S3, EFS | Inline (simple enough) |
| CDN | CloudFront | Inline |
| Messaging | SQS, MSK (Kafka), RabbitMQ | Inline |
| Auth | Cognito | CDK (complex config) |
| Monitoring | CloudWatch, X-Ray | Inline |
| CI/CD | ECR, CodeBuild | Inline |
| Security | IAM, KMS, Secrets Manager | Global workspace |
Tagging Strategy
Every resource must be tagged for cost allocation, ownership, and lifecycle management:
```hcl
locals {
  common_tags = {
    Environment = var.environment  # prod, staging, dev
    Project     = var.project_name # pimcore, commerce, ai
    ManagedBy   = "terraform"
    Team        = var.team         # platform, backend, data
    CostCenter  = var.cost_center
  }
}

resource "aws_instance" "example" {
  # ...
  tags = merge(local.common_tags, {
    Name = "pimcore-web-1"
    Role = "web"
  })
}
```
Filter AWS Cost Explorer by Project tag to see exactly how much each system costs. Filter by ManagedBy to find resources created manually (not tagged as "terraform").
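The per-resource `merge` boilerplate can also be eliminated with the AWS provider's `default_tags` block, which injects the common tags into every resource the provider creates:

```hcl
provider "aws" {
  region = "eu-central-1"

  default_tags {
    tags = {
      Environment = var.environment
      Project     = var.project_name
      ManagedBy   = "terraform"
      Team        = var.team
      CostCenter  = var.cost_center
    }
  }
}

# Resources now only declare their own tags
resource "aws_instance" "example" {
  # ...
  tags = {
    Name = "pimcore-web-1"
    Role = "web"
  }
}
```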
Common Pitfalls
- **One state file for everything.** A `terraform plan` that touches 200 resources takes minutes, and one mistake affects everything. Split by domain and environment.
- **No state locking.** Two engineers run `terraform apply` simultaneously. One's changes are lost or the state is corrupted. Use DynamoDB locking.
- **Modules with 20+ variables.** If your module interface is as complex as the raw resources, the abstraction adds no value. Keep module interfaces small.
- **Auto-remediating drift.** Detecting drift is good. Automatically fixing it is dangerous. The "drift" might be a valid hotfix during an incident. Investigate before reverting.
- **Secrets in state files.** Terraform state contains every attribute of every resource, including database passwords. Encrypt state at rest and restrict access.
- **No provider version pinning.** A provider update changes resource behavior. Pin versions in `required_providers` and update deliberately.
- **ClickOps for "just this one thing."** Console changes create drift that's invisible until the next `terraform plan`. Enforce infrastructure-as-code for everything.
- **No tagging.** Without tags, you can't attribute costs, identify ownership, or find manually created resources.
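Provider pinning in particular is a one-block fix:

```hcl
terraform {
  required_version = ">= 1.5"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # allow 5.x minor/patch updates, block 6.0
    }
  }
}
```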
Key Takeaways
- **Split state by domain and environment.** Networking, databases, compute, and monitoring should be separate workspaces. Changes to one shouldn't risk another.
- **Modules are for patterns, not abstraction.** Extract when the same resource group appears 3+ times. Don't create modules for one-time resources.
- **CDK + Terraform is pragmatic.** Terraform for static infrastructure, CDK for Lambda/API Gateway, Kustomize + Flux for Kubernetes. Each tool where it's strongest.
- **Drift detection is a scheduled job.** Run `terraform plan` daily in CI. Alert on drift. Investigate before remediating.
- **GitOps with Flux gives real reconciliation.** Not just deployment automation. Flux detects and reverts manual changes. Sealed Secrets keep credentials safe in Git.
- **Tag everything.** Environment, project, team, cost center, managed-by. Without tags, cost attribution and resource auditing are impossible.
We manage infrastructure for cloud deployments, custom software platforms, and data engineering systems. If you need help with IaC at scale, talk to our team or request a quote.