Technical Guide

Infrastructure as Code with 200 Resources: What Terraform Tutorials Don't Tell You

Production IaC patterns for real systems. State management at scale, module design, CDK + Terraform hybrid, drift detection, GitOps with Flux, and managing 30+ AWS services.

April 20, 2026 · 16 min read · Oronts Engineering Team

IaC Is Not "terraform init"

Every Terraform tutorial starts the same way: write a .tf file, run terraform init, run terraform apply, and watch your EC2 instance appear. That gets you from zero to one resource. It does not prepare you for managing 200+ resources across 30 AWS services with a team of engineers who all need to make infrastructure changes safely.

We manage infrastructure for multiple production systems ranging from Kubernetes clusters with Pimcore and OpenSearch to serverless architectures with Lambda, DynamoDB, and API Gateway. The patterns in this article are what survived production. For how we deploy applications on this infrastructure, see our cloud services page.

State Management at Scale

Terraform state is the single most critical file in your infrastructure. It maps your .tf files to real resources. Lose it, and Terraform doesn't know what exists. Corrupt it, and Terraform might destroy production resources.

Remote State (Non-Negotiable)

# backend.tf
terraform {
    backend "s3" {
        bucket         = "company-terraform-state"
        key            = "prod/platform/terraform.tfstate"
        region         = "eu-central-1"
        encrypt        = true
        dynamodb_table = "terraform-locks"
    }
}
| Rule | Why |
| --- | --- |
| Remote state in S3 (or equivalent) | Local state files get lost and can't be shared |
| Encryption at rest | State contains secrets (database passwords, API keys) |
| DynamoDB locking | Prevents two engineers from running apply simultaneously |
| Versioning on the S3 bucket | Recover from state corruption by rolling back |
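The bucket and lock table themselves are usually bootstrapped once, outside the state they protect. A minimal sketch, assuming the same names as the backend config above (billing mode and KMS choice are illustrative):

```hcl
# Bootstrap resources for remote state (apply once, e.g. with local state)
resource "aws_s3_bucket" "terraform_state" {
    bucket = "company-terraform-state"
}

resource "aws_s3_bucket_versioning" "terraform_state" {
    bucket = aws_s3_bucket.terraform_state.id
    versioning_configuration {
        status = "Enabled"
    }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
    bucket = aws_s3_bucket.terraform_state.id
    rule {
        apply_server_side_encryption_by_default {
            sse_algorithm = "aws:kms"
        }
    }
}

resource "aws_dynamodb_table" "terraform_locks" {
    name         = "terraform-locks"
    billing_mode = "PAY_PER_REQUEST"
    hash_key     = "LockID"  # the S3 backend expects this exact attribute name

    attribute {
        name = "LockID"
        type = "S"
    }
}
```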

State Organization

One large state file for everything is a maintenance disaster. Split by environment and domain:

terraform/
β”œβ”€β”€ environments/
β”‚   β”œβ”€β”€ prod/
β”‚   β”‚   β”œβ”€β”€ platform/       # EKS, VPC, networking
β”‚   β”‚   β”œβ”€β”€ databases/      # RDS, ElastiCache, OpenSearch
β”‚   β”‚   β”œβ”€β”€ compute/        # Lambda, ECS, Fargate
β”‚   β”‚   β”œβ”€β”€ storage/        # S3 buckets, CloudFront
β”‚   β”‚   └── monitoring/     # CloudWatch, alerts
β”‚   β”œβ”€β”€ staging/
β”‚   β”‚   └── (same structure)
β”‚   └── dev/
β”‚       └── (same structure)
β”œβ”€β”€ modules/
β”‚   β”œβ”€β”€ vpc/
β”‚   β”œβ”€β”€ eks-cluster/
β”‚   β”œβ”€β”€ rds-postgres/
β”‚   β”œβ”€β”€ opensearch/
β”‚   β”œβ”€β”€ redis/
β”‚   └── lambda-function/
└── global/
    β”œβ”€β”€ iam/                 # IAM roles, policies
    β”œβ”€β”€ route53/             # DNS zones
    └── ecr/                 # Container registries

Each directory is a separate Terraform workspace with its own state file. Changes to networking don't risk breaking the database. Changes to monitoring don't require a plan that touches every resource.

Cross-State References

Workspaces need to reference each other. The VPC workspace outputs the VPC ID. The database workspace reads it:

# In databases/main.tf
data "terraform_remote_state" "platform" {
    backend = "s3"
    config = {
        bucket = "company-terraform-state"
        key    = "prod/platform/terraform.tfstate"
        region = "eu-central-1"
    }
}

resource "aws_db_instance" "main" {
    vpc_security_group_ids = [data.terraform_remote_state.platform.outputs.db_security_group_id]
    db_subnet_group_name   = data.terraform_remote_state.platform.outputs.db_subnet_group_name
}
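The producing side lives in the platform workspace's outputs. A sketch, assuming the security group and subnet group resources carry these names inside that workspace:

```hcl
# In platform/outputs.tf
output "db_security_group_id" {
    description = "Security group that allows database access from app subnets"
    value       = aws_security_group.database.id
}

output "db_subnet_group_name" {
    description = "Subnet group spanning the private subnets"
    value       = aws_db_subnet_group.main.name
}
```

Only values exposed as outputs are readable across workspaces, which makes the output block an explicit, reviewable contract between teams.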

Module Design

When to Extract a Module

Not every resource needs a module. Extract when:

  • The same pattern is used in 3+ places (DRY)
  • The resource group has a clear boundary (VPC, database cluster)
  • The configuration has sensible defaults that reduce duplication

Don't extract when:

  • It's used once (premature abstraction)
  • The module would have 20+ variables (too many knobs)
  • The abstraction hides important details (networking, security)

Module Interface Design

A good module has few required variables, sensible defaults, and clear outputs:

# modules/rds-postgres/variables.tf
variable "name" {
    description = "Database instance name"
    type        = string
}

variable "vpc_id" {
    description = "VPC to deploy into"
    type        = string
}

variable "subnet_ids" {
    description = "Subnets for the DB subnet group"
    type        = list(string)
}

variable "instance_class" {
    description = "RDS instance type"
    type        = string
    default     = "db.t3.medium"
}

variable "engine_version" {
    description = "PostgreSQL version"
    type        = string
    default     = "15.4"
}

variable "allocated_storage" {
    description = "Storage in GB"
    type        = number
    default     = 50
}

variable "multi_az" {
    description = "Enable multi-AZ deployment"
    type        = bool
    default     = false  # true for prod, false for staging/dev
}

The module consumer writes:

module "database" {
    source     = "../../modules/rds-postgres"
    name       = "pimcore-prod"
    vpc_id     = module.vpc.vpc_id
    subnet_ids = module.vpc.private_subnet_ids
    multi_az   = true
}

Five lines instead of fifty. The module handles security groups, parameter groups, subnet groups, encryption, backup retention, and monitoring.
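"Clear outputs" means the consumer never reaches into the module's internals. A sketch of the matching outputs file (the resource names inside the module are assumptions):

```hcl
# modules/rds-postgres/outputs.tf
output "endpoint" {
    description = "Connection endpoint (host:port)"
    value       = aws_db_instance.main.endpoint
}

output "security_group_id" {
    description = "Security group attached to the instance"
    value       = aws_security_group.db.id
}

output "db_name" {
    description = "Initial database name"
    value       = aws_db_instance.main.db_name
}
```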

CDK + Terraform: The Pragmatic Hybrid

Some teams go all-in on CDK. Others go all-in on Terraform. We use both, and it works.

| Use Case | Tool | Why |
| --- | --- | --- |
| Networking, databases, clusters | Terraform | Declarative, plan-before-apply, state management |
| Lambda functions + API Gateway | CDK | Better Lambda bundling, API Gateway constructs |
| Complex IAM policies | CDK | TypeScript logic for conditional policies |
| Kubernetes resources | Kustomize + Flux | GitOps, reconciliation loops |
| Static infrastructure | Terraform | Simple, readable, well-understood |

The boundary is clear: Terraform manages infrastructure that changes rarely (VPC, RDS, EKS cluster). CDK manages infrastructure that changes with application deployments (Lambda functions, API routes). Kustomize + Flux manages Kubernetes workloads.

They coexist by using outputs. Terraform outputs the VPC ID, cluster endpoint, and database connection string. CDK reads them from SSM Parameter Store or Terraform remote state.
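One way to sketch the SSM handoff: Terraform publishes the value as a parameter, and CDK resolves it at synth time. The parameter name is illustrative:

```hcl
# In platform/ssm.tf -- publish outputs for CDK to consume
resource "aws_ssm_parameter" "vpc_id" {
    name  = "/prod/platform/vpc-id"
    type  = "String"
    value = module.vpc.vpc_id
}
```

On the CDK side, `ssm.StringParameter.valueForStringParameter(this, "/prod/platform/vpc-id")` (from `aws-cdk-lib/aws-ssm`) reads the value without creating a hard dependency between the two toolchains.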

The Drift Problem

Drift happens when someone changes infrastructure through the console (ClickOps), through a CLI command, or through another tool. The real state diverges from the Terraform state.

Detecting Drift

# Run plan regularly (CI, scheduled job)
terraform plan -detailed-exitcode

# Exit codes:
# 0 = no changes (state matches reality)
# 1 = error
# 2 = changes detected (drift!)

Run drift detection in CI on a schedule (daily for production, weekly for staging). Alert when drift is detected. Don't auto-remediate. Investigate first.
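A scheduled drift check can be sketched as a CI workflow; this example assumes GitHub Actions with AWS credentials already configured, and fails the job (triggering your normal CI alerting) when exit code 2 signals drift:

```yaml
# .github/workflows/drift-detection.yml (illustrative)
name: drift-detection
on:
    schedule:
        - cron: "0 6 * * *"  # daily for production
jobs:
    plan:
        runs-on: ubuntu-latest
        steps:
            - uses: actions/checkout@v4
            - uses: hashicorp/setup-terraform@v3
            - name: Detect drift
              working-directory: environments/prod/platform
              run: |
                  terraform init -input=false
                  terraform plan -detailed-exitcode -input=false \
                      || { [ $? -eq 2 ] && echo "::warning::Drift detected" && exit 1; }
```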

Common Drift Causes

| Cause | Prevention |
| --- | --- |
| Console changes (ClickOps) | Enforce a "no console changes" policy. Use SCPs to restrict. |
| Auto-scaling changes | Ignore auto-scaling attributes in Terraform (`lifecycle { ignore_changes }`) |
| AWS service updates | Pin provider versions. Update deliberately. |
| Another team's Terraform | Separate state files per team/domain. |
| Manual hotfix during incident | Document the change. Apply it in Terraform after the incident. |

# Ignore auto-scaling changes (expected drift)
resource "aws_ecs_service" "app" {
    desired_count = 2

    lifecycle {
        ignore_changes = [desired_count]  # Auto-scaling changes this
    }
}

GitOps with Flux

For Kubernetes workloads, we use Flux for GitOps. The reconciliation loop replaces kubectl apply with a pull-based model.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Git Repo    │────▢│  Flux       │────▢│  Kubernetes  β”‚
β”‚  (manifests) β”‚     β”‚  Controller β”‚     β”‚  Cluster     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                          β”‚
                          β”‚ Reconciles every 1 min
                          β”‚ Detects drift
                          β”‚ Auto-applies
# flux-system/kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
    name: platform
    namespace: flux-system
spec:
    interval: 1m
    sourceRef:
        kind: GitRepository
        name: infrastructure
    path: ./kubernetes/resources/overlay/prod
    prune: true
    healthChecks:
        - apiVersion: apps/v1
          kind: Deployment
          name: pimcore
          namespace: production

Flux polls the Git repo every minute. If manifests changed, it applies them. If someone changed a resource manually (drift), Flux reverts it to match Git. This is genuine reconciliation, not just deployment automation.
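The Kustomization above references a GitRepository source; a matching sketch (repository URL, branch, and secret name are placeholders):

```yaml
# flux-system/gitrepository.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
    name: infrastructure
    namespace: flux-system
spec:
    interval: 1m
    url: ssh://git@github.com/company/infrastructure
    ref:
        branch: main
    secretRef:
        name: flux-system  # deploy key, typically created by flux bootstrap
```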

Sealed Secrets

Secrets can't go in Git as plaintext. Use Bitnami Sealed Secrets:

# Encrypt secret for the cluster
kubeseal --cert sealed-secrets.pem \
    -f secrets/database-secrets.yaml \
    -o yaml > secrets/database-secrets-sealed.yaml

# Commit the sealed version (safe in Git)
# Flux applies it, the controller decrypts it in-cluster

For how we handle secrets in Pimcore Kubernetes deployments specifically, see our Pimcore upgrade guide, which covers the full deployment order.

Managing 30+ AWS Services

At enterprise scale, you're managing a lot of services. Organization matters.

Service Catalog

| Category | Services | Terraform Module? |
| --- | --- | --- |
| Networking | VPC, subnets, NAT, ALB, Route53 | Yes (vpc module) |
| Compute | EKS, ECS Fargate, Lambda | Yes (per service type) |
| Database | RDS PostgreSQL, DynamoDB | Yes (rds-postgres module) |
| Cache | ElastiCache Redis | Yes (redis module) |
| Search | OpenSearch | Yes (opensearch module) |
| Storage | S3, EFS | Inline (simple enough) |
| CDN | CloudFront | Inline |
| Messaging | SQS, MSK (Kafka), RabbitMQ | Inline |
| Auth | Cognito | CDK (complex config) |
| Monitoring | CloudWatch, X-Ray | Inline |
| CI/CD | ECR, CodeBuild | Inline |
| Security | IAM, KMS, Secrets Manager | Global workspace |

Tagging Strategy

Every resource must be tagged for cost allocation, ownership, and lifecycle management:

locals {
    common_tags = {
        Environment = var.environment    # prod, staging, dev
        Project     = var.project_name   # pimcore, commerce, ai
        ManagedBy   = "terraform"
        Team        = var.team           # platform, backend, data
        CostCenter  = var.cost_center
    }
}

resource "aws_instance" "example" {
    tags = merge(local.common_tags, {
        Name = "pimcore-web-1"
        Role = "web"
    })
}

Filter AWS Cost Explorer by Project tag to see exactly how much each system costs. Filter by ManagedBy to find resources created manually (not tagged as "terraform").
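The AWS provider can also apply common tags automatically via `default_tags`, which guarantees coverage even when a resource author forgets the merge. A sketch:

```hcl
provider "aws" {
    region = "eu-central-1"

    default_tags {
        tags = {
            Environment = var.environment
            ManagedBy   = "terraform"
            Team        = var.team
        }
    }
}
```

Resource-level `tags` still merge on top, so per-resource values like `Name` and `Role` override or extend the provider defaults.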

Common Pitfalls

  1. One state file for everything. A terraform plan that touches 200 resources takes minutes and one mistake affects everything. Split by domain and environment.

  2. No state locking. Two engineers run terraform apply simultaneously. One's changes are lost or the state is corrupted. Use DynamoDB locking.

  3. Modules with 20+ variables. If your module interface is as complex as the raw resources, the abstraction adds no value. Keep module interfaces small.

  4. Auto-remediating drift. Detecting drift is good. Automatically fixing it is dangerous. The "drift" might be a valid hotfix during an incident. Investigate before reverting.

  5. Secrets in state files. Terraform state contains every attribute of every resource, including database passwords. Encrypt state at rest and restrict access.

  6. No provider version pinning. A provider update changes resource behavior. Pin versions in required_providers and update deliberately.

  7. ClickOps for "just this one thing." Console changes create drift that's invisible until the next terraform plan. Enforce infrastructure-as-code for everything.

  8. No tagging. Without tags, you can't attribute costs, identify ownership, or find manually created resources.
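Pitfall 6 is addressed with explicit version constraints; a sketch using pessimistic pinning (the exact versions are illustrative):

```hcl
terraform {
    required_version = ">= 1.6.0"

    required_providers {
        aws = {
            source  = "hashicorp/aws"
            version = "~> 5.40"  # allows 5.40.x and newer 5.x releases, never 6.x
        }
    }
}
```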

Key Takeaways

  • Split state by domain and environment. Networking, databases, compute, and monitoring should be separate workspaces. Changes to one shouldn't risk another.

  • Modules are for patterns, not abstraction. Extract when the same resource group appears 3+ times. Don't create modules for one-time resources.

  • CDK + Terraform is pragmatic. Terraform for static infrastructure, CDK for Lambda/API Gateway, Kustomize + Flux for Kubernetes. Each tool where it's strongest.

  • Drift detection is a scheduled job. Run terraform plan daily in CI. Alert on drift. Investigate before remediating.

  • GitOps with Flux gives real reconciliation. Not just deployment automation. Flux detects and reverts manual changes. Sealed Secrets keep credentials safe in Git.

  • Tag everything. Environment, project, team, cost center, managed-by. Without tags, cost attribution and resource auditing are impossible.

We manage infrastructure for cloud deployments, custom software platforms, and data engineering systems. If you need help with IaC at scale, talk to our team or request a quote.

Topics covered

Terraform production, IaC lessons, infrastructure as code real, Terraform AWS, CDK vs Terraform, Terraform state management, GitOps Flux, Terraform modules
