Infrastructure as Code for Data Platforms
The discipline of Infrastructure as Code transformed how we manage compute and networking. Data platforms have been slower to adopt these practices — data pipelines lived in Jupyter notebooks, schema changes were applied manually, and "environment promotion" meant copying SQL files between folders. That era is over.
This guide covers how to apply rigorous IaC principles to modern data platforms: from the underlying cloud resources to the schemas, pipelines, and governance policies that run on top of them.
The Data Platform IaC Stack
A complete IaC approach for data platforms operates at four layers:
- L1: Cloud resources (Terraform: storage, networking, IAM)
- L2: Data warehouse and platform infrastructure (Databricks, warehouses, catalogs)
- L3: Schemas and pipelines as code (dbt, migrations)
- L4: GitOps workflows and data products (CI/CD, environment promotion)
Most teams have L1 covered with Terraform. L2 is where things get interesting. L3 and L4 are where they usually fall down.
Layer 1: Cloud Resources with Terraform
Module Structure for Data Platform Infrastructure
terraform/
├── modules/
│   ├── data-lake/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── README.md
│   ├── streaming-platform/
│   ├── data-warehouse/
│   └── governance/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   └── prod/
└── shared/
    ├── backend.tf
    └── providers.tf
Data Lake Module Example
# modules/data-lake/main.tf
locals {
  bucket_name = "${var.project_prefix}-${var.environment}-lakehouse"

  common_tags = merge(var.tags, {
    Module      = "data-lake"
    Environment = var.environment
    ManagedBy   = "terraform"
  })
}

resource "aws_s3_bucket" "lakehouse" {
  bucket = local.bucket_name
  tags   = local.common_tags
}

resource "aws_s3_bucket_versioning" "lakehouse" {
  bucket = aws_s3_bucket.lakehouse.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "lakehouse" {
  bucket = aws_s3_bucket.lakehouse.id

  rule {
    id     = "bronze-to-glacier"
    status = "Enabled"

    filter {
      prefix = "bronze/"
    }

    transition {
      days          = 90
      storage_class = "GLACIER_IR"
    }
  }

  rule {
    id     = "silver-intelligent-tiering"
    status = "Enabled"

    filter {
      prefix = "silver/"
    }

    transition {
      days          = 30
      storage_class = "INTELLIGENT_TIERING"
    }
  }
}

# Lake Formation permissions
resource "aws_lakeformation_resource" "lakehouse" {
  arn      = aws_s3_bucket.lakehouse.arn
  role_arn = aws_iam_role.lakeformation_service.arn
}

resource "aws_lakeformation_permissions" "analyst_access" {
  for_each = toset(var.analyst_role_arns)

  principal   = each.value
  permissions = ["SELECT", "DESCRIBE"]

  table {
    database_name = aws_glue_catalog_database.silver.name
    wildcard      = true
  }
}
Glue Catalog as Code
# modules/data-lake/catalog.tf
resource "aws_glue_catalog_database" "bronze" {
  name        = "${var.project_prefix}_bronze"
  description = "Raw ingested data — immutable, append-only"

  create_table_default_permission {
    permissions = ["ALL"]

    principal {
      data_lake_principal_identifier = "IAM_ALLOWED_PRINCIPALS"
    }
  }
}

resource "aws_glue_catalog_database" "silver" {
  name        = "${var.project_prefix}_silver"
  description = "Cleaned, validated, conformed data"
}

resource "aws_glue_catalog_database" "gold" {
  name        = "${var.project_prefix}_gold"
  description = "Business-ready aggregates and data products"
}
Layer 2: Data Warehouse Infrastructure
Databricks Workspace with Terraform
# modules/databricks-workspace/main.tf
terraform {
  required_providers {
    databricks = {
      source  = "databricks/databricks"
      version = "~> 1.38"
    }
  }
}

resource "databricks_cluster_policy" "data_engineering" {
  name = "data-engineering-${var.environment}"

  definition = jsonencode({
    "spark_version" : {
      "type"         : "allowlist",
      "values"       : ["13.3.x-scala2.12", "14.3.x-scala2.12"],
      "defaultValue" : "14.3.x-scala2.12"
    },
    "node_type_id" : {
      "type"   : "allowlist",
      "values" : ["m5d.xlarge", "m5d.2xlarge", "m5d.4xlarge"]
    },
    "autotermination_minutes" : {
      "type"   : "fixed",
      "value"  : 60,
      "hidden" : false
    },
    "custom_tags.team" : {
      "type"  : "fixed",
      "value" : var.team_tag
    }
  })
}

resource "databricks_sql_warehouse" "main" {
  name             = "${var.project_prefix}-${var.environment}"
  cluster_size     = var.environment == "prod" ? "Medium" : "X-Small"
  auto_stop_mins   = 5
  min_num_clusters = 1
  max_num_clusters = var.environment == "prod" ? 3 : 1

  tags {
    custom_tags {
      key   = "environment"
      value = var.environment
    }
  }
}

resource "databricks_catalog" "main" {
  name    = var.catalog_name
  comment = "Main Unity Catalog for ${var.environment}"

  properties = {
    purpose = "data-platform"
  }
}
Layer 3: Schema-as-Code with dbt
Schema management deserves the same rigor as infrastructure. dbt provides this for transformation logic; combine it with schema migrations for your operational databases.
dbt Project Structure for Platform Teams
dbt_project/
├── dbt_project.yml
├── profiles.yml          # managed via Vault / environment vars
├── models/
│   ├── staging/          # raw → typed, renamed
│   ├── intermediate/     # business logic
│   └── marts/            # data products for consumers
├── tests/
│   ├── generic/          # reusable test macros
│   └── singular/         # one-off assertions
├── macros/
├── seeds/                # small reference tables, version-controlled
└── analyses/             # ad-hoc, not materialised
dbt Model with Data Contract
# models/marts/finance/schema.yml
version: 2

models:
  - name: fct_revenue
    description: "Daily revenue fact table — SLA: 99.9% freshness within 2h of close"
    config:
      contract:
        enforced: true  # dbt will fail if types don't match
    columns:
      - name: revenue_id
        data_type: varchar
        constraints:
          - type: not_null
          - type: unique
      - name: amount_usd
        data_type: numeric(18,4)
        constraints:
          - type: not_null
      - name: transaction_date
        data_type: date
        constraints:
          - type: not_null
    tests:
      - dbt_expectations.expect_column_values_to_be_between:
          column_name: amount_usd
          min_value: 0
          max_value: 10000000
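dbt enforces this contract at build time by comparing declared and actual column types. The same idea can be sketched outside dbt as a plain Python pre-load check; the `CONTRACT` mapping and `violations` helper below are illustrative assumptions, not part of any dbt API:

```python
from datetime import date
from decimal import Decimal

# Hypothetical Python mirror of the fct_revenue contract above:
# column name -> (expected Python type, nullable)
CONTRACT = {
    "revenue_id": (str, False),
    "amount_usd": (Decimal, False),
    "transaction_date": (date, False),
}

def violations(row: dict) -> list:
    """Return a list of contract violations for a single record."""
    problems = []
    for column, (expected_type, nullable) in CONTRACT.items():
        value = row.get(column)
        if value is None:
            if not nullable:
                problems.append(f"{column}: null not allowed")
        elif not isinstance(value, expected_type):
            problems.append(f"{column}: expected {expected_type.__name__}")
    return problems
```

The point is the same either way: the contract is declared once, versioned in git, and enforced mechanically rather than by convention.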
Layer 4: GitOps Workflows for Data Pipelines
CI/CD Pipeline Architecture
On every pull request, CI validates each layer in parallel: a Terraform plan with destroy checks for the infrastructure, and a dbt compile and test pass for the transformation layer.
GitHub Actions Workflow
# .github/workflows/data-platform-ci.yml
name: Data Platform CI

on:
  pull_request:
    paths:
      - 'terraform/**'
      - 'dbt_project/**'
      - 'airflow/dags/**'

jobs:
  terraform-plan:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: terraform/environments/staging
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.7.x"

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_CICD_ROLE_ARN }}
          aws-region: eu-west-1

      - name: Terraform Init
        run: terraform init -backend-config="key=staging/terraform.tfstate"

      - name: Terraform Plan
        run: terraform plan -out=tfplan.binary

      - name: Convert Plan to JSON
        run: terraform show -json tfplan.binary > tfplan.json

      - name: Validate Plan (no destroys of protected resources)
        run: |
          python scripts/validate_plan.py tfplan.json \
            --no-destroy-pattern "aws_glue_catalog_table" \
            --no-destroy-pattern "aws_s3_bucket.lakehouse"

  dbt-ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dbt
        run: pip install dbt-spark

      - name: dbt deps  # installs dbt_expectations and other packages from packages.yml
        run: dbt deps
        working-directory: dbt_project

      - name: dbt compile (syntax check)
        run: dbt compile --profiles-dir . --target ci
        working-directory: dbt_project

      - name: dbt test (dev schema)
        run: dbt test --profiles-dir . --target ci --store-failures
        working-directory: dbt_project
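The workflow calls a `scripts/validate_plan.py` helper that is not shown here. A minimal sketch of what such a script might look like, assuming the `terraform show -json` plan format (each entry in `resource_changes` carries an `address` and a `change.actions` list); argument parsing is omitted for brevity:

```python
"""Hypothetical sketch of scripts/validate_plan.py: fail CI when a
Terraform plan would destroy resources matching protected patterns."""
import json
import sys

def find_destroys(plan: dict, protected_patterns: list) -> list:
    """Return addresses of planned destroys matching any protected pattern.

    Assumes the `terraform show -json` format: resource_changes is a list
    of {"address": ..., "change": {"actions": [...]}} entries, where a
    destroy (including replace) includes "delete" among its actions.
    """
    violations = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        if "delete" in actions:
            address = change["address"]
            if any(pattern in address for pattern in protected_patterns):
                violations.append(address)
    return violations

def main(plan_path: str, patterns: list) -> int:
    """Load a JSON plan file and return a non-zero exit code on violations."""
    with open(plan_path) as f:
        plan = json.load(f)
    violations = find_destroys(plan, patterns)
    for address in violations:
        print(f"BLOCKED: plan would destroy {address}", file=sys.stderr)
    return 1 if violations else 0
```

Matching on substrings of the resource address (as the workflow's `--no-destroy-pattern` flags suggest) keeps the check simple; a stricter version could match full addresses or resource types.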
Environment Promotion Strategy
Changes promote through dev, staging, and prod in sequence. Each environment has its own state and tfvars, and promotion to production is gated behind a release approval.
Variable Sets Per Environment
# environments/prod/terraform.tfvars
environment = "prod"
databricks_cluster_size = "Large"
min_workers = 2
max_workers = 20
enable_spot = true
spot_bid_price_pct = 80
backup_retention = 30
alert_email = "data-platform-oncall@company.com"
State Management and Secrets
Remote State with Locking
# shared/backend.tf
terraform {
  backend "s3" {
    bucket = "company-terraform-state"
    # Backend blocks cannot interpolate variables; the state key is
    # supplied per environment at init time, e.g.
    #   terraform init -backend-config="key=staging/terraform.tfstate"
    region         = "eu-west-1"
    dynamodb_table = "terraform-state-locks"
    encrypt        = true
    kms_key_id     = "arn:aws:kms:eu-west-1:123456789:key/mrk-xxx"
  }
}
Secrets via Vault + Terraform
data "vault_generic_secret" "databricks_token" {
  path = "secret/data-platform/${var.environment}/databricks"
}

resource "databricks_token" "pipeline_sa" {
  comment          = "CI/CD service account — managed by Terraform"
  lifetime_seconds = 7776000 # 90 days, rotated by Vault
}

resource "vault_generic_secret" "databricks_token_output" {
  path = "secret/data-platform/${var.environment}/databricks"

  data_json = jsonencode({
    token = databricks_token.pipeline_sa.token_value
  })
}
Common Pitfalls and Mitigations
| Pitfall | Symptom | Mitigation |
|---|---|---|
| State drift | `terraform plan` shows unexpected changes | Enforce `terraform apply` only via CI/CD; protect the state bucket |
| Out-of-band schema changes | Manual changes in the console or SQL client cause drift | Require all changes via PR; run drift detection in CI |
| Secret sprawl | Credentials scattered across git, S3, Parameter Store, and Vault | Use a single secrets backend; Terraform reads secrets, never stores them |
| Module versioning | One module update breaks every consumer at once | Pin module versions; publish through a private registry |
| Long plan times | 45-minute plans frustrate developers | Split state into smaller root modules; use `-target` sparingly for hotfixes |
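The drift-detection mitigation can be scripted around `terraform plan -detailed-exitcode`, which exits 0 when state matches reality, 2 when changes are pending, and 1 on error. A hypothetical scheduled-job sketch (the function names and alerting hook are illustrative):

```python
import subprocess

def classify_exit(code: int) -> str:
    """Map `terraform plan -detailed-exitcode` exit codes to a status.
    0 = no changes, 1 = error, 2 = pending changes (i.e. drift)."""
    return {0: "in-sync", 1: "error", 2: "drift"}.get(code, "unknown")

def check_drift(workdir: str) -> str:
    """Run a read-only plan against workdir and report its drift status.

    -lock=false avoids contending with real applies; -input=false keeps
    the run non-interactive, as a scheduled CI job requires.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-lock=false"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_exit(result.returncode)
```

A scheduled workflow can call `check_drift` for each environment directory and page the on-call address from the tfvars when anything returns `"drift"`.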
Summary
IaC for data platforms isn't just about Terraform. It's a discipline that spans cloud resources (L1), data infrastructure (L2), schemas and pipelines (L3), and data products (L4). Each layer needs version control, CI/CD gates, and environment promotion.
The teams that do this well have a single pull request flow for everything: a developer changes a dbt model, Terraform module, and Airflow DAG in the same commit, CI validates all three layers, and promotion to production is a one-click release.