Cloud Data Platform Cost Management Guide
Cloud data platform bills are notoriously surprising. A team that spins up a new Databricks workspace for "quick exploration" can generate $40k in monthly spend within two weeks. A poorly written BigQuery query scanning full tables costs more per run than the analyst's hourly rate. This guide gives you the frameworks and specific techniques to bring those costs under control.
The Cost Anatomy of a Data Platform
Understanding where money goes is the prerequisite for managing it. Spend typically splits across four buckets: compute, storage, data transfer, and per-query scanning charges.
Most teams are surprised to find compute dominates even when they feel "mostly idle." The culprit is almost always auto-scaling that scales up but doesn't scale down, and long-running clusters that outlive their workloads.
Compute Optimisation
1. Aggressive Auto-termination
The single highest-ROI change for most platforms. Default auto-termination settings are often set to "never" or "4 hours" by developers who don't want to wait for cold starts.
Databricks:
```python
# Set via API or Terraform — enforce via cluster policy
cluster_policy = {
    "autotermination_minutes": {
        "type": "range",
        "minValue": 10,
        "maxValue": 60,
        "defaultValue": 20,
        "hidden": False,
    }
}
```
```shell
# Audit running clusters with no active workload
databricks clusters list --output json | jq '.[]
  | select(.state == "RUNNING")
  | {id: .cluster_id, name: .cluster_name, last_activity: .last_activity_time}'
```
BigQuery: Slot reservations vs on-demand is the key decision:
| Mode | Best for | Monthly cost (100 TB scanned) |
|---|---|---|
| On-demand | Unpredictable, bursty workloads | ~$625 (at $6.25/TB) |
| Slot reservation (100 slots) | Predictable high-volume | ~$2,000 flat |
| Flex slots (hourly) | Batch jobs, scheduled | ~$400 if <200h/month |
The break-even for on-demand vs reservation is roughly 400 slot-hours/month. Profile your workload before committing.
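As a quick sanity check before profiling, the break-even volume falls straight out of the two rates above (the $6.25/TB and $2,000/month figures; treat them as illustrative list prices):

```python
# Break-even between BigQuery on-demand scanning and a flat slot reservation.
# Rates are assumptions from the table above: $6.25/TB on-demand,
# $2,000/month for a 100-slot reservation.
ON_DEMAND_PER_TB = 6.25
RESERVATION_PER_MONTH = 2000.0

def cheaper_mode(tb_scanned_per_month: float) -> str:
    """Return the cheaper pricing mode for a given monthly scan volume."""
    on_demand = tb_scanned_per_month * ON_DEMAND_PER_TB
    return "on-demand" if on_demand < RESERVATION_PER_MONTH else "reservation"

break_even_tb = RESERVATION_PER_MONTH / ON_DEMAND_PER_TB

print(cheaper_mode(100))   # on-demand (100 TB costs ~$625)
print(cheaper_mode(500))   # reservation (500 TB on-demand would cost ~$3,125)
print(break_even_tb)       # 320.0 TB/month
```

Below ~320 TB scanned per month, on-demand wins at these rates; above it, the reservation does.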
2. Spot / Preemptible Instances for Batch
Most data engineering workloads (ELT, training data prep, model batch scoring) tolerate interruptions with proper checkpointing. Spot instances reduce compute costs by 60-80%.
```hcl
# Databricks — spot workers with on-demand fallback.
# Note: a standalone cluster resource is shown here; for jobs, put the same
# settings in a job_cluster block inside a databricks_job resource.
resource "databricks_cluster" "etl_spot" {
  cluster_name            = "etl-spot"
  spark_version           = "13.3.x-scala2.12"
  node_type_id            = "m5d.2xlarge"
  num_workers             = 4
  autotermination_minutes = 20

  aws_attributes {
    availability           = "SPOT_WITH_FALLBACK"
    spot_bid_price_percent = 80
    first_on_demand        = 1 # driver always on-demand
  }

  spark_conf = {
    "spark.sql.adaptive.enabled"                    = "true"
    "spark.sql.adaptive.coalescePartitions.enabled" = "true"
  }
}
```
3. Compute Rightsizing
The most common waste pattern: teams request large clusters "just in case" and never downsize.
```shell
# List the provisioned configuration of every Databricks cluster, to compare
# against observed utilisation (pull actual CPU/memory metrics from Ganglia
# or your own monitoring — the CLI does not expose them directly)
databricks clusters list --output json | jq -r '.[].cluster_id' | while read id; do
  echo "Cluster: $id"
  databricks clusters get --cluster-id "$id" \
    | jq '{tags: .default_tags, node_type: .node_type_id, workers: .num_workers}'
done
```
Establish a quarterly rightsizing review — compare actual CPU/memory utilisation against provisioned capacity for every persistent cluster.
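The review itself can be scripted. A minimal sketch, assuming you have exported average CPU utilisation per cluster from your monitoring (the input dicts here are hypothetical stand-ins for that export):

```python
# Flag persistent clusters whose average CPU utilisation is below a threshold.
# `clusters` mimics `databricks clusters list` output; `avg_cpu_by_id` is a
# hypothetical export from your metrics store.
def rightsizing_candidates(clusters, avg_cpu_by_id, threshold=0.20):
    """Return (cluster_id, node_type, workers, avg_cpu) tuples worth downsizing."""
    flagged = []
    for c in clusters:
        cpu = avg_cpu_by_id.get(c["cluster_id"])
        if cpu is not None and cpu < threshold:
            flagged.append((c["cluster_id"], c["node_type_id"], c["num_workers"], cpu))
    return flagged

clusters = [
    {"cluster_id": "0101-abc", "node_type_id": "m5d.2xlarge", "num_workers": 8},
    {"cluster_id": "0102-def", "node_type_id": "m5d.4xlarge", "num_workers": 16},
]
avg_cpu = {"0101-abc": 0.12, "0102-def": 0.55}

print(rightsizing_candidates(clusters, avg_cpu))
# [('0101-abc', 'm5d.2xlarge', 8, 0.12)]
```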
Storage Optimisation
1. Lifecycle Policies
For a 100 TB dataset ingesting 1 TB/day, the difference between "everything in Standard" and a proper lifecycle policy is ~$1,500/month.
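That figure can be reproduced with list-price arithmetic. The per-GB rates and the hot/warm/cold split below are illustrative assumptions, not your actual access pattern:

```python
# Monthly S3 cost for 100 TB: everything in Standard vs a tiered layout.
# Assumed list prices per GB-month: Standard $0.023, Infrequent Access
# $0.0125, Glacier Instant Retrieval $0.004.
RATES = {"standard": 0.023, "ia": 0.0125, "glacier_ir": 0.004}
TOTAL_GB = 100 * 1024

all_standard = TOTAL_GB * RATES["standard"]

# Assumed steady state after lifecycle transitions: 10% hot, 30% warm, 60% cold
tiered = (0.10 * TOTAL_GB * RATES["standard"]
          + 0.30 * TOTAL_GB * RATES["ia"]
          + 0.60 * TOTAL_GB * RATES["glacier_ir"])

print(round(all_standard))           # 2355
print(round(tiered))                 # 865
print(round(all_standard - tiered))  # ~1490/month — in line with the ~$1,500 above
```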
2. Table Format Compaction
Delta Lake and Apache Iceberg tables accumulate small files aggressively. 10,000 files of 1 MB each cost more to read (and more in S3 LIST operations) than 100 files of 100 MB each.
```python
# Delta Lake — run weekly via scheduled job
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Optimise and Z-order by the most common filter columns
dt = DeltaTable.forPath(spark, "s3://lakehouse/silver/events/")
dt.optimize().executeZOrderBy("event_date", "user_id")

# Vacuum old versions (168 hours = the 7-day default retention)
dt.vacuum(retentionHours=168)
```
3. Column Pruning and Compression
```sql
-- BigQuery: avoid SELECT * in production queries
-- This scans 500 GB:
SELECT * FROM events WHERE date = '2024-01-01';

-- This scans 8 GB (the same columns downstream consumers actually use):
SELECT user_id, event_type, properties
FROM events
WHERE date = '2024-01-01';
```
At BigQuery's on-demand pricing ($6.25/TB), that single query change saves $3.07 per execution. At 50 executions/day, that's $56k/year from one query.
Query Cost Management
BigQuery Query Cost Estimation
Always estimate before running:
```shell
# Dry run — shows bytes processed, no charge
bq query --use_legacy_sql=false --dry_run 'SELECT user_id, COUNT(*) AS sessions
FROM `project.dataset.sessions`
WHERE date BETWEEN "2024-01-01" AND "2024-03-31"
GROUP BY 1'
# Output: Query successfully validated. Assuming the tables are not modified,
# running this query will process 47,832,918,016 bytes.
# Cost: ~$0.30
```
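A small helper turns the dry-run byte count into a dollar estimate (assuming the $6.25/TB on-demand rate and decimal units):

```python
# Convert a BigQuery dry-run byte count into an estimated on-demand cost.
PRICE_PER_TB = 6.25

def estimated_cost_usd(bytes_processed: int) -> float:
    """Estimate on-demand query cost from a dry-run byte count."""
    return round(bytes_processed / 1e12 * PRICE_PER_TB, 2)

print(estimated_cost_usd(47_832_918_016))  # 0.3 — matches the ~$0.30 above
```

Wiring this into CI as a pre-merge check on changed queries catches expensive regressions before they ship.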
Snowflake — Warehouse Sizing and Suspension
```sql
-- Create a warehouse that auto-suspends after 1 minute of inactivity
CREATE WAREHOUSE analytics_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE
  COMMENT = 'Analyst queries — suspend after 1 min idle';

-- Monitor warehouse credit consumption over the last 30 days
SELECT
    warehouse_name,
    SUM(credits_used) AS total_credits,
    SUM(credits_used) * 3.00 AS estimated_cost_usd, -- adjust for your rate
    COUNT(DISTINCT query_id) AS query_count
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY 1
ORDER BY 2 DESC;
```
Query Cost Attribution
Tag every query with team and use-case metadata:
```sql
-- Snowflake: use session-level tags
ALTER SESSION SET QUERY_TAG = '{"team":"analytics","project":"q1-reporting","owner":"alice"}';
SELECT ... ; -- every query in the session now carries the tag
```
In BigQuery, the equivalent is a job label set via the Python client:

```python
# BigQuery: attach job labels via the Python client
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(
    labels={
        "team": "analytics",
        "project": "q1-reporting",
        "owner": "alice",
    }
)
client.query("SELECT 1", job_config=job_config)
```
This feeds into your FinOps cost allocation reports.
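The attribution step itself is a small transform: parse the JSON `QUERY_TAG` values exported from `account_usage` and roll credits up by team. A sketch with illustrative input rows:

```python
import json
from collections import defaultdict

# Roll warehouse credits up by the "team" field embedded in QUERY_TAG JSON.
# `rows` mimics a join of query_history (query_tag) with metering data.
rows = [
    {"query_tag": '{"team":"analytics","project":"q1-reporting"}', "credits": 1.5},
    {"query_tag": '{"team":"ml","project":"feature-store"}', "credits": 4.0},
    {"query_tag": '{"team":"analytics","project":"adhoc"}', "credits": 0.5},
    {"query_tag": "", "credits": 2.0},  # untagged spend — attribute to "unknown"
]

credits_by_team = defaultdict(float)
for row in rows:
    try:
        team = json.loads(row["query_tag"]).get("team", "unknown")
    except (json.JSONDecodeError, AttributeError):
        team = "unknown"
    credits_by_team[team] += row["credits"]

print(dict(credits_by_team))
# {'analytics': 2.0, 'ml': 4.0, 'unknown': 2.0}
```

The "unknown" bucket is worth surfacing prominently: a large untagged share means the tagging policy is not being enforced.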
FinOps Framework for Data Teams
The Three-Phase Maturity Model
Phase 1 — Visibility (weeks 1-4):
- Enable detailed billing exports to your data warehouse
- Build a cost dashboard (by team, environment, service)
- Set up budget alerts at 80% and 100% of monthly budget
Phase 2 — Optimisation (months 2-3):
- Identify top 10 cost drivers
- Apply quick wins: auto-termination, lifecycle policies, query optimisation
- Establish unit economics (cost per pipeline run, cost per TB processed)
Phase 3 — Operations (ongoing):
- Weekly anomaly detection alerts
- Monthly rightsizing review
- Quarterly commitment (reserved instances / committed use) renewal
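The weekly anomaly check in Phase 3 can start as simple as a z-score over daily spend (a minimal sketch; a production alert should also account for weekday/weekend seasonality):

```python
from statistics import mean, stdev

def is_spend_anomaly(history: list[float], today: float, z_threshold: float = 3.0) -> bool:
    """Flag today's spend if it sits more than z_threshold std devs above the mean."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today > mu
    return (today - mu) / sigma > z_threshold

daily_spend = [480, 510, 495, 505, 520, 490, 500]  # last week's daily USD

print(is_spend_anomaly(daily_spend, 530))  # False — within normal variation
print(is_spend_anomaly(daily_spend, 900))  # True — page someone
```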
Budget Alerting with Terraform
```hcl
# AWS Budgets alert scoped to the data platform's cost-allocation tag
resource "aws_budgets_budget" "data_platform" {
  name         = "data-platform-monthly"
  budget_type  = "COST"
  limit_amount = "15000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  cost_filter {
    name   = "TagKeyValue"
    values = ["user:team$data-platform"] # user-defined tags need the "user:" prefix
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["data-platform-lead@company.com"]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = ["data-platform-lead@company.com", "cto@company.com"]
  }
}
```
Cost Benchmarks by Workload Type
| Workload | Typical Cost (1 TB/day) | Key Optimisation |
|---|---|---|
| Batch ETL (Spark) | $200-800/month | Spot instances, optimised shuffles |
| Streaming (Kafka + Flink) | $400-1200/month | Partition tuning, backpressure |
| Ad-hoc analytics (BigQuery) | $100-2000/month | Partition pruning, slot reservations |
| ML training (GPU) | $500-5000/month | Spot, preemption-aware training |
| Data serving (APIs) | $50-300/month | Caching, read replicas |
Practical Quick Wins Checklist
Run through this list for an immediate 20-40% cost reduction in most environments:
- Set auto-termination ≤ 60 min on all development clusters
- Enable S3 Intelligent-Tiering on all buckets > 30 days old
- Schedule off-hours shutdown of non-production warehouses
- Replace `SELECT *` with explicit column lists in the top 20 queries by frequency
- Enable weekly Delta/Iceberg compaction jobs
- Tag all resources with `team`, `env`, `project`
- Set up Savings Plans / Committed Use discounts for baseline compute
- Move development to smaller instance types
- Enable query result caching in your warehouse
- Audit cross-AZ and cross-region data transfer
Tooling Overview
| Tool | Best for | Open source? |
|---|---|---|
| AWS Cost Explorer | AWS cost analysis | No (free tier) |
| Infracost | Terraform cost preview in CI | Yes |
| CloudHealth | Multi-cloud FinOps | No |
| Vantage | Data platform-focused FinOps | No |
| OpenCost | Kubernetes cost allocation | Yes |
| Databricks Cost Dashboard | Databricks-specific | Built-in |
Summary
Cost management for data platforms is an ongoing operational discipline, not a one-time project. The biggest wins come from compute auto-termination (usually immediate), storage lifecycle policies (next-highest ROI), and query optimisation (highest engineering effort, highest long-term impact).
Build visibility first, then optimise, then operationalise. The teams that do this well treat unit economics — cost per pipeline run, cost per query, cost per TB — as first-class engineering metrics alongside latency and reliability.
Try Harbinger Explorer free for 7 days — understand your data platform API usage and identify expensive, redundant endpoint calls before they inflate your bill. harbingerexplorer.com