
Cloud Data Platform Cost Management Guide

13 min read · Tags: FinOps, cost-management, cloud-costs, Databricks, BigQuery, Snowflake, Redshift

Cloud data platform bills are notoriously surprising. A team that spins up a new Databricks workspace for "quick exploration" can generate $40k in monthly spend within two weeks. A poorly written BigQuery query scanning full tables costs more per run than the analyst's hourly rate. This guide gives you the frameworks and specific techniques to bring those costs under control.


The Cost Anatomy of a Data Platform

Understanding where money goes is the prerequisite for managing it.

Most teams are surprised to find compute dominates even when they feel "mostly idle." The culprit is almost always auto-scaling that scales up but doesn't scale down, and long-running clusters that outlive their workloads.


Compute Optimisation

1. Aggressive Auto-termination

The single highest-ROI change for most platforms. Default auto-termination settings are often set to "never" or "4 hours" by developers who don't want to wait for cold starts.

Databricks:

# Set via API or Terraform — enforce via cluster policy
cluster_policy = {
    "autotermination_minutes": {
        "type": "range",
        "minValue": 10,
        "maxValue": 60,
        "defaultValue": 20,
        "hidden": False
    }
}
# Audit running clusters with no active workload
databricks clusters list --output json | \
  jq '.[] | select(.state == "RUNNING") |
      {id: .cluster_id, name: .cluster_name,
       last_activity: .last_activity_time}'

BigQuery: Slot reservations vs on-demand is the key decision:

| Mode | Best for | Monthly cost (100 TB scanned) |
| --- | --- | --- |
| On-demand | Unpredictable, bursty workloads | ~$625 |
| Slot reservation (100 slots) | Predictable high-volume | ~$2,000 flat |
| Flex slots (hourly) | Batch jobs, scheduled | ~$400 if <200 h/month |

The break-even for on-demand vs reservation is roughly 400 slot-hours/month. Profile your workload before committing.
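A complementary way to frame the decision is monthly scan volume. This sketch compares the two pricing models using the illustrative figures from the table above (treat the rates as assumptions and check current BigQuery pricing before committing):

```python
# Compare BigQuery on-demand vs flat-rate reservation for a month.
# Rates below are illustrative assumptions, not current list prices.
ON_DEMAND_PER_TB = 6.25        # $ per TB scanned, on-demand
RESERVATION_MONTHLY = 2000.0   # $ flat for a 100-slot reservation

def monthly_cost(tb_scanned: float) -> dict:
    """Return the monthly cost under each pricing model."""
    return {
        "on_demand": tb_scanned * ON_DEMAND_PER_TB,
        "reservation": RESERVATION_MONTHLY,
    }

# Scan volume above which the flat reservation becomes cheaper.
break_even_tb = RESERVATION_MONTHLY / ON_DEMAND_PER_TB  # 320 TB/month

print(monthly_cost(100))  # light usage: on-demand wins
print(break_even_tb)
```

Below the break-even volume on-demand wins; above it, the reservation does — which is why profiling actual scan volume first matters.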

2. Spot / Preemptible Instances for Batch

Most data engineering workloads (ELT, training data prep, model batch scoring) tolerate interruptions with proper checkpointing. Spot instances reduce compute costs by 60-80%.

# Databricks — spot with fallback (Terraform)
resource "databricks_cluster" "etl_spot" {
  cluster_name            = "etl-spot"
  spark_version           = "14.3.x-scala2.12"
  node_type_id            = "m5d.2xlarge"
  num_workers             = 4
  autotermination_minutes = 20

  aws_attributes {
    availability           = "SPOT_WITH_FALLBACK"
    spot_bid_price_percent = 80
    first_on_demand        = 1  # driver always on-demand
  }

  spark_conf = {
    "spark.sql.adaptive.enabled"                    = "true"
    "spark.sql.adaptive.coalescePartitions.enabled" = "true"
  }
}

3. Compute Rightsizing

The most common waste pattern: teams request large clusters "just in case" and never downsize.

# Find Databricks clusters consistently using <20% CPU
databricks clusters list --output json | jq -r '.[].cluster_id' | while read -r id; do
    # Pull metrics from Ganglia or custom logging
    echo "Cluster: $id"
    databricks clusters get --cluster-id "$id" | \
        jq '.default_tags, .node_type_id, .num_workers'
done

Establish a quarterly rightsizing review — compare actual CPU/memory utilisation against provisioned capacity for every persistent cluster.
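The review itself can start from a simple utilisation screen. A minimal sketch, assuming you have already exported per-cluster CPU samples (the 20% threshold and input shape are illustrative):

```python
# Flag clusters whose average CPU utilisation stays under a threshold.
# Input shape is an assumption: {cluster_id: [cpu_pct samples]}.
def underutilised(samples: dict[str, list[float]], threshold: float = 20.0) -> list[str]:
    return sorted(
        cid for cid, cpu in samples.items()
        if cpu and sum(cpu) / len(cpu) < threshold
    )

metrics = {
    "etl-prod":    [55.0, 61.2, 48.9],
    "dev-sandbox": [4.1, 7.8, 3.2],   # obvious downsizing candidate
    "ml-train":    [19.0, 12.5, 8.0],
}
print(underutilised(metrics))  # ['dev-sandbox', 'ml-train']
```

Feed the flagged list into the quarterly review rather than downsizing automatically — some clusters are sized for memory, not CPU.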


Storage Optimisation

1. Lifecycle Policies

For a 100 TB dataset ingesting 1 TB/day, the difference between "everything in Standard" and a proper lifecycle policy is ~$1,500/month.
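That figure is easy to sanity-check. In this sketch both the per-GB prices and the tier split are illustrative assumptions (S3 list prices vary by region and change over time):

```python
# Monthly cost of 100 TB: all-Standard vs a lifecycle-tiered split.
# Per-GB prices and the split are assumptions, not current list prices.
PRICE_PER_GB = {"standard": 0.023, "ia": 0.0125, "glacier_ir": 0.004}
GB_PER_TB = 1024

def cost(split_tb: dict[str, float]) -> float:
    """Monthly storage cost for a {tier: TB} split."""
    return sum(tb * GB_PER_TB * PRICE_PER_GB[tier] for tier, tb in split_tb.items())

all_standard = cost({"standard": 100})
tiered = cost({"standard": 15, "ia": 30, "glacier_ir": 55})  # assumed split

print(round(all_standard))           # ~2355 $/month
print(round(all_standard - tiered))  # saving from the lifecycle policy
```

With this particular split the saving lands near $1,400/month; more aggressive archival pushes it toward the ~$1,500 cited above.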

2. Table Format Compaction

Delta Lake and Apache Iceberg tables accumulate small files aggressively. 10,000 files of 1 MB each cost more to read (and more in S3 LIST operations) than 100 files of 100 MB each.

# Delta Lake — run weekly via scheduled job
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Optimise and Z-order by most common filter columns
dt = DeltaTable.forPath(spark, "s3://lakehouse/silver/events/")
dt.optimize().executeZOrderBy("event_date", "user_id")

# Vacuum old versions (default retention = 7 days)
dt.vacuum(retentionHours=168)

3. Column Pruning and Compression

-- BigQuery: avoid SELECT * in production queries
-- This scans 500 GB:
SELECT * FROM events WHERE date = '2024-01-01';

-- This scans 8 GB (same rows, only the needed columns):
SELECT user_id, event_type, properties
FROM events
WHERE date = '2024-01-01';

At BigQuery's on-demand pricing ($6.25/TB), that single query change saves $3.07 per execution. At 50 executions/day, that's $56k/year from one query.
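The arithmetic behind that claim, as a quick sanity check (using decimal GB-to-TB conversion, as the text does):

```python
# Per-run and annual savings from replacing SELECT * with explicit columns.
PRICE_PER_TB = 6.25                        # on-demand rate used in the text
gb_saved = 500 - 8                         # GB no longer scanned per run
per_run = gb_saved / 1000 * PRICE_PER_TB   # ≈ $3.07 per execution
annual = per_run * 50 * 365                # 50 executions/day

print(round(per_run, 2), round(annual))    # ≈ 3.07  ≈ 56119
```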


Query Cost Management

BigQuery Query Cost Estimation

Always estimate before running:

# Dry run — shows bytes processed, no charge
bq query \
  --use_legacy_sql=false \
  --dry_run \
  'SELECT user_id, COUNT(*) AS sessions
   FROM `project.dataset.sessions`
   WHERE date BETWEEN "2024-01-01" AND "2024-03-31"
   GROUP BY 1'
# Output: Query successfully validated. Assuming the tables are not modified,
# running this query will process 47,832,918,016 bytes.
# Cost: ~$0.30

Snowflake — Warehouse Sizing and Suspension

-- Create a warehouse that auto-suspends after 1 minute of inactivity
CREATE WAREHOUSE analytics_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE
  COMMENT = 'Analyst queries — suspend after 1 min idle';

-- Monitor warehouse credit consumption
SELECT 
    warehouse_name,
    SUM(credits_used) as total_credits,
    SUM(credits_used) * 3.00 as estimated_cost_usd,  -- adjust for your rate
    COUNT(DISTINCT query_id) as query_count
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY 1
ORDER BY 2 DESC;

Query Cost Attribution

Tag every query with team and use-case metadata:

-- Snowflake: use session-level tags
ALTER SESSION SET QUERY_TAG = '{"team":"analytics","project":"q1-reporting","owner":"alice"}';

SELECT ... ;  -- query is tagged

-- BigQuery: use job labels
-- In Python client:
from google.cloud import bigquery

job_config = bigquery.QueryJobConfig(
    labels={
        "team": "analytics",
        "project": "q1-reporting",
        "owner": "alice"
    }
)

This feeds into your FinOps cost allocation reports.
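Once jobs are tagged, attribution is a group-by over your billing or job-history export. A minimal sketch, where the record shape is an assumption standing in for a real export row:

```python
# Aggregate query costs by the "team" label, surfacing untagged spend too.
from collections import defaultdict

jobs = [  # assumed shape of rows from a billing/job-history export
    {"labels": {"team": "analytics"}, "cost_usd": 12.40},
    {"labels": {"team": "ml"},        "cost_usd": 31.75},
    {"labels": {"team": "analytics"}, "cost_usd": 4.10},
    {"labels": {},                    "cost_usd": 2.00},  # untagged
]

by_team: dict[str, float] = defaultdict(float)
for job in jobs:
    by_team[job["labels"].get("team", "UNTAGGED")] += job["cost_usd"]

print(dict(by_team))  # analytics ≈ 16.50, ml 31.75, UNTAGGED 2.00
```

Tracking the `UNTAGGED` bucket explicitly is the point: if it grows, your tagging policy is not being enforced.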


FinOps Framework for Data Teams

The Three-Phase Maturity Model

Phase 1 — Visibility (weeks 1-4):

  • Enable detailed billing exports to your data warehouse
  • Build a cost dashboard (by team, environment, service)
  • Set up budget alerts at 80% and 100% of monthly budget

Phase 2 — Optimisation (months 2-3):

  • Identify top 10 cost drivers
  • Apply quick wins: auto-termination, lifecycle policies, query optimisation
  • Establish unit economics (cost per pipeline run, cost per TB processed)
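
Unit economics are simply attributed cost divided by a work metric. With hypothetical figures:

```python
# Cost per pipeline run and per TB processed — the two unit metrics above.
monthly_pipeline_cost = 4200.0   # $ attributed to one pipeline (assumed)
runs_per_month = 600
tb_processed = 35.0

cost_per_run = monthly_pipeline_cost / runs_per_month  # $7.00 per run
cost_per_tb = monthly_pipeline_cost / tb_processed     # $120.00 per TB

print(cost_per_run, cost_per_tb)  # 7.0 120.0
```

Track these over time rather than as snapshots — a rising cost-per-run with flat data volume is an early regression signal.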

Phase 3 — Operations (ongoing):

  • Weekly anomaly detection alerts
  • Monthly rightsizing review
  • Quarterly commitment (reserved instances / committed use) renewal
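
Weekly anomaly alerts don't need a heavy tool to start: a z-score over daily spend catches most runaway-cluster spikes. The threshold and data below are illustrative:

```python
# Flag days whose spend deviates more than k standard deviations from the mean.
import statistics

def cost_anomalies(daily_costs: list[float], k: float = 2.0) -> list[int]:
    """Return indices of days whose spend is a >k-sigma outlier."""
    mean = statistics.mean(daily_costs)
    stdev = statistics.pstdev(daily_costs)
    if stdev == 0:
        return []
    return [i for i, c in enumerate(daily_costs) if abs(c - mean) > k * stdev]

# Two weeks of daily spend with one spike (index 9, e.g. a forgotten cluster).
spend = [510, 495, 520, 505, 498, 515, 502, 508, 512, 1450, 497, 509, 503, 511]
print(cost_anomalies(spend))  # [9]
```

A single large outlier inflates the standard deviation, so for production use a median-based variant (or a rolling baseline) is more robust.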

Budget Alerting with Terraform

# AWS Budget alert
resource "aws_budgets_budget" "data_platform" {
  name         = "data-platform-monthly"
  budget_type  = "COST"
  limit_amount = "15000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  cost_filter {
    name   = "TagKeyValue"
    values = ["user:team$data-platform"]  # user-defined tag team = data-platform
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["data-platform-lead@company.com"]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = ["data-platform-lead@company.com", "cto@company.com"]
  }
}

Cost Benchmarks by Workload Type

| Workload | Typical cost (1 TB/day) | Key optimisation |
| --- | --- | --- |
| Batch ETL (Spark) | $200-800/month | Spot instances, optimised shuffles |
| Streaming (Kafka + Flink) | $400-1,200/month | Partition tuning, backpressure |
| Ad-hoc analytics (BigQuery) | $100-2,000/month | Partition pruning, slot reservations |
| ML training (GPU) | $500-5,000/month | Spot, preemption-aware training |
| Data serving (APIs) | $50-300/month | Caching, read replicas |

Practical Quick Wins Checklist

Run through this list for an immediate 20-40% cost reduction in most environments:

  • Set auto-termination ≤ 60 min on all development clusters
  • Enable S3 Intelligent-Tiering on buckets holding data older than 30 days
  • Schedule off-hours shutdown of non-production warehouses
  • Replace SELECT * with explicit column lists in top 20 queries by frequency
  • Enable Delta/Iceberg compaction weekly jobs
  • Tag all resources with team, env, project
  • Set up Savings Plans / Committed Use for baseline compute
  • Move development to smaller instance types
  • Enable query result caching in your warehouse
  • Audit cross-AZ and cross-region data transfer

Tooling Overview

| Tool | Best for | Open source? |
| --- | --- | --- |
| AWS Cost Explorer | AWS cost analysis | No (free tier) |
| Infracost | Terraform cost preview in CI | Yes |
| CloudHealth | Multi-cloud FinOps | No |
| Vantage | Data platform-focused FinOps | No |
| OpenCost | Kubernetes cost allocation | Yes |
| Databricks Cost Dashboard | Databricks-specific | Built-in |

Summary

Cost management for data platforms is an ongoing operational discipline, not a one-time project. The biggest wins come from compute auto-termination (usually immediate), storage lifecycle policies (next-highest ROI), and query optimisation (highest engineering effort, highest long-term impact).

Build visibility first, then optimise, then operationalise. The teams that do this well treat unit economics — cost per pipeline run, cost per query, cost per TB — as first-class engineering metrics alongside latency and reliability.


Try Harbinger Explorer free for 7 days — understand your data platform API usage and identify expensive, redundant endpoint calls before they inflate your bill. harbingerexplorer.com

