Cloud Cost Allocation Strategies for Data Teams

11 min read · Tags: finops, cost-optimization, cloud-costs, data-platform, terraform, aws


Data teams are often the largest consumers of cloud resources in an organization—and frequently the least visible from a finance perspective. Spark clusters running unattended overnight, full-table scans on petabyte datasets, ML training jobs that no one cancelled—these add up fast.

FinOps for data platforms isn't about cutting spending; it's about making spending visible, attributable, and intentional. This guide covers how to do that practically.


The Cost Attribution Problem

Most data platform costs are invisible because they're commingled in shared clusters, shared buckets, and shared accounts.

Without cost allocation, you can't answer "how much does the Marketing team's customer segmentation pipeline cost?" or "what's our cost per pipeline run for the fraud detection model?"


Foundation: Tagging Strategy

Tags are the foundation of cost allocation. Define a mandatory tagging schema and enforce it in CI/CD.

Core Tag Schema

| Tag Key | Example Values | Purpose |
|---|---|---|
| team | data-platform, marketing-analytics, ml-platform | Team chargeback |
| product | customer-360, fraud-detection, supply-chain-analytics | Product P&L |
| environment | prod, staging, dev | Environment budgets |
| pipeline | silver-orders-transform, ml-feature-store | Per-pipeline cost |
| cost-center | CC-1042, CC-3018 | Finance chargeback |
| data-classification | public, internal, confidential | Security + cost |
| managed-by | terraform, cdk, manual | Governance |
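One way to enforce this schema in CI/CD, before AWS Config ever sees the resources, is to scan the JSON plan output (`terraform show -json tfplan`) for missing tag keys. A minimal Python sketch; the plan payload below is simplified and the resource addresses are hypothetical:

```python
import json

REQUIRED_TAGS = {"team", "product", "environment", "cost-center", "managed-by"}

def missing_tags(plan_json: str) -> dict:
    """Return {resource_address: [missing tag keys]} for planned resources."""
    plan = json.loads(plan_json)
    failures = {}
    for change in plan.get("resource_changes", []):
        after = (change.get("change") or {}).get("after") or {}
        tags = after.get("tags") or {}
        missing = REQUIRED_TAGS - tags.keys()
        if missing:
            failures[change["address"]] = sorted(missing)
    return failures

# Example: a simplified `terraform show -json tfplan` payload
plan = json.dumps({
    "resource_changes": [
        {"address": "aws_s3_bucket.data_lake",
         "change": {"after": {"tags": {"team": "data-platform",
                                       "product": "customer-360",
                                       "environment": "prod",
                                       "cost-center": "CC-1042",
                                       "managed-by": "terraform"}}}},
        {"address": "aws_glue_job.silver_transform",
         "change": {"after": {"tags": {"team": "data-platform"}}}},
    ]
})
print(missing_tags(plan))
```

A non-empty result fails the pipeline, so untagged resources never reach an apply.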

Terraform: Enforce Tags with AWS Config

# Define required tags for all resources
resource "aws_config_config_rule" "required_tags" {
  name = "required-tags-data-platform"

  source {
    owner             = "AWS"
    source_identifier = "REQUIRED_TAGS"
  }

  input_parameters = jsonencode({
    tag1Key   = "team"
    tag2Key   = "product"
    tag3Key   = "environment"
    tag4Key   = "cost-center"
    tag5Key   = "managed-by"
  })

  scope {
    compliance_resource_types = [
      "AWS::S3::Bucket",
      "AWS::RDS::DBInstance",
      "AWS::EMR::Cluster",
      "AWS::Glue::Job",
      "AWS::Redshift::Cluster"
    ]
  }
}

# Tag policy in AWS Organizations
resource "aws_organizations_policy" "mandatory_tags" {
  name = "mandatory-tags-data-platform"
  type = "TAG_POLICY"

  content = jsonencode({
    tags = {
      team = {
        tag_value = {
          "@@assign" = ["data-platform", "marketing-analytics", "ml-platform", "finance-analytics"]
        }
      }
      environment = {
        tag_value = {
          "@@assign" = ["prod", "staging", "dev"]
        }
      }
    }
  })
}

Default Tags in Terraform Provider

# Always apply base tags to every AWS resource
provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      managed-by  = "terraform"
      environment = var.environment
      repository  = "github.com/myorg/data-platform"
      # Don't put timestamp() here: it changes on every plan and
      # forces a tag update on every resource, every apply.
    }
  }
}

Cost Allocation Models

Model 1: Showback (Visibility Only)

Show teams what they're spending without billing them. Best for building cost awareness culture.

# AWS Cost Explorer: cost by team tag
aws ce get-cost-and-usage \
  --time-period Start=$(date -d 'last month' +%Y-%m-01),End=$(date +%Y-%m-01) \
  --granularity MONTHLY \
  --metrics BlendedCost \
  --group-by Type=TAG,Key=team \
  --filter '{
    "And": [
      {"Dimensions": {"Key": "SERVICE", "Values": ["Amazon S3", "Amazon EMR", "Amazon Redshift", "AWS Glue"]}},
      {"Tags": {"Key": "environment", "Values": ["prod"]}}
    ]
  }' \
  --output table

Model 2: Chargeback (Hard Billing)

Teams are billed for their actual cloud consumption. Drives accountability but requires mature tagging.

| Team | S3 | Compute | DB | Transfer | Total |
|---|---|---|---|---|---|
| Data Platform (shared) | $1,200 | $3,400 | $900 | $400 | $5,900 |
| Marketing Analytics | $2,800 | $12,300 | $0 | $1,100 | $16,200 |
| ML Platform | $4,100 | $9,800 | $1,200 | $800 | $15,900 |
| Finance Analytics | $800 | $2,400 | $6,800 | $200 | $10,200 |
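Shared platform costs (the first row above) usually need to be redistributed before chargeback. A common approach apportions the shared pool in proportion to each team's direct spend; a sketch using the table's direct totals (team names and rounding are illustrative):

```python
def apportion_shared(direct: dict, shared: float) -> dict:
    """Split a shared cost pool across teams in proportion to direct spend."""
    total = sum(direct.values())
    return {team: round(cost + shared * cost / total, 2)
            for team, cost in direct.items()}

# Direct totals from the table above; the shared platform pool is $5,900
direct = {
    "marketing-analytics": 16200,
    "ml-platform": 15900,
    "finance-analytics": 10200,
}
print(apportion_shared(direct, 5900))
```

The apportioned totals sum to the full bill, so finance sees no unallocated remainder.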

Model 3: Unit Economics

Most actionable for data teams. Express costs in business terms:

| Metric | This Month | Target | Status |
|---|---|---|---|
| Cost per pipeline run (ETL) | $0.43 | < $0.50 | Under target |
| Cost per TB processed | $18.20 | < $20.00 | Under target |
| Cost per ML model training | $124 | < $100 | Over target |
| Cost per active dashboard | $12/mo | < $15/mo | Under target |
| Cost per data quality check | $0.002 | < $0.005 | Under target |
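Unit metrics like these fall out of simple division once costs are tagged per pipeline. A sketch, using hypothetical monthly totals chosen to reproduce the first two rows above:

```python
def unit_cost(total_cost: float, units: float) -> float:
    """Cost per business unit (run, TB, model, dashboard, ...)."""
    return total_cost / units

# Hypothetical monthly figures matching the table above
etl_monthly_cost, etl_runs = 12_900.0, 30_000   # tagged ETL spend, run count
scan_monthly_cost, tb_processed = 9_100.0, 500  # tagged scan spend, TB volume

print(round(unit_cost(etl_monthly_cost, etl_runs), 2))       # $ per pipeline run
print(round(unit_cost(scan_monthly_cost, tb_processed), 2))  # $ per TB processed
```

Tracking these ratios over time is more useful than the raw bill: a rising cost per run flags regressions even while total spend grows for healthy reasons.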

Compute Cost Optimization

Spot Instances for ETL Workloads

Most ETL jobs tolerate Spot interruption: they're batch, retryable, and not latency-sensitive. Use Spot for all compute that can be restarted safely.

# EMR cluster with Spot instances
resource "aws_emr_cluster" "etl_cluster" {
  name          = "silver-transform-${var.environment}"
  release_label = "emr-6.13.0"
  applications  = ["Spark", "Hadoop"]

  ec2_attributes {
    subnet_id        = aws_subnet.private[0].id
    instance_profile = aws_iam_instance_profile.emr.arn
  }

  master_instance_group {
    instance_type = "m5.xlarge"  # On-demand for master (not Spot)
  }

  core_instance_group {
    instance_type  = "r5.2xlarge"
    instance_count = 2
    # Core nodes can be On-demand for stability
  }

  auto_termination_policy {
    idle_timeout = 3600  # Terminate cluster after 1h idle
  }

  tags = {
    team        = "data-platform"
    pipeline    = "silver-etl"
    environment = var.environment
  }
}

# Task nodes on Spot — 60-80% cheaper. Task groups are a separate
# resource: aws_emr_cluster has no task_instance_group block.
resource "aws_emr_instance_group" "etl_task" {
  cluster_id     = aws_emr_cluster.etl_cluster.id
  instance_type  = "r5.2xlarge"
  instance_count = 4
  bid_price      = "0.15"  # Max Spot bid: $0.15/hr vs $0.504 On-demand

  ebs_config {
    size                 = 100
    type                 = "gp3"
    volumes_per_instance = 1
  }
}
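Spot savings only hold up if interrupted jobs are retried automatically rather than restarted by hand. A minimal orchestrator-side Python sketch (the exception type and job callable are hypothetical stand-ins for your scheduler's failure signal):

```python
import time

class SpotInterrupted(Exception):
    """Raised when a job fails because its Spot capacity was reclaimed."""

def run_with_spot_retry(job, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Retry a Spot-interrupted job with exponential backoff.

    Only SpotInterrupted triggers a retry; other failures propagate
    immediately so real bugs aren't masked by re-runs.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except SpotInterrupted:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))

# Example: a job that is interrupted twice, then succeeds
attempts = []
def flaky_job():
    attempts.append(1)
    if len(attempts) < 3:
        raise SpotInterrupted()
    return "done"

print(run_with_spot_retry(flaky_job, sleep=lambda s: None))  # -> done
```

Jobs must be idempotent (or checkpointed) for this to be safe; a re-run that double-writes output is worse than an interruption.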

Spot Fleet for Multi-Instance Diversification

resource "aws_spot_fleet_request" "ml_training" {
  iam_fleet_role  = aws_iam_role.spot_fleet.arn
  target_capacity = 10
  allocation_strategy = "diversified"  # Don't put all eggs in one pool

  launch_specification {
    instance_type = "r5.4xlarge"
    ami           = data.aws_ami.amazon_linux.id
    subnet_id     = aws_subnet.private[0].id
    spot_price    = "0.30"
  }

  launch_specification {
    instance_type = "r5a.4xlarge"
    ami           = data.aws_ami.amazon_linux.id
    subnet_id     = aws_subnet.private[1].id
    spot_price    = "0.28"
  }

  launch_specification {
    instance_type = "m5.8xlarge"
    ami           = data.aws_ami.amazon_linux.id
    subnet_id     = aws_subnet.private[2].id
    spot_price    = "0.35"
  }

  valid_until = timeadd(timestamp(), "720h")
}

Storage Cost Optimization

S3 Intelligent-Tiering

For data lake buckets where access patterns are unpredictable:

resource "aws_s3_bucket_intelligent_tiering_configuration" "data_lake" {
  bucket = aws_s3_bucket.data_lake["silver"].id
  name   = "EntireBucket"

  tiering {
    access_tier = "ARCHIVE_ACCESS"
    days        = 90
  }

  tiering {
    access_tier = "DEEP_ARCHIVE_ACCESS"
    days        = 180
  }
}

# Lifecycle rules: delete temp/staging data aggressively
resource "aws_s3_bucket_lifecycle_configuration" "data_lake" {
  bucket = aws_s3_bucket.data_lake["bronze"].id

  rule {
    id     = "expire-staging-data"
    status = "Enabled"

    filter {
      prefix = "staging/"
    }

    expiration {
      days = 7
    }

    noncurrent_version_expiration {
      noncurrent_days = 3
    }
  }

  rule {
    id     = "transition-bronze-to-ia"
    status = "Enabled"

    filter {
      prefix = "orders/"
    }

    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }

    transition {
      days          = 90
      storage_class = "GLACIER_IR"
    }

    expiration {
      days = 2555  # 7 years retention for compliance
    }
  }
}
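The 2555-day expiration above is simply 7 years at 365 days per year. Deriving lifecycle thresholds from the retention policy keeps such magic numbers self-documenting; a trivial sketch:

```python
def retention_days(years: int) -> int:
    """Lifecycle expiration in days for an N-year retention policy.

    S3 lifecycle rules take whole days, not years, so compliance
    retention periods must be converted before they reach Terraform.
    """
    return years * 365

print(retention_days(7))  # -> 2555, the expiration used above
```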

Column Pruning and Compression

Parquet with Snappy is not always optimal. For cold analytics data:

# Convert to Parquet with ZSTD compression (30-40% smaller than Snappy)
spark-submit --class com.myorg.CompressJob my-job.jar \
  --input s3://my-platform-silver/orders/ \
  --output s3://my-platform-silver/orders-compressed/ \
  --format parquet \
  --compression zstd \
  --zstd-level 3  # balance speed vs compression ratio

# Measure actual sizes
aws s3 ls --recursive s3://my-platform-silver/orders/ --summarize | tail -2
aws s3 ls --recursive s3://my-platform-silver/orders-compressed/ --summarize | tail -2

Query Cost Optimization

Athena: Cost by Query Tag

# Tag Athena queries for cost attribution
aws athena start-query-execution \
  --query-string "SELECT * FROM gold.orders WHERE dt = '2024-01-15'" \
  --work-group marketing-analytics \
  --query-execution-context Database=gold \
  --result-configuration OutputLocation=s3://my-platform-athena-results/marketing/

# Athena workgroup with per-query data scanned limit
resource "aws_athena_workgroup" "marketing" {
  name = "marketing-analytics"

  configuration {
    enforce_workgroup_configuration    = true
    publish_cloudwatch_metrics_enabled = true

    result_configuration {
      output_location = "s3://${aws_s3_bucket.athena_results.bucket}/marketing/"

      encryption_configuration {
        encryption_option = "SSE_KMS"
        kms_key_arn       = aws_kms_key.data_platform.arn
      }
    }

    bytes_scanned_cutoff_per_query = 10737418240  # 10 GB max per query
  }

  tags = {
    team = "marketing-analytics"
  }
}

Partition Pruning: The Single Biggest Athena Optimization

-- BAD: Full table scan ($$$)
SELECT COUNT(*) FROM gold.events
WHERE event_timestamp >= '2024-01-01';

-- GOOD: Partition pruned (scans only Jan 2024 partitions)
SELECT COUNT(*) FROM gold.events
WHERE year = '2024' AND month = '01';

-- Check if partition pruning is working
EXPLAIN SELECT COUNT(*) FROM gold.events
WHERE year = '2024' AND month = '01';
-- Look for "partition count" in output vs total partitions
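Partition pruning pays off because Athena bills by data scanned, commonly $5 per TB with a 10 MB per-query minimum (verify current pricing for your region). A quick Python estimator makes the difference concrete:

```python
PRICE_PER_TB = 5.00        # assumed $/TB scanned; varies by region
MIN_BYTES = 10 * 1024**2   # Athena bills at least 10 MB per query

def athena_query_cost(bytes_scanned: int) -> float:
    """Estimated cost of one Athena query from bytes scanned."""
    billed = max(bytes_scanned, MIN_BYTES)
    return billed / 1024**4 * PRICE_PER_TB

# Full scan of a 2 TB table vs a partition-pruned 40 GB scan
full_scan = athena_query_cost(2 * 1024**4)
pruned    = athena_query_cost(40 * 1024**3)
print(f"full: ${full_scan:.2f}  pruned: ${pruned:.4f}")
```

At hundreds of dashboard queries per day, that 50x gap dwarfs most infrastructure-side tuning.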

Budget Alerts and Anomaly Detection

resource "aws_budgets_budget" "data_platform" {
  name         = "data-platform-monthly-${var.environment}"
  budget_type  = "COST"
  limit_amount = "50000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  cost_filter {
    name   = "TagKeyValue"
    # format() avoids HCL's $${ escape: "user:environment$${var...}"
    # would be the literal text, not an interpolation.
    values = [format("user:environment$%s", var.environment)]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["data-platform-lead@myorg.com"]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = ["data-platform-lead@myorg.com", "cto@myorg.com"]
  }
}

# AWS Cost Anomaly Detection
resource "aws_ce_anomaly_monitor" "data_platform" {
  name              = "data-platform-anomaly-monitor"
  monitor_type      = "DIMENSIONAL"
  monitor_dimension = "SERVICE"
}

resource "aws_ce_anomaly_subscription" "data_platform" {
  name      = "data-platform-anomaly-alerts"
  frequency = "DAILY"

  monitor_arn_list = [aws_ce_anomaly_monitor.data_platform.arn]

  subscriber {
    address = "data-platform-lead@myorg.com"
    type    = "EMAIL"
  }

  threshold_expression {
    and {
      dimension {
        key           = "ANOMALY_TOTAL_IMPACT_PERCENTAGE"
        match_options = ["GREATER_THAN_OR_EQUAL"]
        values        = ["20"]
      }
    }
    and {
      dimension {
        key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
        match_options = ["GREATER_THAN_OR_EQUAL"]
        values        = ["500"]
      }
    }
  }
}

FinOps Maturity Model for Data Teams

| Level | Characteristics | Actions |
|---|---|---|
| 1 - Crawl | No tags, no budgets, monthly surprise bills | Implement tag schema, set budgets |
| 2 - Walk | Tags on new resources, showback reports | Enforce tags in CI/CD, unit economics |
| 3 - Run | Full chargeback, Spot usage >50%, query optimization | Anomaly detection, per-pipeline cost SLOs |
| 4 - Optimize | Unit economics, automated rightsizing, waste elimination | ML-based forecasting, commitment planning |

Visibility with Harbinger Explorer

Cost allocation only works when you can correlate cloud spend with data platform activity. Harbinger Explorer links pipeline runs, query counts, and data volumes to your cost data—so you can see exactly which pipelines are driving cost and where optimization will have the most impact.

# Quick cost audit: this month's runs of one Glue job, longest first
aws glue get-job-runs --job-name silver-orders-transform \
  --query 'JobRuns[?CompletedOn>=`2024-01-01`].[JobName,ExecutionTime,MaxCapacity]' \
  --output table | sort -k3 -rn | head -10

Summary

Cloud cost allocation for data teams is a discipline, not a one-time project:

  1. Tag everything — enforce in CI/CD with AWS Config rules
  2. Show costs in business units — cost per pipeline run, per TB processed
  3. Use Spot for ETL — 60-80% savings with proper retry logic
  4. Set per-team budgets with anomaly detection
  5. Optimize queries at the source — partition pruning > infrastructure optimization
  6. Build FinOps culture — weekly cost reviews, engineer-owned cost metrics

Try Harbinger Explorer free for 7 days and get instant cost attribution visibility across your cloud data platform—correlate spending with pipeline activity and identify your biggest optimization opportunities.

