Observability for Cloud Data Platforms: The Complete Guide
You can't trust data you can't observe. Yet most data teams invest heavily in pipeline building and almost nothing in pipeline monitoring—until something breaks in production and a VP asks why yesterday's revenue report is wrong.
Observability for data platforms goes beyond "is the job running?" It asks: Is the data fresh? Is the schema intact? Are row counts within expected bounds? Is the join quality degrading silently?
This guide covers how to instrument, monitor, and alert across the full data stack.
The Four Pillars of Data Platform Observability
| Pillar | What It Tells You | Tooling |
|---|---|---|
| Metrics | System behavior over time | Prometheus, CloudWatch, Datadog |
| Logs | Why something happened | ELK, CloudWatch Logs, Loki |
| Traces | Where time is spent | Jaeger, Tempo, X-Ray |
| Data Quality | Is the data correct? | Great Expectations, dbt tests, custom |
Pillar 1: Metrics
Key Metrics for Data Pipelines
Every pipeline should expose these signals:
```yaml
# Prometheus metrics to instrument in your pipeline code
metrics:
  ingestion:
    - records_ingested_total{source, table, environment}
    - ingestion_latency_seconds{source, table}
    - ingestion_errors_total{source, table, error_type}
    - last_successful_run_timestamp{pipeline_id}
  transformation:
    - records_transformed_total{stage, table}
    - records_dropped_total{stage, table, reason}
    - transformation_duration_seconds{pipeline_id, stage}
    - schema_drift_events_total{table, field}
  serving:
    - query_latency_p99_seconds{dataset, query_type}
    - stale_data_seconds{table}  # now - max(updated_at)
    - query_error_rate{dataset}
```
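To make the data model concrete, here is a minimal sketch of rendering such counters in the Prometheus text exposition format. In a real pipeline you would use an official client library (e.g. `prometheus_client`); the helper and sample values below are purely illustrative.

```python
# Minimal sketch: render pipeline counters in Prometheus text
# exposition format (name{label="value",...} value). Illustrative
# only -- use an official client library in production.

def render_metric(name: str, labels: dict, value) -> str:
    """Render one sample as name{k="v",...} value, labels sorted for stability."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

samples = [
    ("records_ingested_total",
     {"source": "s3", "table": "orders", "environment": "prod"}, 142847),
    ("ingestion_errors_total",
     {"source": "s3", "table": "orders", "error_type": "schema_mismatch"}, 16),
]

for name, labels, value in samples:
    print(render_metric(name, labels, value))
```

The label names intentionally mirror the metric list above, so dashboards can slice by source, table, and environment.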
Prometheus + Grafana: Pipeline Dashboard
```yaml
# docker-compose.yml for local observability stack
version: "3.8"
services:
  prometheus:
    image: prom/prometheus:v2.47.0
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - --storage.tsdb.retention.time=90d
      - --web.enable-lifecycle
  grafana:
    image: grafana/grafana:10.1.0
    ports: ["3000:3000"]
    environment:
      GF_AUTH_ANONYMOUS_ENABLED: "true"
      GF_AUTH_ANONYMOUS_ORG_ROLE: Viewer
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
volumes:
  prometheus_data:
  grafana_data:
```
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - "alerts/*.yml"
scrape_configs:
  - job_name: spark_pipelines
    metrics_path: /metrics/prometheus
    static_configs:
      - targets: ['spark-driver:4040']
  - job_name: airflow
    metrics_path: /metrics
    static_configs:
      - targets: ['airflow-webserver:8080']
  - job_name: kafka
    static_configs:
      - targets: ['kafka-broker-1:9308', 'kafka-broker-2:9308']
```
SLOs for Data Pipelines
Define Service Level Objectives around data freshness and availability:
| Pipeline | SLO | Measurement |
|---|---|---|
| Orders Bronze Landing | < 5 min latency from source | now() - max(event_time) |
| Silver Transform | Completes within 30 min of Bronze | Task duration |
| Gold Aggregates | Available by 06:00 UTC daily | Scheduled completion time |
| ML Feature Store | < 1h staleness | now() - max(feature_timestamp) |
| Revenue Dashboard | 99.5% daily availability | Query success rate |
```yaml
# Prometheus alert for data freshness SLO breach
groups:
  - name: data_freshness
    rules:
      - alert: OrdersTableStale
        expr: |
          (time() - data_platform_last_successful_ingestion_timestamp{table="orders"}) > 300
        for: 2m
        labels:
          severity: warning
          team: data-platform
        annotations:
          summary: "Orders table is stale ({{ $value | humanizeDuration }})"
          runbook: "https://wiki.internal/runbooks/stale-orders"
      - alert: SilverTransformSLOBreach
        expr: |
          data_platform_transformation_duration_seconds{pipeline="bronze_to_silver_orders"} > 1800
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Silver transform exceeding 30-min SLO"
```
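The freshness rule reduces to simple arithmetic: staleness is "now minus last successful ingestion", compared against a per-table threshold. A stdlib sketch of that check (function names and the 300-second default are illustrative, mirroring the orders rule):

```python
import time

# Sketch of the freshness check the OrdersTableStale rule performs.
# `now` is injectable so the logic is testable without a live clock.

def staleness_seconds(last_success_ts: float, now=None) -> float:
    """Seconds elapsed since the last successful ingestion."""
    return (now if now is not None else time.time()) - last_success_ts

def is_stale(last_success_ts: float, threshold_s: float = 300, now=None) -> bool:
    """True when staleness exceeds the table's SLO threshold."""
    return staleness_seconds(last_success_ts, now) > threshold_s

# Example: a run 10 minutes ago breaches a 5-minute SLO.
now = 1_700_000_000.0
assert is_stale(now - 600, threshold_s=300, now=now)
assert not is_stale(now - 120, threshold_s=300, now=now)
```

The `for: 2m` clause in the alert adds what this sketch omits: the condition must hold continuously before the alert fires, which suppresses flapping.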
Pillar 2: Structured Logging
Log Schema for Pipeline Events
Unstructured logs are nearly useless at scale. Enforce a schema:
```yaml
# Structured log schema (JSON)
log_event:
  timestamp: "2024-01-15T08:32:11.421Z"
  level: INFO | WARN | ERROR
  pipeline_id: "bronze_to_silver_orders"
  run_id: "run_20240115_083200"
  stage: "read | transform | write | validate"
  table: "silver.orders"
  records_in: 142847
  records_out: 142831
  records_dropped: 16
  drop_reason: "schema_mismatch"
  duration_ms: 28431
  environment: "prod"
  spark_app_id: "app-20240115083200-0001"
  correlation_id: "req-abc123"
```
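A stdlib-only sketch of emitting logs in this shape: a `logging.Formatter` that serializes each record as one JSON object per line, picking up the pipeline fields via `extra=`. The formatter and field list are illustrative, not a prescribed implementation.

```python
import json
import logging

# Pipeline fields from the schema above; anything passed via extra=
# that matches this list is copied into the JSON event.
PIPELINE_FIELDS = ("pipeline_id", "run_id", "stage", "table",
                   "records_in", "records_out", "records_dropped",
                   "drop_reason", "duration_ms", "environment",
                   "correlation_id")

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        event = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for field in PIPELINE_FIELDS:
            if hasattr(record, field):
                event[field] = getattr(record, field)
        return json.dumps(event)

logger = logging.getLogger("pipeline")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("write complete", extra={
    "pipeline_id": "bronze_to_silver_orders",
    "stage": "write",
    "table": "silver.orders",
    "records_in": 142847,
    "records_out": 142831,
})
```

One JSON object per line is what log shippers (Fluent Bit, CloudWatch agent) expect, and it makes the metric filter shown below trivial to write.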
Terraform: CloudWatch Log Group with Retention
```hcl
resource "aws_cloudwatch_log_group" "pipeline_logs" {
  for_each = toset(["bronze-ingestion", "silver-transform", "gold-aggregate", "data-quality"])

  name              = "/data-platform/${var.environment}/${each.key}"
  retention_in_days = 90
  kms_key_id        = aws_kms_key.data_platform.arn

  tags = {
    Pipeline    = each.key
    Environment = var.environment
  }
}

# Log metric filter: extract error count from structured logs
resource "aws_cloudwatch_log_metric_filter" "pipeline_errors" {
  for_each = aws_cloudwatch_log_group.pipeline_logs

  name           = "${each.key}-errors"
  pattern        = "{ $.level = \"ERROR\" }"
  log_group_name = each.value.name

  metric_transformation {
    name          = "PipelineErrors"
    namespace     = "DataPlatform/${var.environment}"
    value         = "1"
    default_value = "0"
    dimensions = {
      Pipeline = each.key
    }
  }
}
```
Pillar 3: Distributed Tracing
OpenTelemetry for Data Pipelines
Distributed tracing lets you follow a record's journey from source to Gold layer:
```yaml
# OpenTelemetry Collector config
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 10s
    send_batch_size: 512
  resource:
    attributes:
      - key: environment
        value: prod
        action: upsert
exporters:
  jaeger:
    endpoint: jaeger-collector:14250
    tls:
      insecure: false
  prometheus:
    endpoint: "0.0.0.0:8889"
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```
Instrumenting a Spark Pipeline with OTel
```bash
# Attach the OTel Java agent via spark-submit.
# Note: repeating --conf spark.driver.extraJavaOptions overwrites earlier
# values, so every -javaagent/-D flag must go in one option string.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-javaagent:/opt/otel-javaagent.jar \
    -Dotel.service.name=etl-silver-transform \
    -Dotel.exporter.otlp.endpoint=http://otel-collector:4317 \
    -Dotel.traces.sampler=parentbased_traceidratio \
    -Dotel.traces.sampler.arg=0.1" \
  --conf "spark.executor.extraJavaOptions=-javaagent:/opt/otel-javaagent.jar" \
  my-etl-job.jar
```
Pillar 4: Data Quality Monitoring
This is the pillar most teams skip—and the one that matters most to data consumers.
The Data Quality Dimensions
The dimensions to monitor: freshness, volume (completeness), schema conformity, and value distribution — each checked at every layer of the platform.
dbt Data Tests
```yaml
# schema.yml
version: 2
models:
  - name: silver_orders
    description: "Cleaned and deduplicated orders"
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('silver_customers')
              field: customer_id
      - name: amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 100000
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'confirmed', 'shipped', 'delivered', 'cancelled']
      - name: created_at
        tests:
          - not_null
          - dbt_utils.recency:
              datepart: hour
              field: created_at
              interval: 6
```
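Conceptually, these dbt tests compile to queries that return the rows violating a predicate. A stdlib sketch of the same idea over in-memory rows — illustrative only, not dbt's actual implementation:

```python
# Sketch: not_null / unique / accepted_values as plain predicates
# over row dicts. dbt compiles each test to SQL that returns the
# failing rows; a non-empty result means the test fails.

def failed_not_null(rows, column):
    return [r for r in rows if r.get(column) is None]

def failed_unique(rows, column):
    seen, dupes = set(), []
    for r in rows:
        v = r.get(column)
        if v in seen:
            dupes.append(r)
        seen.add(v)
    return dupes

def failed_accepted_values(rows, column, values):
    allowed = set(values)
    return [r for r in rows if r.get(column) not in allowed]

rows = [
    {"order_id": 1, "status": "pending"},
    {"order_id": 1, "status": "shipped"},     # duplicate order_id
    {"order_id": 2, "status": "teleported"},  # invalid status
]
assert len(failed_unique(rows, "order_id")) == 1
assert len(failed_accepted_values(
    rows, "status",
    ["pending", "confirmed", "shipped", "delivered", "cancelled"])) == 1
```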
Volume Anomaly Detection
```sql
-- Detect volume anomalies using 30-day historical stats
WITH daily_volumes AS (
    SELECT
        DATE(created_at) AS dt,
        COUNT(*) AS row_count
    FROM silver.orders
    WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY 1
),
stats AS (
    SELECT
        AVG(row_count) AS mean,
        STDDEV(row_count) AS stddev
    FROM daily_volumes
    WHERE dt < CURRENT_DATE  -- exclude today
)
SELECT
    dv.dt,
    dv.row_count,
    s.mean,
    s.stddev,
    ABS(dv.row_count - s.mean) / NULLIF(s.stddev, 0) AS z_score,
    CASE WHEN ABS(dv.row_count - s.mean) / NULLIF(s.stddev, 0) > 3
         THEN 'ANOMALY' ELSE 'OK' END AS status
FROM daily_volumes dv
CROSS JOIN stats s
WHERE dv.dt = CURRENT_DATE;
```
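The same z-score rule, sketched outside the warehouse with the stdlib `statistics` module (note: `pstdev` is the population standard deviation, while SQL `STDDEV` is usually the sample version — at 30 data points the difference is negligible):

```python
import statistics

# Sketch: flag today's row count as an anomaly when it is more than
# z_threshold standard deviations from the historical mean, mirroring
# the SQL above. The sample history values are illustrative.

def volume_status(history, today, z_threshold=3.0):
    mean = statistics.mean(history)
    stddev = statistics.pstdev(history)
    if stddev == 0:
        return "OK"  # no variance in history; z-score is undefined
    z = abs(today - mean) / stddev
    return "ANOMALY" if z > z_threshold else "OK"

history = [142000, 139500, 141200, 143800, 140600, 142900, 141700]
print(volume_status(history, 141000))  # a typical day
print(volume_status(history, 58000))   # pipeline dropped most of the volume
```

The zero-variance guard matters in practice: a brand-new table with identical daily counts would otherwise divide by zero, which is exactly what `NULLIF(s.stddev, 0)` handles in the SQL.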
Alerting Strategy
Alert Routing by Severity
| Severity | Condition | Channel | SLA |
|---|---|---|---|
| P1 - Critical | Data > 4h stale, pipeline down | PagerDuty + Slack | 15 min |
| P2 - High | Schema drift, volume anomaly >3σ | Slack #data-alerts | 1 hour |
| P3 - Medium | Quality test failures, slow queries | Jira ticket | 4 hours |
| P4 - Low | Cost anomaly, performance regression | Email digest | Next business day |
Terraform: CloudWatch Alarm + SNS
```hcl
resource "aws_cloudwatch_metric_alarm" "pipeline_freshness" {
  alarm_name          = "orders-table-freshness-breach"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "DataStalenessSeconds"
  namespace           = "DataPlatform/prod"
  period              = 300
  statistic           = "Maximum"
  threshold           = 14400 # 4 hours
  alarm_description   = "Orders table data is older than 4 hours"
  treat_missing_data  = "breaching"

  alarm_actions = [aws_sns_topic.data_platform_alerts.arn]
  ok_actions    = [aws_sns_topic.data_platform_alerts.arn]

  dimensions = {
    Table = "orders"
  }
}
```
Harbinger Explorer: Observability Purpose-Built for Data
While Prometheus/Grafana covers infrastructure metrics and dbt covers test-level quality, platforms like Harbinger Explorer bridge the gap—providing real-time data lineage, automated anomaly detection, and cross-pipeline SLO tracking without requiring custom instrumentation per pipeline.
Key capabilities for data platform teams:
- Automated lineage: Understand downstream impact before making changes
- Statistical quality baselines: Auto-learns normal volume/distribution ranges
- Schema change notifications: Instant alerts on upstream schema drift
- Pipeline health scorecards: SLO compliance at the dataset level
Quick-Start Observability Checklist
```bash
# Verify the pipeline exposes Prometheus metrics
curl -s http://spark-driver:4040/metrics/prometheus | grep records_processed

# Check freshness of a critical table
psql -c "SELECT MAX(updated_at), NOW() - MAX(updated_at) AS age FROM silver.orders"

# Run dbt tests and persist failing rows for debugging
dbt test --select silver_orders --store-failures

# Check Kafka consumer lag
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --group etl-silver-transform --describe | awk '{print $5, $6}'
```
Summary
Production data platform observability requires all four pillars:
- Metrics — SLOs around freshness, throughput, error rate
- Logs — Structured JSON with correlation IDs and record counts
- Traces — End-to-end lineage with OpenTelemetry
- Data Quality — Freshness, volume, schema, and distribution checks at every layer
Start with freshness metrics and dbt tests—they give the highest signal-to-noise ratio. Add distributed tracing once you have stable pipelines you need to optimize.
Try Harbinger Explorer free for 7 days and bring production-grade observability to your cloud data platform—automated anomaly detection, lineage tracking, and SLO dashboards out of the box.