Observability for Cloud Data Platforms: The Complete Guide

13 min read · Tags: observability, monitoring, data-quality, opentelemetry, sre, data-platform

You can't trust data you can't observe. Yet most data teams invest heavily in pipeline building and almost nothing in pipeline monitoring—until something breaks in production and a VP asks why yesterday's revenue report is wrong.

Observability for data platforms goes beyond "is the job running?" It asks: Is the data fresh? Is the schema intact? Are row counts within expected bounds? Is the join quality degrading silently?

This guide covers how to instrument, monitor, and alert across the full data stack.


The Four Pillars of Data Platform Observability

| Pillar | What It Tells You | Tooling |
| --- | --- | --- |
| Metrics | System behavior over time | Prometheus, CloudWatch, Datadog |
| Logs | Why something happened | ELK, CloudWatch Logs, Loki |
| Traces | Where time is spent | Jaeger, Tempo, X-Ray |
| Data Quality | Is the data correct? | Great Expectations, dbt tests, custom |

Pillar 1: Metrics

Key Metrics for Data Pipelines

Every pipeline should expose these signals:

# Prometheus metrics to instrument in your pipeline code
metrics:
  ingestion:
    - records_ingested_total{source, table, environment}
    - ingestion_latency_seconds{source, table}
    - ingestion_errors_total{source, table, error_type}
    - last_successful_run_timestamp{pipeline_id}

  transformation:
    - records_transformed_total{stage, table}
    - records_dropped_total{stage, table, reason}
    - transformation_duration_seconds{pipeline_id, stage}
    - schema_drift_events_total{table, field}

  serving:
    - query_latency_p99_seconds{dataset, query_type}
    - stale_data_seconds{table}  # now - max(updated_at)
    - query_error_rate{dataset}
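As a concrete sketch of how these counters and gauges might be registered in pipeline code, here is a minimal example using the standard `prometheus_client` Python library. The pipeline and label values are illustrative, not from a real deployment:

```python
# Sketch: registering pipeline metrics with the prometheus_client library.
# Metric and label names mirror the list above; all values are illustrative.
from prometheus_client import CollectorRegistry, Counter, Gauge, generate_latest

registry = CollectorRegistry()

# Counter names get a "_total" suffix in the exposition format automatically.
records_ingested = Counter(
    "records_ingested", "Records ingested per source table",
    ["source", "table", "environment"], registry=registry,
)
last_successful_run = Gauge(
    "last_successful_run_timestamp", "Unix time of the last successful run",
    ["pipeline_id"], registry=registry,
)

def record_batch(source: str, table: str, env: str, n: int) -> None:
    """Call after each successfully ingested batch."""
    records_ingested.labels(source=source, table=table, environment=env).inc(n)

record_batch("kafka", "orders", "prod", 1000)
last_successful_run.labels(pipeline_id="bronze_orders").set_to_current_time()

# generate_latest renders the registry in Prometheus text exposition format;
# a real pipeline would serve this via start_http_server(port) instead.
print(generate_latest(registry).decode())
```

In long-running jobs you would expose these on an HTTP endpoint for Prometheus to scrape; for short-lived batch jobs, pushing to a Pushgateway is the usual pattern.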

Prometheus + Grafana: Pipeline Dashboard

# docker-compose.yml for local observability stack
version: "3.8"
services:
  prometheus:
    image: prom/prometheus:v2.47.0
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - --storage.tsdb.retention.time=90d
      - --web.enable-lifecycle

  grafana:
    image: grafana/grafana:10.1.0
    ports: ["3000:3000"]
    environment:
      GF_AUTH_ANONYMOUS_ENABLED: "true"
      GF_AUTH_ANONYMOUS_ORG_ROLE: Viewer
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards

volumes:
  prometheus_data:
  grafana_data:

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts/*.yml"

scrape_configs:
  - job_name: spark_pipelines
    static_configs:
      - targets: ['spark-driver:4040']
    metrics_path: /metrics/prometheus

  - job_name: airflow
    static_configs:
      - targets: ['airflow-webserver:8080']
    metrics_path: /metrics

  - job_name: kafka
    static_configs:
      - targets: ['kafka-broker-1:9308', 'kafka-broker-2:9308']

SLOs for Data Pipelines

Define Service Level Objectives around data freshness and availability:

| Pipeline | SLO | Measurement |
| --- | --- | --- |
| Orders Bronze Landing | < 5 min latency from source | now() - max(event_time) |
| Silver Transform | Completes within 30 min of Bronze | Task duration |
| Gold Aggregates | Available by 06:00 UTC daily | Scheduled completion time |
| ML Feature Store | < 1 h staleness | now() - max(feature_timestamp) |
| Revenue Dashboard | 99.5% daily availability | Query success rate |

# Prometheus alert for data freshness SLO breach
groups:
  - name: data_freshness
    rules:
      - alert: OrdersTableStale
        expr: |
          (time() - data_platform_last_successful_ingestion_timestamp{table="orders"}) > 300
        for: 2m
        labels:
          severity: warning
          team: data-platform
        annotations:
          summary: "Orders table is stale ({{ $value | humanizeDuration }})"
          runbook: "https://wiki.internal/runbooks/stale-orders"

      - alert: SilverTransformSLOBreach
        expr: |
          data_platform_transformation_duration_seconds{pipeline="bronze_to_silver_orders"} > 1800
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Silver transform exceeding 30-min SLO"

Pillar 2: Structured Logging

Log Schema for Pipeline Events

Unstructured logs are nearly useless at scale. Enforce a schema:

# Structured log schema (JSON)
log_event:
  timestamp: "2024-01-15T08:32:11.421Z"
  level: INFO | WARN | ERROR
  pipeline_id: "bronze_to_silver_orders"
  run_id: "run_20240115_083200"
  stage: "read | transform | write | validate"
  table: "silver.orders"
  records_in: 142847
  records_out: 142831
  records_dropped: 16
  drop_reason: "schema_mismatch"
  duration_ms: 28431
  environment: "prod"
  spark_app_id: "app-20240115083200-0001"
  correlation_id: "req-abc123"
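One way to enforce this schema without extra dependencies is a custom formatter on the standard `logging` module. This is a minimal sketch; the field names follow the schema above, and a production setup would likely use a library such as structlog instead:

```python
# Sketch: emitting structured JSON logs with the stdlib logging module.
# Field names follow the schema above; extra fields arrive via `extra=`.
import json
import logging

PIPELINE_FIELDS = ("pipeline_id", "run_id", "stage", "table",
                   "records_in", "records_out", "records_dropped",
                   "drop_reason", "duration_ms", "correlation_id")

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        event = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Copy only schema fields that were supplied for this event.
        for field in PIPELINE_FIELDS:
            if hasattr(record, field):
                event[field] = getattr(record, field)
        return json.dumps(event)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("pipeline")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("stage complete", extra={
    "pipeline_id": "bronze_to_silver_orders",
    "stage": "transform",
    "records_in": 142847,
    "records_out": 142831,
    "duration_ms": 28431,
})
```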

Terraform: CloudWatch Log Group with Retention

resource "aws_cloudwatch_log_group" "pipeline_logs" {
  for_each          = toset(["bronze-ingestion", "silver-transform", "gold-aggregate", "data-quality"])
  name              = "/data-platform/${var.environment}/${each.key}"
  retention_in_days = 90
  kms_key_id        = aws_kms_key.data_platform.arn

  tags = {
    Pipeline    = each.key
    Environment = var.environment
  }
}

# Log metric filter: extract error count from structured logs
resource "aws_cloudwatch_log_metric_filter" "pipeline_errors" {
  for_each       = aws_cloudwatch_log_group.pipeline_logs
  name           = "${each.key}-errors"
  pattern        = "{ $.level = \"ERROR\" }"
  log_group_name = each.value.name

  metric_transformation {
    name          = "PipelineErrors"
    namespace     = "DataPlatform/${var.environment}"
    value         = "1"
    default_value = "0"
    dimensions = {
      Pipeline = each.key
    }
  }
}
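Once the structured logs land in these log groups, the same error counts can also be explored ad hoc with a CloudWatch Logs Insights query. This is a sketch; field names follow the log schema from the previous section:

```
fields @timestamp, pipeline_id, run_id, drop_reason
| filter level = "ERROR"
| stats count(*) as error_count by pipeline_id, bin(5m)
| sort error_count desc
```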

Pillar 3: Distributed Tracing

OpenTelemetry for Data Pipelines

Distributed tracing lets you follow a record's journey from source to Gold layer:

# OpenTelemetry Collector config
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 512
  resource:
    attributes:
      - key: environment
        value: prod
        action: upsert

exporters:
  jaeger:
    endpoint: jaeger-collector:14250
    tls:
      insecure: false
  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

Instrumenting a Spark Pipeline with OTel

# Add the OTel Java agent to spark-submit. Note: repeating --conf for the
# same key overrides earlier values, so all driver JVM options must go in
# a single extraJavaOptions string.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-javaagent:/opt/otel-javaagent.jar \
    -Dotel.service.name=etl-silver-transform \
    -Dotel.exporter.otlp.endpoint=http://otel-collector:4317 \
    -Dotel.traces.sampler=parentbased_traceidratio \
    -Dotel.traces.sampler.arg=0.1" \
  --conf "spark.executor.extraJavaOptions=-javaagent:/opt/otel-javaagent.jar" \
  my-etl-job.jar

Pillar 4: Data Quality Monitoring

This is the pillar most teams skip—and the one that matters most to data consumers.

The Data Quality Dimensions


dbt Data Tests

# schema.yml
version: 2

models:
  - name: silver_orders
    description: "Cleaned and deduplicated orders"
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('silver_customers')
              field: customer_id
      - name: amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 100000
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'confirmed', 'shipped', 'delivered', 'cancelled']
      - name: created_at
        tests:
          - not_null
          - dbt_utils.recency:
              datepart: hour
              field: created_at
              interval: 6

Volume Anomaly Detection

# SQL: Detect volume anomalies using 7-day rolling stats
WITH daily_volumes AS (
  SELECT
    DATE(created_at) AS dt,
    COUNT(*) AS row_count
  FROM silver.orders
  WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
  GROUP BY 1
),
stats AS (
  SELECT
    AVG(row_count) AS mean,
    STDDEV(row_count) AS stddev
  FROM daily_volumes
  WHERE dt < CURRENT_DATE  -- exclude today
)
SELECT
  dv.dt,
  dv.row_count,
  s.mean,
  s.stddev,
  ABS(dv.row_count - s.mean) / NULLIF(s.stddev, 0) AS z_score,
  CASE WHEN ABS(dv.row_count - s.mean) / NULLIF(s.stddev, 0) > 3
    THEN 'ANOMALY' ELSE 'OK' END AS status
FROM daily_volumes dv
CROSS JOIN stats s
WHERE dv.dt = CURRENT_DATE;
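The same rolling z-score logic can be unit-tested outside the warehouse. Here is a stdlib-only Python sketch of the check above; the row counts are illustrative:

```python
# Sketch: the z-score volume check from the SQL above, in stdlib Python.
# A day is flagged when it sits more than 3 standard deviations from the
# trailing mean (today excluded from the baseline, as in the SQL).
from statistics import mean, stdev

def volume_status(history: list[int], today: int, threshold: float = 3.0) -> str:
    """history: daily row counts excluding today; today: today's count."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return "OK"  # mirrors NULLIF(stddev, 0) guarding division by zero
    z = abs(today - mu) / sigma
    return "ANOMALY" if z > threshold else "OK"

history = [142000, 141500, 143200, 142800, 141900, 142600, 143000]
print(volume_status(history, today=142500))  # → OK
print(volume_status(history, today=40000))   # → ANOMALY (sudden drop)
```

Scheduling this as a daily check (or the SQL version as a dbt test) catches silent upstream outages that never raise a pipeline error.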

Alerting Strategy

Alert Routing by Severity

| Severity | Condition | Channel | SLA |
| --- | --- | --- | --- |
| P1 - Critical | Data > 4 h stale, pipeline down | PagerDuty + Slack | 15 min |
| P2 - High | Schema drift, volume anomaly > 3σ | Slack #data-alerts | 1 hour |
| P3 - Medium | Quality test failures, slow queries | Jira ticket | 4 hours |
| P4 - Low | Cost anomaly, performance regression | Email digest | Next business day |

Terraform: CloudWatch Alarm + SNS

resource "aws_cloudwatch_metric_alarm" "pipeline_freshness" {
  alarm_name          = "orders-table-freshness-breach"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "DataStalenessSeconds"
  namespace           = "DataPlatform/prod"
  period              = 300
  statistic           = "Maximum"
  threshold           = 14400  # 4 hours
  alarm_description   = "Orders table data is older than 4 hours"
  treat_missing_data  = "breaching"

  alarm_actions = [aws_sns_topic.data_platform_alerts.arn]
  ok_actions    = [aws_sns_topic.data_platform_alerts.arn]

  dimensions = {
    Table = "orders"
  }
}

Harbinger Explorer: Observability Purpose-Built for Data

While Prometheus/Grafana covers infrastructure metrics and dbt covers test-level quality, platforms like Harbinger Explorer bridge the gap—providing real-time data lineage, automated anomaly detection, and cross-pipeline SLO tracking without requiring custom instrumentation per pipeline.

Key capabilities for data platform teams:

  • Automated lineage: Understand downstream impact before making changes
  • Statistical quality baselines: Auto-learns normal volume/distribution ranges
  • Schema change notifications: Instant alerts on upstream schema drift
  • Pipeline health scorecards: SLO compliance at the dataset level

Quick-Start Observability Checklist

# Verify the pipeline exposes Prometheus metrics
curl -s http://spark-driver:4040/metrics/prometheus | grep records_ingested

# Check freshness of a critical table
psql -c "SELECT MAX(updated_at), NOW() - MAX(updated_at) AS age FROM silver.orders"

# Run dbt tests
dbt test --select silver_orders --store-failures

# Check Kafka consumer lag
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --group etl-silver-transform --describe | awk '{print $5, $6}'

Summary

Production data platform observability requires all four pillars:

  1. Metrics — SLOs around freshness, throughput, error rate
  2. Logs — Structured JSON with correlation IDs and record counts
  3. Traces — End-to-end lineage with OpenTelemetry
  4. Data Quality — Freshness, volume, schema, and distribution checks at every layer

Start with freshness metrics and dbt tests—they give the highest signal-to-noise ratio. Add distributed tracing once you have stable pipelines you need to optimize.


Try Harbinger Explorer free for 7 days and bring production-grade observability to your cloud data platform—automated anomaly detection, lineage tracking, and SLO dashboards out of the box.

