Observability for Cloud Data Platforms: The Complete Guide
You can't trust data you can't observe. Yet most data teams invest heavily in pipeline building and almost nothing in pipeline monitoring—until something breaks in production and a VP asks why yesterday's revenue report is wrong.
Observability for data platforms goes beyond "is the job running?" It asks: Is the data fresh? Is the schema intact? Are row counts within expected bounds? Is the join quality degrading silently?
This guide covers how to instrument, monitor, and alert across the full data stack.
The Four Pillars of Data Platform Observability
| Pillar | What It Tells You | Tooling |
|---|---|---|
| Metrics | System behavior over time | Prometheus, CloudWatch, Datadog |
| Logs | Why something happened | ELK, CloudWatch Logs, Loki |
| Traces | Where time is spent | Jaeger, Tempo, X-Ray |
| Data Quality | Is the data correct? | Great Expectations, dbt tests, custom |
Pillar 1: Metrics
Key Metrics for Data Pipelines
Every pipeline should expose these signals:
```yaml
# Prometheus metrics to instrument in your pipeline code
metrics:
  ingestion:
    - records_ingested_total{source, table, environment}
    - ingestion_latency_seconds{source, table}
    - ingestion_errors_total{source, table, error_type}
    - last_successful_run_timestamp{pipeline_id}
  transformation:
    - records_transformed_total{stage, table}
    - records_dropped_total{stage, table, reason}
    - transformation_duration_seconds{pipeline_id, stage}
    - schema_drift_events_total{table, field}
  serving:
    - query_latency_p99_seconds{dataset, query_type}
    - stale_data_seconds{table}  # now - max(updated_at)
    - query_error_rate{dataset}
```
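To make the data model concrete, here is a minimal sketch of rendering such counters in the Prometheus text exposition format. In a real pipeline you would use an official client library (e.g. `prometheus_client`); the helper and sample values below are purely illustrative.

```python
# Minimal sketch: render pipeline counters in Prometheus text
# exposition format (name{label="value",...} value). Illustrative
# only -- use an official client library in production.

def render_metric(name: str, labels: dict, value) -> str:
    """Render one sample as name{k="v",...} value, labels sorted for stability."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

samples = [
    ("records_ingested_total",
     {"source": "s3", "table": "orders", "environment": "prod"}, 142847),
    ("ingestion_errors_total",
     {"source": "s3", "table": "orders", "error_type": "schema_mismatch"}, 16),
]

for name, labels, value in samples:
    print(render_metric(name, labels, value))
```

The label names intentionally mirror the metric list above, so dashboards can slice by source, table, and environment.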
Prometheus + Grafana: Pipeline Dashboard
```yaml
# docker-compose.yml for local observability stack
version: "3.8"
services:
  prometheus:
    image: prom/prometheus:v2.47.0
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - --storage.tsdb.retention.time=90d
      - --web.enable-lifecycle
  grafana:
    image: grafana/grafana:10.1.0
    ports: ["3000:3000"]
    environment:
      GF_AUTH_ANONYMOUS_ENABLED: "true"
      GF_AUTH_ANONYMOUS_ORG_ROLE: Viewer
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
volumes:
  prometheus_data:
  grafana_data:
```
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - "alerts/*.yml"
scrape_configs:
  - job_name: spark_pipelines
    metrics_path: /metrics/prometheus
    static_configs:
      - targets: ['spark-driver:4040']
  - job_name: airflow
    metrics_path: /metrics
    static_configs:
      - targets: ['airflow-webserver:8080']
  - job_name: kafka
    static_configs:
      - targets: ['kafka-broker-1:9308', 'kafka-broker-2:9308']
```
SLOs for Data Pipelines
Define Service Level Objectives around data freshness and availability:
| Pipeline | SLO | Measurement |
|---|---|---|
| Orders Bronze Landing | < 5 min latency from source | now() - max(event_time) |
| Silver Transform | Completes within 30 min of Bronze | Task duration |
| Gold Aggregates | Available by 06:00 UTC daily | Scheduled completion time |
| ML Feature Store | < 1h staleness | now() - max(feature_timestamp) |
| Revenue Dashboard | 99.5% daily availability | Query success rate |
```yaml
# Prometheus alert for data freshness SLO breach
groups:
  - name: data_freshness
    rules:
      - alert: OrdersTableStale
        expr: |
          (time() - data_platform_last_successful_ingestion_timestamp{table="orders"}) > 300
        for: 2m
        labels:
          severity: warning
          team: data-platform
        annotations:
          summary: "Orders table is stale ({{ $value | humanizeDuration }})"
          runbook: "https://wiki.internal/runbooks/stale-orders"
      - alert: SilverTransformSLOBreach
        expr: |
          data_platform_transformation_duration_seconds{pipeline="bronze_to_silver_orders"} > 1800
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Silver transform exceeding 30-min SLO"
```
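The freshness rule reduces to simple arithmetic: staleness is "now minus last successful ingestion", compared against a per-table threshold. A stdlib sketch of that check (function names and the 300-second default are illustrative, mirroring the orders rule):

```python
import time

# Sketch of the freshness check the OrdersTableStale rule performs.
# `now` is injectable so the logic is testable without a live clock.

def staleness_seconds(last_success_ts: float, now=None) -> float:
    """Seconds elapsed since the last successful ingestion."""
    return (now if now is not None else time.time()) - last_success_ts

def is_stale(last_success_ts: float, threshold_s: float = 300, now=None) -> bool:
    """True when staleness exceeds the table's SLO threshold."""
    return staleness_seconds(last_success_ts, now) > threshold_s

# Example: a run 10 minutes ago breaches a 5-minute SLO.
now = 1_700_000_000.0
assert is_stale(now - 600, threshold_s=300, now=now)
assert not is_stale(now - 120, threshold_s=300, now=now)
```

The `for: 2m` clause in the alert adds what this sketch omits: the condition must hold continuously before the alert fires, which suppresses flapping.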
Pillar 2: Structured Logging
Log Schema for Pipeline Events
Unstructured logs are nearly useless at scale. Enforce a schema:
```yaml
# Structured log schema (JSON)
log_event:
  timestamp: "2024-01-15T08:32:11.421Z"
  level: INFO | WARN | ERROR
  pipeline_id: "bronze_to_silver_orders"
  run_id: "run_20240115_083200"
  stage: "read | transform | write | validate"
  table: "silver.orders"
  records_in: 142847
  records_out: 142831
  records_dropped: 16
  drop_reason: "schema_mismatch"
  duration_ms: 28431
  environment: "prod"
  spark_app_id: "app-20240115083200-0001"
  correlation_id: "req-abc123"
```
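A stdlib-only sketch of emitting logs in this shape: a `logging.Formatter` that serializes each record as one JSON object per line, picking up the pipeline fields via `extra=`. The formatter and field list are illustrative, not a prescribed implementation.

```python
import json
import logging

# Pipeline fields from the schema above; anything passed via extra=
# that matches this list is copied into the JSON event.
PIPELINE_FIELDS = ("pipeline_id", "run_id", "stage", "table",
                   "records_in", "records_out", "records_dropped",
                   "drop_reason", "duration_ms", "environment",
                   "correlation_id")

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        event = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for field in PIPELINE_FIELDS:
            if hasattr(record, field):
                event[field] = getattr(record, field)
        return json.dumps(event)

logger = logging.getLogger("pipeline")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("write complete", extra={
    "pipeline_id": "bronze_to_silver_orders",
    "stage": "write",
    "table": "silver.orders",
    "records_in": 142847,
    "records_out": 142831,
})
```

One JSON object per line is what log shippers (Fluent Bit, CloudWatch agent) expect, and it makes the metric filter shown below trivial to write.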
Terraform: CloudWatch Log Group with Retention
```hcl
resource "aws_cloudwatch_log_group" "pipeline_logs" {
  for_each = toset(["bronze-ingestion", "silver-transform", "gold-aggregate", "data-quality"])

  name              = "/data-platform/${var.environment}/${each.key}"
  retention_in_days = 90
  kms_key_id        = aws_kms_key.data_platform.arn

  tags = {
    Pipeline    = each.key
    Environment = var.environment
  }
}

# Log metric filter: extract error count from structured logs
resource "aws_cloudwatch_log_metric_filter" "pipeline_errors" {
  for_each = aws_cloudwatch_log_group.pipeline_logs

  name           = "${each.key}-errors"
  pattern        = "{ $.level = \"ERROR\" }"
  log_group_name = each.value.name

  metric_transformation {
    name          = "PipelineErrors"
    namespace     = "DataPlatform/${var.environment}"
    value         = "1"
    default_value = "0"
    dimensions = {
      Pipeline = each.key
    }
  }
}
```
Pillar 3: Distributed Tracing
OpenTelemetry for Data Pipelines
Distributed tracing lets you follow a record's journey from source to Gold layer:
```yaml
# OpenTelemetry Collector config
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 10s
    send_batch_size: 512
  resource:
    attributes:
      - key: environment
        value: prod
        action: upsert
exporters:
  jaeger:
    endpoint: jaeger-collector:14250
    tls:
      insecure: false
  prometheus:
    endpoint: "0.0.0.0:8889"
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```
Instrumenting a Spark Pipeline with OTel
```bash
# Attach the OTel Java agent via spark-submit.
# Note: repeating --conf spark.driver.extraJavaOptions overwrites earlier
# values, so every -javaagent/-D flag must go in one option string.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-javaagent:/opt/otel-javaagent.jar \
    -Dotel.service.name=etl-silver-transform \
    -Dotel.exporter.otlp.endpoint=http://otel-collector:4317 \
    -Dotel.traces.sampler=parentbased_traceidratio \
    -Dotel.traces.sampler.arg=0.1" \
  --conf "spark.executor.extraJavaOptions=-javaagent:/opt/otel-javaagent.jar" \
  my-etl-job.jar
```
Pillar 4: Data Quality Monitoring
This is the pillar most teams skip—and the one that matters most to data consumers.
The Data Quality Dimensions
The dimensions to monitor: freshness, volume (completeness), schema conformity, and value distribution — each checked at every layer of the platform.
dbt Data Tests
```yaml
# schema.yml
version: 2
models:
  - name: silver_orders
    description: "Cleaned and deduplicated orders"
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('silver_customers')
              field: customer_id
      - name: amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 100000
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'confirmed', 'shipped', 'delivered', 'cancelled']
      - name: created_at
        tests:
          - not_null
          - dbt_utils.recency:
              datepart: hour
              field: created_at
              interval: 6
```
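Conceptually, these dbt tests compile to queries that return the rows violating a predicate. A stdlib sketch of the same idea over in-memory rows — illustrative only, not dbt's actual implementation:

```python
# Sketch: not_null / unique / accepted_values as plain predicates
# over row dicts. dbt compiles each test to SQL that returns the
# failing rows; a non-empty result means the test fails.

def failed_not_null(rows, column):
    return [r for r in rows if r.get(column) is None]

def failed_unique(rows, column):
    seen, dupes = set(), []
    for r in rows:
        v = r.get(column)
        if v in seen:
            dupes.append(r)
        seen.add(v)
    return dupes

def failed_accepted_values(rows, column, values):
    allowed = set(values)
    return [r for r in rows if r.get(column) not in allowed]

rows = [
    {"order_id": 1, "status": "pending"},
    {"order_id": 1, "status": "shipped"},     # duplicate order_id
    {"order_id": 2, "status": "teleported"},  # invalid status
]
assert len(failed_unique(rows, "order_id")) == 1
assert len(failed_accepted_values(
    rows, "status",
    ["pending", "confirmed", "shipped", "delivered", "cancelled"])) == 1
```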
Volume Anomaly Detection
```sql
-- Detect volume anomalies using 30-day historical stats
WITH daily_volumes AS (
    SELECT
        DATE(created_at) AS dt,
        COUNT(*) AS row_count
    FROM silver.orders
    WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY 1
),
stats AS (
    SELECT
        AVG(row_count) AS mean,
        STDDEV(row_count) AS stddev
    FROM daily_volumes
    WHERE dt < CURRENT_DATE  -- exclude today
)
SELECT
    dv.dt,
    dv.row_count,
    s.mean,
    s.stddev,
    ABS(dv.row_count - s.mean) / NULLIF(s.stddev, 0) AS z_score,
    CASE WHEN ABS(dv.row_count - s.mean) / NULLIF(s.stddev, 0) > 3
         THEN 'ANOMALY' ELSE 'OK' END AS status
FROM daily_volumes dv
CROSS JOIN stats s
WHERE dv.dt = CURRENT_DATE;
```
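The same z-score rule, sketched outside the warehouse with the stdlib `statistics` module (note: `pstdev` is the population standard deviation, while SQL `STDDEV` is usually the sample version — at 30 data points the difference is negligible):

```python
import statistics

# Sketch: flag today's row count as an anomaly when it is more than
# z_threshold standard deviations from the historical mean, mirroring
# the SQL above. The sample history values are illustrative.

def volume_status(history, today, z_threshold=3.0):
    mean = statistics.mean(history)
    stddev = statistics.pstdev(history)
    if stddev == 0:
        return "OK"  # no variance in history; z-score is undefined
    z = abs(today - mean) / stddev
    return "ANOMALY" if z > z_threshold else "OK"

history = [142000, 139500, 141200, 143800, 140600, 142900, 141700]
print(volume_status(history, 141000))  # a typical day
print(volume_status(history, 58000))   # pipeline dropped most of the volume
```

The zero-variance guard matters in practice: a brand-new table with identical daily counts would otherwise divide by zero, which is exactly what `NULLIF(s.stddev, 0)` handles in the SQL.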
Alerting Strategy
Alert Routing by Severity
| Severity | Condition | Channel | SLA |
|---|---|---|---|
| P1 - Critical | Data > 4h stale, pipeline down | PagerDuty + Slack | 15 min |
| P2 - High | Schema drift, volume anomaly >3σ | Slack #data-alerts | 1 hour |
| P3 - Medium | Quality test failures, slow queries | Jira ticket | 4 hours |
| P4 - Low | Cost anomaly, performance regression | Email digest | Next business day |
Terraform: CloudWatch Alarm + SNS
```hcl
resource "aws_cloudwatch_metric_alarm" "pipeline_freshness" {
  alarm_name          = "orders-table-freshness-breach"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "DataStalenessSeconds"
  namespace           = "DataPlatform/prod"
  period              = 300
  statistic           = "Maximum"
  threshold           = 14400 # 4 hours
  alarm_description   = "Orders table data is older than 4 hours"
  treat_missing_data  = "breaching"

  alarm_actions = [aws_sns_topic.data_platform_alerts.arn]
  ok_actions    = [aws_sns_topic.data_platform_alerts.arn]

  dimensions = {
    Table = "orders"
  }
}
```
Harbinger Explorer: Observability Purpose-Built for Data
While Prometheus/Grafana covers infrastructure metrics and dbt covers test-level quality, platforms like Harbinger Explorer bridge the gap—providing real-time data lineage, automated anomaly detection, and cross-pipeline SLO tracking without requiring custom instrumentation per pipeline.
Key capabilities for data platform teams:
- Automated lineage: Understand downstream impact before making changes
- Statistical quality baselines: Auto-learns normal volume/distribution ranges
- Schema change notifications: Instant alerts on upstream schema drift
- Pipeline health scorecards: SLO compliance at the dataset level
Quick-Start Observability Checklist
```bash
# Verify the pipeline exposes Prometheus metrics
curl -s http://spark-driver:4040/metrics/prometheus | grep records_processed

# Check freshness of a critical table
psql -c "SELECT MAX(updated_at), NOW() - MAX(updated_at) AS age FROM silver.orders"

# Run dbt tests and persist failing rows for debugging
dbt test --select silver_orders --store-failures

# Check Kafka consumer lag
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --group etl-silver-transform --describe | awk '{print $5, $6}'
```
Summary
Production data platform observability requires all four pillars:
- Metrics — SLOs around freshness, throughput, error rate
- Logs — Structured JSON with correlation IDs and record counts
- Traces — End-to-end lineage with OpenTelemetry
- Data Quality — Freshness, volume, schema, and distribution checks at every layer
Start with freshness metrics and dbt tests—they give the highest signal-to-noise ratio. Add distributed tracing once you have stable pipelines you need to optimize.
Try Harbinger Explorer free for 7 days and bring production-grade observability to your cloud data platform—automated anomaly detection, lineage tracking, and SLO dashboards out of the box.