
Observability for Cloud Data Platforms: The Complete Guide

Everything you need for production-grade observability on cloud data platforms: four pillars (metrics, logs, traces, data quality), OpenTelemetry, alerting, SLOs.

Harbinger Team · April 3, 2026 · 8 min read · Updated May 14, 2026
  • observability
  • monitoring
  • data-quality
  • opentelemetry
  • sre
  • data-platform


TL;DR

  • Data platform observability has four pillars: metrics, logs, traces, and data quality. Most teams skip the last one, and it is the most important.
  • Start with freshness metrics and dbt tests; they have the highest signal-to-noise ratio. Add distributed tracing once your pipelines are stable.
  • Define SLOs for freshness, throughput, and error rate. Route P1 alerts via PagerDuty, P2 to Slack, P3 as Jira tickets.
  • Tools: Prometheus + Grafana for metrics, Loki/CloudWatch for logs, Jaeger for traces, dbt + Great Expectations for quality.

You cannot trust data you cannot observe. Most data teams invest heavily in building pipelines and almost nothing in monitoring them, until something breaks in production and a VP asks why yesterday's revenue report is wrong.

Observability for data platforms goes beyond "is the job running?". It asks: Is the data fresh? Is the schema intact? Are row counts within the expected range? Is join quality silently degrading?

This guide covers how to instrument, monitor, and alert across your full data stack.


The Four Pillars of Data Platform Observability

graph TD
    A[Data Platform Observability] --> B[Metrics]
    A --> C[Logs]
    A --> D[Distributed Traces]
    A --> E[Data Quality]
    B --> B1[Throughput, Latency, Error Rate]
    C --> C1[Structured Pipeline Events]
    D --> D1[End-to-End Job Lineage]
    E --> E1[Freshness, Volume, Schema, Distribution]

Pillar | What it tells you | Tooling
Metrics | System behavior over time | Prometheus, CloudWatch, Datadog
Logs | Why something happened | ELK, CloudWatch Logs, Loki
Traces | Where time is spent | Jaeger, Tempo, X-Ray
Data Quality | Is the data correct? | Great Expectations, dbt tests, custom checks

Pillar 1: Metrics

Key Metrics for Data Pipelines

Every pipeline should expose these signals:

# Prometheus metrics to instrument in your pipeline code
metrics:
  ingestion:
    - records_ingested_total{source, table, environment}
    - ingestion_latency_seconds{source, table}
    - ingestion_errors_total{source, table, error_type}
    - last_successful_run_timestamp{pipeline_id}

  transformation:
    - records_transformed_total{stage, table}
    - records_dropped_total{stage, table, reason}
    - transformation_duration_seconds{pipeline_id, stage}
    - schema_drift_events_total{table, field}

  serving:
    - query_latency_p99_seconds{dataset, query_type}
    - stale_data_seconds{table}  # now - max(updated_at)
    - query_error_rate{dataset}
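
Getting these metrics out of your pipeline code is straightforward with the Prometheus client library. Below is a minimal sketch in Python using prometheus_client; the metric and label names mirror the list above, while read_source(), the pipeline logic, and the scrape port are placeholders for your own setup.

# Sketch: exposing the ingestion metrics from a Python pipeline step via prometheus_client.
# read_source() and the scrape port are placeholders for your own pipeline.
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# prometheus_client appends "_total" to counters, so this is scraped as records_ingested_total
RECORDS_INGESTED = Counter("records_ingested", "Records ingested",
                           ["source", "table", "environment"])
INGESTION_ERRORS = Counter("ingestion_errors", "Ingestion errors by type",
                           ["source", "table", "error_type"])
INGESTION_LATENCY = Histogram("ingestion_latency_seconds", "Ingestion latency",
                              ["source", "table"])
LAST_SUCCESS = Gauge("last_successful_run_timestamp",
                     "Unix timestamp of the last successful run", ["pipeline_id"])


def read_source(source: str, table: str) -> list:
    """Stand-in for the actual source read."""
    return [{"order_id": i} for i in range(random.randint(900, 1100))]


def run_ingestion(source: str, table: str, environment: str) -> None:
    with INGESTION_LATENCY.labels(source, table).time():
        try:
            records = read_source(source, table)
            # ... write to the bronze layer here ...
            RECORDS_INGESTED.labels(source, table, environment).inc(len(records))
            LAST_SUCCESS.labels(pipeline_id=f"{source}_{table}").set(time.time())
        except Exception as exc:
            INGESTION_ERRORS.labels(source, table, type(exc).__name__).inc()
            raise


if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        run_ingestion("shop_db", "orders", "prod")
        time.sleep(60)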

Prometheus + Grafana: Pipeline Dashboard

# docker-compose.yml for a local observability stack
version: "3.8"
services:
  prometheus:
    image: prom/prometheus:v2.47.0
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - --storage.tsdb.retention.time=90d
      - --web.enable-lifecycle

  grafana:
    image: grafana/grafana:10.1.0
    ports: ["3000:3000"]
    environment:
      GF_AUTH_ANONYMOUS_ENABLED: "true"
      GF_AUTH_ANONYMOUS_ORG_ROLE: Viewer
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards

volumes:
  prometheus_data:
  grafana_data:

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts/*.yml"

scrape_configs:
  - job_name: spark_pipelines
    static_configs:
      - targets: ['spark-driver:4040']
    metrics_path: /metrics/prometheus

  - job_name: airflow
    static_configs:
      - targets: ['airflow-webserver:8080']
    metrics_path: /metrics

  - job_name: kafka
    static_configs:
      - targets: ['kafka-broker-1:9308', 'kafka-broker-2:9308']

SLOs for Data Pipelines

Define service-level objectives around data freshness and availability:

Pipeline | SLO | Measurement
Orders Bronze Landing | < 5 min latency from source | now() - max(event_time)
Silver Transform | Completes within 30 min after Bronze | Task duration
Gold Aggregates | Available by 06:00 UTC daily | Scheduled completion time
ML Feature Store | < 1 h staleness | now() - max(feature_timestamp)
Revenue Dashboard | 99.5% daily availability | Query success rate

# Prometheus alert for a data freshness SLO breach
groups:
  - name: data_freshness
    rules:
      - alert: OrdersTableStale
        expr: |
          (time() - data_platform_last_successful_ingestion_timestamp{table="orders"}) > 300
        for: 2m
        labels:
          severity: warning
          team: data-platform
        annotations:
          summary: "Orders table is stale ({{ $value | humanizeDuration }})"
          runbook: "https://wiki.internal/runbooks/stale-orders"

      - alert: SilverTransformSLOBreach
        expr: |
          data_platform_transformation_duration_seconds{pipeline="bronze_to_silver_orders"} > 1800
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Silver transform exceeding 30-min SLO"

Pillar 2: Structured Logging

Log Schema for Pipeline Events

Unstructured logs are nearly useless at scale. Enforce a schema:

# Structured log event schema (emitted as JSON)
log_event:
  timestamp: "2024-01-15T08:32:11.421Z"
  level: INFO | WARN | ERROR
  pipeline_id: "bronze_to_silver_orders"
  run_id: "run_20240115_083200"
  stage: "read | transform | write | validate"
  table: "silver.orders"
  records_in: 142847
  records_out: 142831
  records_dropped: 16
  drop_reason: "schema_mismatch"
  duration_ms: 28431
  environment: "prod"
  spark_app_id: "app-20240115083200-0001"
  correlation_id: "req-abc123"
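
Enforcing the schema is easiest if every pipeline logs through one shared helper. A minimal sketch using Python's standard logging module with a JSON formatter; the field names follow the schema above, the values are illustrative.

# Sketch: structured JSON pipeline logs with the Python stdlib logging module.
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        event = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Fields passed via the `extra` argument end up as attributes on the record
        event.update(getattr(record, "pipeline_event", {}))
        return json.dumps(event)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("data_platform")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("stage completed", extra={"pipeline_event": {
    "pipeline_id": "bronze_to_silver_orders",
    "run_id": "run_20240115_083200",
    "stage": "transform",
    "table": "silver.orders",
    "records_in": 142847,
    "records_out": 142831,
    "records_dropped": 16,
    "drop_reason": "schema_mismatch",
    "duration_ms": 28431,
    "environment": "prod",
    "correlation_id": "req-abc123",
}})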

Terraform: CloudWatch Log Group with Retention

resource "aws_cloudwatch_log_group" "pipeline_logs" {
  for_each          = toset(["bronze-ingestion", "silver-transform", "gold-aggregate", "data-quality"])
  name              = "/data-platform/${var.environment}/${each.key}"
  retention_in_days = 90
  kms_key_id        = aws_kms_key.data_platform.arn

  tags = {
    Pipeline    = each.key
    Environment = var.environment
  }
}

# Log metric filter: extract error counts from structured logs
resource "aws_cloudwatch_log_metric_filter" "pipeline_errors" {
  for_each       = aws_cloudwatch_log_group.pipeline_logs
  name           = "${each.key}-errors"
  pattern        = "{ $.level = \"ERROR\" }"
  log_group_name = each.value.name

  metric_transformation {
    name          = "PipelineErrors"
    namespace     = "DataPlatform/${var.environment}"
    value         = "1"
    default_value = "0"
    dimensions = {
      Pipeline = each.key
    }
  }
}

Pillar 3: Distributed Tracing

OpenTelemetry for Data Pipelines

Distributed tracing lets you follow a record's journey from source to the gold layer:

# OpenTelemetry Collector config
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 512
  resource:
    attributes:
      - key: environment
        value: prod
        action: upsert

exporters:
  jaeger:
    endpoint: jaeger-collector:14250
    tls:
      insecure: false
  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

Instrumenting a Spark Pipeline with OTel

# Add the OTel Java agent to spark-submit.
# Note: all driver JVM options must go into a single extraJavaOptions value;
# repeated --conf entries for the same key override each other.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-javaagent:/opt/otel-javaagent.jar -Dotel.service.name=etl-silver-transform -Dotel.exporter.otlp.endpoint=http://otel-collector:4317 -Dotel.traces.sampler=parentbased_traceidratio -Dotel.traces.sampler.arg=0.1" \
  --conf "spark.executor.extraJavaOptions=-javaagent:/opt/otel-javaagent.jar" \
  my-etl-job.jar
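
For pipeline steps that run outside the JVM (orchestration tasks, validation scripts), spans can be created manually. A sketch with the OpenTelemetry Python SDK, assuming the opentelemetry-sdk and opentelemetry-exporter-otlp packages and the same collector endpoint as above; span and attribute names are illustrative.

# Sketch: manual OTel spans for a Python pipeline step, exported to the same collector.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(
    resource=Resource.create({"service.name": "etl-silver-transform-validate"})
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("silver_orders_run") as run_span:
    run_span.set_attribute("pipeline.id", "bronze_to_silver_orders")

    with tracer.start_as_current_span("read") as span:
        span.set_attribute("table", "bronze.orders")
        # ... read from the bronze layer ...

    with tracer.start_as_current_span("validate") as span:
        span.set_attribute("records.dropped", 16)
        # ... run quality checks ...

provider.shutdown()  # flush remaining spans before the process exits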

Pillar 4: Data Quality Monitoring

The pillar most teams skip, and the one that matters most to data consumers.

The Data Quality Dimensions

graph LR
    A[Data Quality] --> B[Freshness<br/>How recent is it?]
    A --> C[Volume<br/>Are row counts normal?]
    A --> D[Schema<br/>Did structure change?]
    A --> E[Distribution<br/>Are values in expected range?]
    A --> F[Referential Integrity<br/>Do FKs resolve?]
    A --> G[Uniqueness<br/>Are there unexpected duplicates?]

dbt Data Tests

# schema.yml
version: 2

models:
  - name: silver_orders
    description: "Cleaned and deduplicated orders"
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('silver_customers')
              field: customer_id
      - name: amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 100000
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'confirmed', 'shipped', 'delivered', 'cancelled']
      - name: created_at
        tests:
          - not_null
          - dbt_utils.recency:
              datepart: hour
              field: created_at
              interval: 6
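
These tests only pay off if failures reach the same alerting paths as the other pillars. A small sketch that parses the run_results.json dbt writes to target/ after dbt test (assuming the result layout of recent dbt-core versions) and fails the orchestration task when any test failed:

# Sketch: fail the orchestration task on dbt test failures by parsing target/run_results.json.
import json
import sys
from pathlib import Path

run_results = json.loads(Path("target/run_results.json").read_text())

failed = [
    r["unique_id"]
    for r in run_results["results"]
    if r["status"] in ("fail", "error")
]

for test_id in failed:
    print(f"FAILED: {test_id}", file=sys.stderr)

sys.exit(1 if failed else 0)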

Volume Anomaly Detection

-- SQL: detect volume anomalies against a 30-day rolling baseline
WITH daily_volumes AS (
  SELECT
    DATE(created_at) AS dt,
    COUNT(*) AS row_count
  FROM silver.orders
  WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
  GROUP BY 1
),
stats AS (
  SELECT
    AVG(row_count) AS mean,
    STDDEV(row_count) AS stddev
  FROM daily_volumes
  WHERE dt < CURRENT_DATE  -- exclude today
)
SELECT
  dv.dt,
  dv.row_count,
  s.mean,
  s.stddev,
  ABS(dv.row_count - s.mean) / NULLIF(s.stddev, 0) AS z_score,
  CASE WHEN ABS(dv.row_count - s.mean) / NULLIF(s.stddev, 0) > 3
    THEN 'ANOMALY' ELSE 'OK' END AS status
FROM daily_volumes dv
CROSS JOIN stats s
WHERE dv.dt = CURRENT_DATE;
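
To make that check alertable, push the z-score into the metrics pillar after each load, for example via a Prometheus Pushgateway. A sketch assuming a Postgres-compatible warehouse reachable with psycopg2 and a Pushgateway at pushgateway:9091 (both placeholders):

# Sketch: push the daily volume z-score to a Prometheus Pushgateway after each load.
import psycopg2
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

VOLUME_Z_SCORE_SQL = """
WITH daily_volumes AS (
  SELECT DATE(created_at) AS dt, COUNT(*) AS row_count
  FROM silver.orders
  WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
  GROUP BY 1
),
stats AS (
  SELECT AVG(row_count) AS mean, STDDEV(row_count) AS stddev
  FROM daily_volumes WHERE dt < CURRENT_DATE
)
SELECT ABS(dv.row_count - s.mean) / NULLIF(s.stddev, 0)
FROM daily_volumes dv CROSS JOIN stats s
WHERE dv.dt = CURRENT_DATE;
"""

registry = CollectorRegistry()
z_score = Gauge("data_platform_volume_z_score",
                "Daily row-count z-score vs. 30-day baseline",
                ["table"], registry=registry)

with psycopg2.connect("dbname=warehouse host=warehouse-host") as conn:
    with conn.cursor() as cur:
        cur.execute(VOLUME_Z_SCORE_SQL)
        row = cur.fetchone()

z_score.labels(table="silver.orders").set(
    float(row[0]) if row and row[0] is not None else 0.0
)
push_to_gateway("pushgateway:9091", job="volume_anomaly_check", registry=registry)

A Prometheus rule on data_platform_volume_z_score > 3 then maps directly to the P2 routing in the alerting section below.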

Alerting Strategy

Alert Routing by Severity

Severity | Condition | Channel | SLA
P1 - Critical | Data > 4 h stale, pipeline down | PagerDuty + Slack | 15 min
P2 - High | Schema drift, volume anomaly > 3σ | Slack #data-alerts | 1 hour
P3 - Medium | Quality test failures, slow queries | Jira ticket | 4 hours
P4 - Low | Cost anomaly, performance regression | Email digest | Next business day

Terraform: CloudWatch Alarm + SNS

resource "aws_cloudwatch_metric_alarm" "pipeline_freshness" {
  alarm_name          = "orders-table-freshness-breach"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "DataStalenessSeconds"
  namespace           = "DataPlatform/prod"
  period              = 300
  statistic           = "Maximum"
  threshold           = 14400  # 4 hours
  alarm_description   = "Orders table data is older than 4 hours"
  treat_missing_data  = "breaching"

  alarm_actions = [aws_sns_topic.data_platform_alerts.arn]
  ok_actions    = [aws_sns_topic.data_platform_alerts.arn]

  dimensions = {
    Table = "orders"
  }
}

Harbinger Explorer: Purpose-Built Observability for Data

While Prometheus/Grafana covers infrastructure metrics and dbt covers test-level quality, platforms like Harbinger Explorer bridge the gap: real-time data lineage, automatic anomaly detection, and cross-pipeline SLO tracking without custom instrumentation per pipeline.

Key capabilities for data platform teams:

  • Automatic lineage: understand downstream impact before you change anything
  • Statistical quality baselines: automatically learned normal volume and distribution ranges
  • Schema change notifications: instant alerts on upstream schema drift
  • Pipeline health scorecards: SLO compliance at the dataset level

Quick-Start Observability Checklist

# Verify the pipeline exposes Prometheus metrics
curl -s http://spark-driver:4040/metrics/prometheus | grep records_processed

# Check the freshness of a critical table
psql -c "SELECT MAX(updated_at), NOW() - MAX(updated_at) AS age FROM silver.orders"

# Run dbt tests
dbt test --select silver_orders --store-failures

# Check Kafka consumer lag
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --group etl-silver-transform --describe | awk '{print $5, $6}'

FAQ

Where do I start with observability if I have nothing yet? With freshness metrics. One metric per critical table that measures how old the newest record is. Set a 4-hour threshold and alert on it. This has the highest signal-to-noise ratio of anything.

Prometheus + Grafana or Datadog? Self-hosted Prometheus/Grafana: roughly 200-500 EUR per month in infrastructure for teams under 50 people. Datadog: 15-23 USD per host per month, which can quickly reach four figures per month. Pick Datadog if you have no ops capacity, otherwise Prometheus.

Do I really need distributed tracing? Only once your pipelines are complex enough that you can no longer work out why a particular pipeline run was slow. Below roughly 20 pipelines, good logs plus metrics are usually enough.

How do I write SLOs the team won't ignore? SLOs must be achievable and reflect business value. Start at 99% (roughly 7 hours of error budget per 30-day month), not 99.9%. Measure what stakeholders actually care about (dashboard available, data fresh), not what is easy to measure.


Summary

Production data platform observability needs all four pillars:

  1. Metrics: SLOs around freshness, throughput, and error rate
  2. Logs: structured JSON with correlation IDs and record counts
  3. Traces: end-to-end lineage with OpenTelemetry
  4. Data quality: freshness, volume, schema, and distribution checks at every layer

Start with freshness metrics and dbt tests; they give you the highest signal-to-noise ratio. Add distributed tracing once you have stable pipelines that you need to optimize.


Try Harbinger Explorer free for 7 days and bring production-grade observability to your cloud data platform: automatic anomaly detection, lineage tracking, and SLO dashboards out of the box.

Last updated: May 14, 2026.


Written by

Harbinger Team

Cloud, data, and AI engineers in the DACH region. Writing about infrastructure-critical tech decisions since 2018: no marketing slides, just real trade-offs from production workloads.
