Table of Contents
- TL;DR
- The Four Pillars of Data Platform Observability
- Pillar 1: Metrics
- Key Metrics for Data Pipelines
- Prometheus + Grafana: Pipeline Dashboard
- SLOs for Data Pipelines
- Pillar 2: Structured Logging
- Log Schema for Pipeline Events
- Terraform: CloudWatch Log Group with Retention
- Pillar 3: Distributed Tracing
- OpenTelemetry for Data Pipelines
- Instrumenting a Spark Pipeline with OTel
- Pillar 4: Data Quality Monitoring
- The Data Quality Dimensions
- dbt Data Tests
- Volume Anomaly Detection
- Alerting Strategy
- Alert Routing by Severity
- Terraform: CloudWatch Alarm + SNS
- Harbinger Explorer: Purpose-Built Observability for Data
- Quick-Start Observability Checklist
- FAQ
- Summary
Observability for Cloud Data Platforms: The Complete Guide
TL;DR
- Data platform observability has four pillars: metrics, logs, traces, and data quality. Most teams skip the last one, and it is the most important.
- Start with freshness metrics and dbt tests; they have the highest signal-to-noise ratio. Add distributed tracing once your pipelines are stable.
- Define SLOs for freshness, throughput, and error rate. Route P1 alerts to PagerDuty, P2 to Slack, P3 as Jira tickets.
- Tools: Prometheus + Grafana for metrics, Loki/CloudWatch for logs, Jaeger for traces, dbt + Great Expectations for quality.
You cannot trust data you cannot observe. Most data teams invest heavily in building pipelines and almost nothing in monitoring them, until something breaks in production and a VP asks why yesterday's revenue report is wrong.
Observability for data platforms goes beyond "is the job running?". It asks: Is the data fresh? Is the schema intact? Are row counts in the expected range? Is join quality silently degrading?
This guide covers how to instrument, monitor, and alert across your full data stack.
The Four Pillars of Data Platform Observability
graph TD
A[Data Platform Observability] --> B[Metrics]
A --> C[Logs]
A --> D[Distributed Traces]
A --> E[Data Quality]
B --> B1[Throughput, Latency, Error Rate]
C --> C1[Structured Pipeline Events]
D --> D1[End-to-End Job Lineage]
E --> E1[Freshness, Volume, Schema, Distribution]
| Pillar | What it tells you | Tooling |
|---|---|---|
| Metrics | System behavior over time | Prometheus, CloudWatch, Datadog |
| Logs | Why something happened | ELK, CloudWatch Logs, Loki |
| Traces | Where time is spent | Jaeger, Tempo, X-Ray |
| Data quality | Whether the data is correct | Great Expectations, dbt tests, custom checks |
Pillar 1: Metrics
Key Metrics for Data Pipelines
Every pipeline should expose these signals:
# Prometheus metrics to instrument in your pipeline code
metrics:
ingestion:
- records_ingested_total{source, table, environment}
- ingestion_latency_seconds{source, table}
- ingestion_errors_total{source, table, error_type}
- last_successful_run_timestamp{pipeline_id}
transformation:
- records_transformed_total{stage, table}
- records_dropped_total{stage, table, reason}
- transformation_duration_seconds{pipeline_id, stage}
- schema_drift_events_total{table, field}
serving:
- query_latency_p99_seconds{dataset, query_type}
- stale_data_seconds{table} # now - max(updated_at)
- query_error_rate{dataset}
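How these names translate into pipeline code: a minimal sketch with the Python prometheus_client library, assuming a batch job that pushes its metrics to a Pushgateway at pushgateway:9091. The load_from_source helper and the pipeline_id value are illustrative, not part of any specific framework.
# Minimal sketch: instrumenting an ingestion job with prometheus_client.
import time
from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram, push_to_gateway

registry = CollectorRegistry()

records_ingested = Counter(
    "records_ingested_total", "Records ingested per source table",
    ["source", "table", "environment"], registry=registry)
ingestion_latency = Histogram(
    "ingestion_latency_seconds", "Wall-clock time of one ingestion run",
    ["source", "table"], registry=registry)
ingestion_errors = Counter(
    "ingestion_errors_total", "Ingestion failures by error type",
    ["source", "table", "error_type"], registry=registry)
last_success = Gauge(
    "last_successful_run_timestamp", "Unix timestamp of the last successful run",
    ["pipeline_id"], registry=registry)

def run_ingestion(source: str, table: str) -> None:
    start = time.time()
    try:
        rows = load_from_source(source, table)   # hypothetical loader
        records_ingested.labels(source, table, "prod").inc(len(rows))
        last_success.labels("bronze_orders_ingest").set_to_current_time()
    except Exception as exc:
        ingestion_errors.labels(source, table, type(exc).__name__).inc()
        raise
    finally:
        ingestion_latency.labels(source, table).observe(time.time() - start)
        # Batch jobs push to a Pushgateway; long-running services would
        # expose /metrics via start_http_server() instead.
        push_to_gateway("pushgateway:9091", job="bronze_orders_ingest", registry=registry)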
Prometheus + Grafana: Pipeline Dashboard
# docker-compose.yml for a local observability stack
version: "3.8"
services:
prometheus:
image: prom/prometheus:v2.47.0
ports: ["9090:9090"]
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus_data:/prometheus
command:
- --storage.tsdb.retention.time=90d
- --web.enable-lifecycle
grafana:
image: grafana/grafana:10.1.0
ports: ["3000:3000"]
environment:
GF_AUTH_ANONYMOUS_ENABLED: "true"
GF_AUTH_ANONYMOUS_ORG_ROLE: Viewer
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/dashboards:/etc/grafana/provisioning/dashboards
volumes:
prometheus_data:
grafana_data:
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
rule_files:
- "alerts/*.yml"
scrape_configs:
- job_name: spark_pipelines
static_configs:
- targets: ['spark-driver:4040']
metrics_path: /metrics/prometheus
- job_name: airflow
static_configs:
- targets: ['airflow-webserver:8080']
metrics_path: /metrics
- job_name: kafka
static_configs:
- targets: ['kafka-broker-1:9308', 'kafka-broker-2:9308']
SLOs for Data Pipelines
Define service-level objectives around data freshness and availability:
| Pipeline | SLO | Measurement |
|---|---|---|
| Orders bronze landing | < 5 min latency from source | now() - max(event_time) |
| Silver transform | Completes within 30 min after bronze | Task duration |
| Gold aggregates | Available by 06:00 UTC daily | Scheduled completion time |
| ML feature store | < 1 h staleness | now() - max(feature_timestamp) |
| Revenue dashboard | 99.5% daily availability | Query success rate |
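Most of these freshness SLOs boil down to now() - max(timestamp) on a critical table. A minimal sketch of a probe that measures this and exposes it as the stale_data_seconds gauge from the metrics list above; psycopg2, the connection string, port 9102, and the table list are assumptions for illustration.
# Minimal sketch of a freshness probe feeding the stale_data_seconds gauge.
import time
import psycopg2
from prometheus_client import Gauge, start_http_server

stale_seconds = Gauge("stale_data_seconds", "Age of the newest record in seconds", ["table"])

TABLES = ["silver.orders", "silver.customers", "gold.revenue_daily"]  # illustrative

def probe(conn) -> None:
    with conn.cursor() as cur:
        for table in TABLES:
            cur.execute(f"SELECT EXTRACT(EPOCH FROM NOW() - MAX(updated_at)) FROM {table}")
            age = cur.fetchone()[0]
            # An empty table has no max(updated_at); report it as infinitely stale.
            stale_seconds.labels(table).set(age if age is not None else float("inf"))

if __name__ == "__main__":
    start_http_server(9102)                      # scraped by Prometheus
    conn = psycopg2.connect("postgresql://metrics_ro@warehouse:5432/analytics")
    while True:
        probe(conn)
        time.sleep(60)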
# Prometheus alert for a data freshness SLO breach
groups:
- name: data_freshness
rules:
- alert: OrdersTableStale
expr: |
(time() - data_platform_last_successful_ingestion_timestamp{table="orders"}) > 300
for: 2m
labels:
severity: warning
team: data-platform
annotations:
summary: "Orders table is stale ({{ $value | humanizeDuration }})"
runbook: "https://wiki.internal/runbooks/stale-orders"
- alert: SilverTransformSLOBreach
expr: |
data_platform_transformation_duration_seconds{pipeline="bronze_to_silver_orders"} > 1800
for: 5m
labels:
severity: critical
annotations:
summary: "Silver transform exceeding 30-min SLO"
Pillar 2: Structured Logging
Log Schema for Pipeline Events
Unstructured logs are nearly useless at scale. Enforce a schema:
# Structured log schema (JSON)
log_event:
timestamp: "2024-01-15T08:32:11.421Z"
level: INFO | WARN | ERROR
pipeline_id: "bronze_to_silver_orders"
run_id: "run_20240115_083200"
stage: "read | transform | write | validate"
table: "silver.orders"
records_in: 142847
records_out: 142831
records_dropped: 16
drop_reason: "schema_mismatch"
duration_ms: 28431
environment: "prod"
spark_app_id: "app-20240115083200-0001"
correlation_id: "req-abc123"
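Emitting events in this shape does not require a logging framework. A minimal sketch using only the Python standard library; the field values mirror the example above and the log_event helper is illustrative.
# Minimal sketch: structured JSON pipeline events with stdlib logging.
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("data_platform")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_event(level: int, pipeline_id: str, run_id: str, stage: str, **fields) -> None:
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": logging.getLevelName(level),
        "pipeline_id": pipeline_id,
        "run_id": run_id,
        "stage": stage,
        **fields,                      # records_in, records_out, drop_reason, ...
    }
    logger.log(level, json.dumps(event))

log_event(
    logging.INFO, "bronze_to_silver_orders", "run_20240115_083200", "write",
    table="silver.orders", records_in=142847, records_out=142831,
    records_dropped=16, drop_reason="schema_mismatch", duration_ms=28431,
    environment="prod", correlation_id="req-abc123",
)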
Terraform: CloudWatch Log Group with Retention
resource "aws_cloudwatch_log_group" "pipeline_logs" {
for_each = toset(["bronze-ingestion", "silver-transform", "gold-aggregate", "data-quality"])
name = "/data-platform/${var.environment}/${each.key}"
retention_in_days = 90
kms_key_id = aws_kms_key.data_platform.arn
tags = {
Pipeline = each.key
Environment = var.environment
}
}
# Log metric filter: extract error counts from structured logs
resource "aws_cloudwatch_log_metric_filter" "pipeline_errors" {
for_each = aws_cloudwatch_log_group.pipeline_logs
name = "${each.key}-errors"
pattern = "{ $.level = "ERROR" }"
log_group_name = each.value.name
metric_transformation {
name = "PipelineErrors"
namespace = "DataPlatform/${var.environment}"
value = "1"
default_value = "0"
dimensions = {
Pipeline = each.key
}
}
}
Pillar 3: Distributed Tracing
OpenTelemetry for Data Pipelines
Distributed tracing lets you follow a record's journey from source to the gold layer:
# OpenTelemetry Collector config
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 10s
send_batch_size: 512
resource:
attributes:
- key: environment
value: prod
action: upsert
exporters:
jaeger:
endpoint: jaeger-collector:14250
tls:
insecure: false
prometheus:
endpoint: "0.0.0.0:8889"
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch, resource]
exporters: [jaeger]
metrics:
receivers: [otlp]
processors: [batch]
exporters: [prometheus]
Instrumenting a Spark Pipeline with OTel
# Add the OTel Java agent to spark-submit. Repeating --conf for the same key overrides earlier values, so all driver JVM options must go into a single spark.driver.extraJavaOptions.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-javaagent:/opt/otel-javaagent.jar -Dotel.service.name=etl-silver-transform -Dotel.exporter.otlp.endpoint=http://otel-collector:4317 -Dotel.traces.sampler=parentbased_traceidratio -Dotel.traces.sampler.arg=0.1" \
  --conf "spark.executor.extraJavaOptions=-javaagent:/opt/otel-javaagent.jar" \
  my-etl-job.jar
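Besides the Java agent, Python-based pipeline steps can create spans by hand so each stage (read, transform, write) shows up in Jaeger. A minimal sketch with the OpenTelemetry Python SDK, assuming the opentelemetry-sdk and opentelemetry-exporter-otlp packages and the same collector endpoint; read_bronze, deduplicate, and write_silver are hypothetical helpers.
# Minimal sketch: manual OTel spans for a Python pipeline step.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "etl-silver-transform"}))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317")))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("data_platform.pipelines")

def bronze_to_silver(run_id: str) -> None:
    # One parent span per run, one child span per stage.
    with tracer.start_as_current_span("bronze_to_silver_orders") as span:
        span.set_attribute("pipeline.run_id", run_id)
        with tracer.start_as_current_span("read"):
            df = read_bronze()                   # hypothetical helpers
        with tracer.start_as_current_span("transform"):
            df = deduplicate(df)
        with tracer.start_as_current_span("write") as write_span:
            rows = write_silver(df)
            write_span.set_attribute("records.written", rows)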
Pillar 4: Data Quality Monitoring
The pillar most teams skip, and the one that matters most to data consumers.
The Data Quality Dimensions
graph LR
A[Data Quality] --> B[Freshness: How recent is it?]
A --> C[Volume: Are row counts normal?]
A --> D[Schema: Did structure change?]
A --> E[Distribution: Are values in expected range?]
A --> F[Referential Integrity: Do FKs resolve?]
A --> G[Uniqueness: Are there unexpected duplicates?]
dbt Data Tests
# schema.yml
version: 2
models:
- name: silver_orders
description: "Cleaned and deduplicated orders"
columns:
- name: order_id
tests:
- not_null
- unique
- name: customer_id
tests:
- not_null
- relationships:
to: ref('silver_customers')
field: customer_id
- name: amount
tests:
- not_null
- dbt_utils.accepted_range:
min_value: 0
max_value: 100000
- name: status
tests:
- accepted_values:
values: ['pending', 'confirmed', 'shipped', 'delivered', 'cancelled']
- name: created_at
tests:
- not_null
- dbt_utils.recency:
datepart: hour
field: created_at
interval: 6
Volume Anomaly Detection
-- SQL: detect volume anomalies against 30-day rolling statistics
WITH daily_volumes AS (
SELECT
DATE(created_at) AS dt,
COUNT(*) AS row_count
FROM silver.orders
WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY 1
),
stats AS (
SELECT
AVG(row_count) AS mean,
STDDEV(row_count) AS stddev
FROM daily_volumes
WHERE dt < CURRENT_DATE -- exclude today
)
SELECT
dv.dt,
dv.row_count,
s.mean,
s.stddev,
ABS(dv.row_count - s.mean) / NULLIF(s.stddev, 0) AS z_score,
CASE WHEN ABS(dv.row_count - s.mean) / NULLIF(s.stddev, 0) > 3
THEN 'ANOMALY' ELSE 'OK' END AS status
FROM daily_volumes dv
CROSS JOIN stats s
WHERE dv.dt = CURRENT_DATE;
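To act on the query, wire it into a scheduled job that logs the result and exports it as a metric. A minimal sketch, assuming psycopg2 and a VOLUME_ANOMALY_SQL constant holding the statement above; the volume_z_score gauge name is illustrative.
# Minimal sketch: run the volume check and surface the result.
import logging
import psycopg2
from prometheus_client import Gauge

volume_z_score = Gauge("volume_z_score", "Z-score of today's row count", ["table"])
logger = logging.getLogger("data_quality")

def check_order_volume(conn) -> bool:
    with conn.cursor() as cur:
        cur.execute(VOLUME_ANOMALY_SQL)          # the SQL statement shown above
        row = cur.fetchone()
        if row is None:                          # no rows landed today at all
            logger.error("volume check: no data for today in silver.orders")
            return False
        dt, row_count, mean, stddev, z_score, status = row
        volume_z_score.labels("silver.orders").set(z_score or 0.0)
        if status == "ANOMALY":
            logger.error("volume anomaly on silver.orders: %s rows, z=%.1f", row_count, z_score)
            return False
        return True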
Alerting Strategy
Alert Routing by Severity
| Severity | Condition | Channel | SLA |
|---|---|---|---|
| P1 - Critical | Data > 4 h stale, pipeline down | PagerDuty + Slack | 15 min |
| P2 - High | Schema drift, volume anomaly > 3σ | Slack #data-alerts | 1 hour |
| P3 - Medium | Quality test failures, slow queries | Jira ticket | 4 hours |
| P4 - Low | Cost anomaly, performance regression | Email digest | Next business day |
Terraform: CloudWatch Alarm + SNS
resource "aws_cloudwatch_metric_alarm" "pipeline_freshness" {
alarm_name = "orders-table-freshness-breach"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "DataStalenessSeconds"
namespace = "DataPlatform/prod"
period = 300
statistic = "Maximum"
threshold = 14400 # 4 hours
alarm_description = "Orders table data is older than 4 hours"
treat_missing_data = "breaching"
alarm_actions = [aws_sns_topic.data_platform_alerts.arn]
ok_actions = [aws_sns_topic.data_platform_alerts.arn]
dimensions = {
Table = "orders"
}
}
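The alarm watches the custom DataStalenessSeconds metric, so something has to publish it. A minimal sketch using boto3's put_metric_data, for example called from the freshness probe shown earlier; only the metric, namespace, and dimension names are taken from the Terraform above, the rest is illustrative.
# Minimal sketch: publish the staleness metric the CloudWatch alarm watches.
import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_staleness(table: str, staleness_seconds: float, environment: str = "prod") -> None:
    cloudwatch.put_metric_data(
        Namespace=f"DataPlatform/{environment}",
        MetricData=[{
            "MetricName": "DataStalenessSeconds",
            "Dimensions": [{"Name": "Table", "Value": table}],
            "Value": staleness_seconds,
            "Unit": "Seconds",
        }],
    )

# e.g. called from the freshness probe loop:
# publish_staleness("orders", age_in_seconds)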
Harbinger Explorer: Purpose-Built Observability for Data
While Prometheus/Grafana cover infrastructure metrics and dbt covers test-level quality, platforms like Harbinger Explorer bridge the gap: real-time data lineage, automatic anomaly detection, and cross-pipeline SLO tracking without custom instrumentation per pipeline.
Key capabilities for data platform teams:
- Automatic lineage: understand downstream impact before you make a change
- Statistical quality baselines: normal volume and distribution ranges learned automatically
- Schema change notifications: instant alerts on upstream schema drift
- Pipeline health scorecards: SLO compliance at the dataset level
Quick-Start Observability Checklist
# Verify that the pipeline exposes Prometheus metrics
curl -s http://spark-driver:4040/metrics/prometheus | grep records_processed
# Check the freshness of a critical table
psql -c "SELECT MAX(updated_at), NOW() - MAX(updated_at) AS age FROM silver.orders"
# Run dbt tests
dbt test --select silver_orders --store-failures
# Check Kafka consumer lag
kafka-consumer-groups.sh --bootstrap-server kafka:9092 --group etl-silver-transform --describe | awk '{print $5, $6}'
FAQ
Where do I start with observability if I have nothing yet? With freshness metrics. One metric per critical table that measures how old the newest record is. Set a 4-hour threshold and alert on it. This has the highest signal-to-noise ratio of anything you can do.
Prometheus + Grafana or Datadog? Self-hosted Prometheus/Grafana: roughly 200-500 EUR per month in infrastructure for teams under 50. Datadog: 15-23 USD per host per month, which can quickly reach four figures per month. Pick Datadog if you have no ops capacity; otherwise Prometheus.
Do I really need distributed tracing? Only once your pipelines are complex enough that you can no longer work out why a particular pipeline run was slow. Below roughly 20 pipelines, good logs plus metrics are usually enough.
How do I write SLOs the team won't ignore? SLOs have to be achievable and reflect business value. Start at 99%, not 99.9% (at 99%, the error budget is roughly 7 hours per month). Measure what stakeholders actually care about (dashboard available, data fresh), not what is easy to measure.
Summary
Production data platform observability needs all four pillars:
- Metrics: SLOs around freshness, throughput, and error rate
- Logs: structured JSON with correlation IDs and record counts
- Traces: end-to-end lineage with OpenTelemetry
- Data quality: freshness, volume, schema, and distribution checks on every layer
Start with freshness metrics and dbt tests; they give the highest signal-to-noise ratio. Add distributed tracing once you have stable pipelines that you need to optimize.
Try Harbinger Explorer free for 7 days and bring production-grade observability to your cloud data platform: automatic anomaly detection, lineage tracking, and SLO dashboards out of the box.
Last updated: May 14, 2026.
Written by
Harbinger Team
Cloud, data, and AI engineers in the DACH region. Writing since 2018 about infrastructure-critical tech decisions: no marketing slides, just real trade-offs from production workloads.