Designing SLAs for Data Platforms: Reliability Engineering for Data
Site Reliability Engineering (SRE) gave us a principled framework for web service reliability: SLIs, SLOs, and SLAs, grounded in error budgets and blameless post-mortems. Data platforms need the same rigour — but the concepts don't map directly. A data pipeline is not an API endpoint. Its "availability" is more nuanced, and its failures are often silent.
This guide adapts reliability engineering for data platforms, providing a concrete framework that platform engineers can implement and enforce.
Why Data Platforms Need SLAs
Data pipelines fail in ways that are uniquely dangerous:
| Failure Mode | Why It's Dangerous |
|---|---|
| Late data | Dashboards show stale metrics; decisions made on yesterday's data |
| Silent data loss | Rows dropped without error; no alert fires |
| Schema drift | Downstream queries break; reports show nulls |
| Duplicate records | Analytics overcounts; financial reports are wrong |
| Data quality degradation | Slowly corrupted data; no single failure event to diagnose |
Traditional infrastructure SLAs (uptime %) don't capture these. You need data-specific SLIs.
The SLI/SLO/SLA Framework for Data
Definitions
SLI (Service Level Indicator)
↓ measured by
SLO (Service Level Objective)
↓ commitment basis for
SLA (Service Level Agreement)
↓ breach triggers
Consequences (credits, escalations, etc.)
Data SLI Taxonomy
| SLI Category | Example Metric | Measurement Method |
|---|---|---|
| Freshness | Max age of latest row in fact table | NOW() - MAX(updated_at) |
| Completeness | % of expected records received | Row count vs. source system |
| Accuracy | % of records passing validation rules | dbt test pass rate |
| Consistency | Referential integrity violations | FK constraint checks |
| Timeliness | Pipeline completion time vs. SLA window | Airflow task duration |
| Availability | % of time tables are queryable | Probe query success rate |
Defining SLOs
SLOs should be set at the business impact level, not the infrastructure level. Work backwards from how consumers use your data:
Example SLO Table
| Table / Dataset | Freshness SLO | Completeness SLO | Accuracy SLO | Measurement Window |
|---|---|---|---|---|
| `fact_geopolitical_events` | ≤ 15 min lag | ≥ 99.5% of source records | ≥ 99.9% pass validation | Rolling 30 days |
| `dim_countries` | ≤ 24 hours | 100% (bounded set) | 100% | Rolling 30 days |
| `agg_risk_scores_hourly` | ≤ 90 min lag | ≥ 99% of hours populated | ≥ 98% within expected range | Rolling 30 days |
| `mart_executive_dashboard` | ≤ 1 hour lag | ≥ 99.9% | ≥ 99.9% | Rolling 30 days |
Error Budgets
An error budget is the allowed amount of SLO violation:
Error Budget = 1 - SLO target
For a freshness SLO of 99.5% (measured over 30 days = 43,200 minutes):
Error Budget = 0.5% × 43,200 = 216 minutes of allowed lag violation per month
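The same arithmetic generalises to any target and window. A minimal sketch (the function name is illustrative, not part of any standard library):

```python
def error_budget_minutes(slo_target_pct: float, window_days: int = 30) -> float:
    """Allowed minutes of SLO violation over the measurement window."""
    window_minutes = window_days * 24 * 60
    return round((1 - slo_target_pct / 100) * window_minutes, 2)

print(error_budget_minutes(99.5))   # 216.0 — matches the worked example above
print(error_budget_minutes(99.9))   # 43.2
```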
When the error budget is exhausted:
- Freeze new feature deployments
- All engineering effort goes to reliability
- Post-mortem required before resuming
Implementing Data SLIs with dbt and SQL
Freshness Check
-- models/monitoring/sli_freshness.sql
SELECT
    table_name,
    MAX(event_timestamp) AS latest_record,
    EXTRACT(EPOCH FROM (NOW() - MAX(event_timestamp))) / 60 AS lag_minutes,
    CASE
        WHEN EXTRACT(EPOCH FROM (NOW() - MAX(event_timestamp))) / 60 <= 15 THEN 'OK'
        WHEN EXTRACT(EPOCH FROM (NOW() - MAX(event_timestamp))) / 60 <= 30 THEN 'WARNING'
        ELSE 'BREACH'
    END AS slo_status
FROM (
    SELECT 'fact_geopolitical_events' AS table_name, event_timestamp
    FROM {{ ref('fact_geopolitical_events') }}
    UNION ALL
    SELECT 'agg_risk_scores_hourly', hour_timestamp
    FROM {{ ref('agg_risk_scores_hourly') }}
) t
GROUP BY table_name;
Completeness Check
-- dbt test: flag rows present in the source but missing from the target
-- tests/assert_completeness.sql
{% set max_allowed_missing_rows = 100 %}

SELECT COUNT(*) AS missing_count
FROM (
    SELECT event_id FROM {{ source('raw', 'events') }}
    EXCEPT
    SELECT event_id FROM {{ ref('fact_geopolitical_events') }}
) missing
HAVING COUNT(*) > {{ max_allowed_missing_rows }};
dbt Generic Tests for Data Quality SLOs
# models/schema.yml
version: 2

models:
  - name: fact_geopolitical_events
    description: "Core fact table for geopolitical events"
    meta:
      slo_freshness_minutes: 15
      slo_completeness_pct: 99.5
      slo_owner: "platform-team@harbinger.com"
    columns:
      - name: event_id
        tests:
          - unique
          - not_null
      - name: country_code
        tests:
          - not_null
          - accepted_values:
              values: "{{ var('valid_country_codes') }}"
      - name: severity_score
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0.0
              max_value: 10.0
      - name: event_timestamp
        tests:
          - not_null
          - dbt_utils.recency:
              datepart: minute
              interval: 30  # must have data within the last 30 minutes
Architecture: SLO Monitoring Pipeline
The pipeline is deliberately simple: SLI probes write pass/fail results to an `slo_measurements` table, a calculator job evaluates each table against its targets and computes error budget consumption, and the results feed dashboards and Prometheus alerting.
Implementing the SLO Calculator
# slo_calculator.py — runs as a Kubernetes CronJob every 5 minutes
from datetime import datetime, timedelta, timezone

import psycopg2  # the caller opens the connection and passes it in

SLO_CONFIG = {
    "fact_geopolitical_events": {
        "freshness_slo_minutes": 15,
        "completeness_slo_pct": 99.5,
        "window_days": 30,
    },
    "agg_risk_scores_hourly": {
        "freshness_slo_minutes": 90,
        "completeness_slo_pct": 99.0,
        "window_days": 30,
    },
}


def calculate_error_budget(table: str, slo_pct: float, conn,
                           window_days: int = 30) -> dict:
    """Calculate current error budget consumption for a table."""
    window_start = datetime.now(timezone.utc) - timedelta(days=window_days)
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT
                COUNT(*) AS total_measurements,
                SUM(CASE WHEN slo_met THEN 1 ELSE 0 END) AS passing_measurements
            FROM slo_measurements
            WHERE table_name = %s AND measured_at >= %s
            """,
            (table, window_start),
        )
        total, passing = cur.fetchone()

    if not total:
        return {"table": table, "slo_target_pct": slo_pct, "status": "NO_DATA"}

    actual_pct = round(100.0 * passing / total, 4)
    allowed_failures = total * (1 - slo_pct / 100)
    actual_failures = total - passing
    # Guard against division by zero when the SLO target is 100%
    if allowed_failures == 0:
        budget_remaining_pct = 0.0 if actual_failures else 100.0
    else:
        budget_remaining_pct = max(
            0.0, (allowed_failures - actual_failures) / allowed_failures * 100
        )

    return {
        "table": table,
        "slo_target_pct": slo_pct,
        "actual_pct": actual_pct,
        "budget_remaining_pct": budget_remaining_pct,
        "status": (
            "OK" if budget_remaining_pct > 50
            else "WARNING" if budget_remaining_pct > 0
            else "EXHAUSTED"
        ),
    }
Alerting Strategy
Alert Fatigue is the Enemy
Don't alert on every SLI deviation — alert on error budget burn rate:
| Burn Rate | Alert Severity | Action |
|---|---|---|
| > 14.4x (30-day budget gone in ~2 days) | PAGE | Immediate incident response |
| > 6x (budget gone in ~5 days) | URGENT | Wake on-call within 30 min |
| > 3x (budget gone in ~10 days) | WARNING | Investigate within 1 hour |
| > 1x (budget gone within the window) | INFO | Review in daily standup |
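Burn rate is just the observed failure rate divided by the rate the SLO budget allows, and a 30-day budget survives 720 / burn-rate hours. A small sketch (function names are mine, not a standard API):

```python
def burn_rate(failed_minutes: float, elapsed_minutes: float,
              slo_target_pct: float) -> float:
    """Observed failure rate divided by the rate the SLO allows.
    1.0 means the budget runs out exactly at the end of the window."""
    allowed_failure_fraction = 1 - slo_target_pct / 100
    observed_failure_fraction = failed_minutes / elapsed_minutes
    return observed_failure_fraction / allowed_failure_fraction

def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """At a constant burn rate, hours until the error budget is gone."""
    return window_days * 24 / rate

# 6 failing minutes in the last hour against a 99.5% SLO burns at ~20x;
# a 14.4x burn exhausts a 30-day budget in 50 hours, roughly two days.
```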
# Prometheus alert rules for data SLOs
groups:
  - name: data_platform_slo
    rules:
      - alert: DataFreshnessBreachImmediate
        expr: data_slo_error_budget_burn_rate{table="fact_geopolitical_events"} > 14.4
        for: 2m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Data SLO error budget burning fast: {{ $labels.table }}"
          description: |
            Table {{ $labels.table }} is burning error budget at {{ printf "%.1f" $value }}x
            the sustainable rate; at this pace a 30-day budget is exhausted in
            roughly 720 / burn-rate hours.
            Runbook: https://wiki.harbinger.com/runbooks/data-slo-breach
Organisational Practices
SLO Review Cadence
| Cadence | Review |
|---|---|
| Daily | SLO dashboard review in standup (5 min) |
| Weekly | Error budget report: budget consumed, top incidents |
| Monthly | SLO target review: are targets still calibrated to business needs? |
| Quarterly | SLA negotiation with internal stakeholders |
The Data Incident Process
When an SLO breach occurs:
1. Detection (automated alert fires)
2. Triage (on-call assesses impact: which consumers affected, severity)
3. Mitigation (restore data quality; may include backfill, reruns, or rollback)
4. Communication (notify data consumers via status page update)
5. Post-mortem (within 48 hours; timeline + root cause + action items)
6. Action items (tracked in sprint; owner assigned, deadline set)
Post-Mortem Template
## Incident: [Title]
**Date**: YYYY-MM-DD
**Duration**: X hours Y minutes
**SLO Breach**: Freshness SLO for `fact_geopolitical_events` (lag exceeded 15 min for 3.2 hours)
**Impact**: Risk score dashboard showed stale data; 3 analyst teams affected
### Timeline
- 14:32 — Airflow DAG `event_enrichment` began; upstream API rate-limited
- 14:45 — Lag exceeded 15-minute SLO threshold; alert not fired (bug in alert config)
- 17:12 — On-call noticed stale dashboard during routine check
- 17:18 — Incident declared; pipeline rerun initiated
- 17:41 — Data freshness restored; SLO met
### Root Cause
Upstream news API began rate-limiting at 14:30 due to increased ingestion volume.
Retry logic used exponential backoff but max retries (3) were exhausted silently.
### Contributing Factors
- Alert misconfiguration: `for: 30m` instead of `for: 2m` on freshness alert
- No dead-letter monitoring for failed API calls
### Action Items
| Action | Owner | Due |
|---|---|---|
| Fix alert `for` duration to 2m | @platform-team | 2024-01-20 |
| Add DLQ monitoring for API ingestion | @platform-team | 2024-01-25 |
| Implement adaptive rate limit handling | @data-eng | 2024-02-01 |
SLA Tiers for Internal Consumers
Not all data consumers have the same requirements. Define tiers:
| Tier | Freshness | Completeness | Consumers |
|---|---|---|---|
| Platinum | ≤ 5 min | ≥ 99.9% | Executive dashboards, real-time alerts |
| Gold | ≤ 30 min | ≥ 99.5% | Operational dashboards, analyst reports |
| Silver | ≤ 4 hours | ≥ 99% | Ad-hoc analysis, data science exploration |
| Bronze | ≤ 24 hours | ≥ 98% | Historical archives, compliance reporting |
This tiering allows platform teams to allocate reliability investment proportionally — spending disproportionately on the pipelines that feed Platinum consumers.
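The tier table can also be made machine-readable so that alert routing and CI checks share one source of truth. A minimal sketch (the `TierSLO` structure and helper are illustrative, mirroring the table above):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TierSLO:
    freshness_minutes: int     # maximum acceptable lag
    completeness_pct: float    # minimum % of expected records

# Targets mirror the tier table above
TIERS = {
    "platinum": TierSLO(freshness_minutes=5,    completeness_pct=99.9),
    "gold":     TierSLO(freshness_minutes=30,   completeness_pct=99.5),
    "silver":   TierSLO(freshness_minutes=240,  completeness_pct=99.0),
    "bronze":   TierSLO(freshness_minutes=1440, completeness_pct=98.0),
}

def slo_for(tier: str) -> TierSLO:
    """Look up tier targets; unknown tiers fail loudly rather than defaulting."""
    try:
        return TIERS[tier.lower()]
    except KeyError:
        raise ValueError(f"Unknown SLA tier: {tier!r}") from None
```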
Conclusion
Data SLAs done right transform vague complaints about "the data being wrong" into measurable, actionable commitments. The SLI/SLO/error budget framework gives platform teams the language to have honest conversations with stakeholders about the cost of reliability — and the mandate to protect it.
Platforms like Harbinger Explorer depend on this rigour to deliver intelligence that decision-makers can trust. When geopolitical risk scores feed real decisions, "the data was stale" is not an acceptable answer.
Try Harbinger Explorer free for 7 days — intelligence you can rely on, built on SLAs that mean something.