Designing SLAs for Data Platforms: Reliability Engineering for Data

Tags: sla, slo, data-quality, reliability, platform-engineering, observability
Site Reliability Engineering (SRE) gave us a principled framework for web service reliability: SLIs, SLOs, and SLAs, grounded in error budgets and blameless post-mortems. Data platforms need the same rigour — but the concepts don't map directly. A data pipeline is not an API endpoint. Its "availability" is more nuanced, and its failures are often silent.

This guide adapts reliability engineering for data platforms, providing a concrete framework that platform engineers can implement and enforce.


Why Data Platforms Need SLAs

Data pipelines fail in ways that are uniquely dangerous:

| Failure Mode | Why It's Dangerous |
|---|---|
| Late data | Dashboards show stale metrics; decisions made on yesterday's data |
| Silent data loss | Rows dropped without error; no alert fires |
| Schema drift | Downstream queries break; reports show nulls |
| Duplicate records | Analytics overcounts; financial reports are wrong |
| Data quality degradation | Slowly corrupted data; no single failure event to diagnose |

Traditional infrastructure SLAs (uptime %) don't capture these. You need data-specific SLIs.


The SLI/SLO/SLA Framework for Data

Definitions

SLI (Service Level Indicator)
  ↓ measured by
SLO (Service Level Objective)
  ↓ commitment basis for
SLA (Service Level Agreement)
  ↓ breach triggers
Consequences (credits, escalations, etc.)
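The chain above can be made concrete as configuration. A minimal sketch in Python — the field names and the example consequence are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class SLO:
    sli: str          # what is measured, e.g. freshness lag in minutes
    target: float     # internal objective, e.g. 0.995 (99.5%)
    window_days: int  # measurement window

@dataclass
class SLA:
    slo: SLO          # the objective the agreement commits to
    consequence: str  # what a breach triggers

freshness_slo = SLO(sli="freshness_lag_minutes", target=0.995, window_days=30)
freshness_sla = SLA(slo=freshness_slo, consequence="escalation to platform on-call")
print(freshness_sla.slo.target)  # 0.995
```

Keeping the SLO as a standalone object matters: the internal objective should be stricter than (and independently tunable from) whatever SLA is negotiated on top of it.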

Data SLI Taxonomy

| SLI Category | Example Metric | Measurement Method |
|---|---|---|
| Freshness | Max age of latest row in fact table | `NOW() - MAX(updated_at)` |
| Completeness | % of expected records received | Row count vs. source system |
| Accuracy | % of records passing validation rules | dbt test pass rate |
| Consistency | Referential integrity violations | FK constraint checks |
| Timeliness | Pipeline completion time vs. SLA window | Airflow task duration |
| Availability | % of time tables are queryable | Probe query success rate |
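The availability SLI in the last row can be measured with a scheduled probe query. A minimal sketch, using SQLite as a stand-in so it runs anywhere — a production probe would point at the actual warehouse (Postgres, Snowflake, etc.), and the function name is illustrative:

```python
import sqlite3

def probe_table(db_path: str, table: str) -> bool:
    """Return True if the table answers a trivial query, i.e. is queryable."""
    try:
        with sqlite3.connect(db_path, timeout=5) as conn:
            conn.execute(f"SELECT 1 FROM {table} LIMIT 1").fetchone()
        return True
    except sqlite3.Error:
        return False

# The availability SLI is then: successful probes / total probes
# over the measurement window.
```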

Defining SLOs

SLOs should be set at the business impact level, not the infrastructure level. Work backwards from how consumers use your data:

Example SLO Table

| Table / Dataset | Freshness SLO | Completeness SLO | Accuracy SLO | Measurement Window |
|---|---|---|---|---|
| `fact_geopolitical_events` | ≤ 15 min lag | ≥ 99.5% of source records | ≥ 99.9% pass validation | Rolling 30 days |
| `dim_countries` | ≤ 24 hours | 100% (bounded set) | 100% | Rolling 30 days |
| `agg_risk_scores_hourly` | ≤ 90 min lag | ≥ 99% of hours populated | ≥ 98% within expected range | Rolling 30 days |
| `mart_executive_dashboard` | ≤ 1 hour lag | ≥ 99.9% | ≥ 99.9% | Rolling 30 days |

Error Budgets

An error budget is the allowed amount of SLO violation:

Error Budget = 1 - SLO target

For freshness SLO of 99.5% (measured over 30 days = 43,200 minutes):
Error Budget = 0.5% × 43,200 = 216 minutes of allowed lag violation per month

When the error budget is exhausted:

  • Freeze new feature deployments
  • All engineering effort goes to reliability
  • Post-mortem required before resuming
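The budget arithmetic in the worked example above can be scripted directly (a sketch; function and constant names are illustrative):

```python
# Error budget arithmetic for the worked example: a 99.5% SLO
# measured over a 30-day window.

WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in 30 days

def error_budget_minutes(slo_target: float, window_minutes: int = WINDOW_MINUTES) -> float:
    """Allowed minutes of SLO violation per window: (1 - target) * window."""
    return (1 - slo_target) * window_minutes

print(round(error_budget_minutes(0.995)))  # 216 minutes of allowed violation per month
```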

Implementing Data SLIs with dbt and SQL

Freshness Check

```sql
-- models/monitoring/sli_freshness.sql
SELECT
    table_name,
    MAX(event_timestamp) AS latest_record,
    EXTRACT(EPOCH FROM (NOW() - MAX(event_timestamp))) / 60 AS lag_minutes,
    CASE
        WHEN EXTRACT(EPOCH FROM (NOW() - MAX(event_timestamp))) / 60 <= 15 THEN 'OK'
        WHEN EXTRACT(EPOCH FROM (NOW() - MAX(event_timestamp))) / 60 <= 30 THEN 'WARNING'
        ELSE 'BREACH'
    END AS slo_status
FROM (
    SELECT 'fact_geopolitical_events' AS table_name, event_timestamp
    FROM {{ ref('fact_geopolitical_events') }}
    UNION ALL
    SELECT 'agg_risk_scores_hourly', hour_timestamp
    FROM {{ ref('agg_risk_scores_hourly') }}
) t
GROUP BY table_name;
```

Completeness Check

```sql
-- tests/assert_completeness.sql
-- dbt test: compare row counts between source and target.
-- The test fails when more than `max_missing_rows` source records
-- are absent from the target.
{% set max_missing_rows = 100 %}

SELECT COUNT(*) AS missing_count
FROM (
    SELECT event_id FROM {{ source('raw', 'events') }}
    EXCEPT
    SELECT event_id FROM {{ ref('fact_geopolitical_events') }}
) missing
HAVING COUNT(*) > {{ max_missing_rows }};
```

dbt Generic Tests for Data Quality SLOs

```yaml
# models/schema.yml
version: 2

models:
  - name: fact_geopolitical_events
    description: "Core fact table for geopolitical events"
    meta:
      slo_freshness_minutes: 15
      slo_completeness_pct: 99.5
      slo_owner: "platform-team@harbinger.com"

    # dbt_utils.recency is a model-level test and takes `field` explicitly
    tests:
      - dbt_utils.recency:
          datepart: minute
          field: event_timestamp
          interval: 30  # must have data within the last 30 minutes

    columns:
      - name: event_id
        tests:
          - unique
          - not_null

      - name: country_code
        tests:
          - not_null
          - accepted_values:
              values: "{{ var('valid_country_codes') }}"

      - name: severity_score
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0.0
              max_value: 10.0

      - name: event_timestamp
        tests:
          - not_null
```

Architecture: SLO Monitoring Pipeline


Implementing the SLO Calculator

```python
# slo_calculator.py — runs as a Kubernetes CronJob every 5 minutes

import psycopg2
from datetime import datetime, timedelta, timezone

SLO_CONFIG = {
    "fact_geopolitical_events": {
        "freshness_slo_minutes": 15,
        "completeness_slo_pct": 99.5,
        "window_days": 30,
    },
    "agg_risk_scores_hourly": {
        "freshness_slo_minutes": 90,
        "completeness_slo_pct": 99.0,
        "window_days": 30,
    },
}

def calculate_error_budget(table: str, slo_pct: float, conn, window_days: int = 30) -> dict:
    """Calculate current error budget consumption for a table."""
    window_start = datetime.now(timezone.utc) - timedelta(days=window_days)

    with conn.cursor() as cur:
        cur.execute(
            "SELECT COUNT(*) AS total_measurements, "
            "SUM(CASE WHEN slo_met THEN 1 ELSE 0 END) AS passing_measurements "
            "FROM slo_measurements "
            "WHERE table_name = %s AND measured_at >= %s",
            (table, window_start),
        )
        total, passing = cur.fetchone()

    if not total:
        return {"table": table, "status": "NO_DATA"}

    actual_pct = 100.0 * passing / total
    allowed_failures = total * (1 - slo_pct / 100)
    actual_failures = total - passing

    # Guard against division by zero when the SLO target is 100%
    if allowed_failures == 0:
        budget_remaining_pct = 0.0 if actual_failures else 100.0
    else:
        budget_remaining_pct = max(
            0.0, (allowed_failures - actual_failures) / allowed_failures * 100
        )

    return {
        "table": table,
        "slo_target_pct": slo_pct,
        "actual_pct": round(actual_pct, 4),
        "budget_remaining_pct": budget_remaining_pct,
        "status": "OK" if budget_remaining_pct > 50
                  else "WARNING" if budget_remaining_pct > 0
                  else "EXHAUSTED",
    }
```

Alerting Strategy

Alert Fatigue is the Enemy

Don't alert on every SLI deviation — alert on error budget burn rate:

| Burn Rate | Alert Severity | Action |
|---|---|---|
| > 14.4x (30-day budget gone in ~2 days) | PAGE | Immediate incident response |
| > 6x (budget gone in ~5 days) | URGENT | Wake on-call within 30 min |
| > 3x (budget gone in ~10 days) | WARNING | Investigate within 1 hour |
| > 1x (on track to exhaust within the window) | INFO | Review in daily standup |
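The burn rate itself is just the observed failure rate divided by the rate the error budget allows. A sketch of that calculation and the severity mapping (thresholds follow the table above; names are illustrative):

```python
def burn_rate(bad_fraction: float, slo_target: float) -> float:
    """Observed failure rate divided by the failure rate the SLO allows."""
    allowed = 1 - slo_target
    if allowed <= 0:
        raise ValueError("A 100% SLO leaves no error budget to burn")
    return bad_fraction / allowed

def alert_severity(rate: float) -> str:
    """Map a burn rate to the alert severities in the table above."""
    if rate > 14.4:
        return "PAGE"
    if rate > 6:
        return "URGENT"
    if rate > 3:
        return "WARNING"
    if rate > 1:
        return "INFO"
    return "OK"

# A 99.5% SLO tolerates 0.5% bad measurements; if 7.5% of the last
# hour's freshness checks breached, the budget burns at ~15x:
rate = burn_rate(0.075, 0.995)
print(round(rate, 1), alert_severity(rate))  # 15.0 PAGE
```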
```yaml
# Prometheus alert rules for data SLOs
groups:
  - name: data_platform_slo
    rules:
      - alert: DataFreshnessBreachImmediate
        expr: |
          data_slo_error_budget_burn_rate{table="fact_geopolitical_events"} > 14.4
        for: 2m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Data SLO error budget burning fast: {{ $labels.table }}"
          description: |
            Table {{ $labels.table }} is burning error budget at {{ $value }}x rate.
            At this rate the 30-day budget will be exhausted in roughly two days.
            Runbook: https://wiki.harbinger.com/runbooks/data-slo-breach
```

Organisational Practices

SLO Review Cadence

| Cadence | Review |
|---|---|
| Daily | SLO dashboard review in standup (5 min) |
| Weekly | Error budget report: budget consumed, top incidents |
| Monthly | SLO target review: are targets still calibrated to business needs? |
| Quarterly | SLA negotiation with internal stakeholders |

The Data Incident Process

When an SLO breach occurs:

  1. Detection (automated alert fires)
  2. Triage (on-call assesses impact: which consumers affected, severity)
  3. Mitigation (restore data quality; may include backfill, reruns, or rollback)
  4. Communication (notify data consumers via status page update)
  5. Post-mortem (within 48 hours; timeline + root cause + action items)
  6. Action items (tracked in sprint; owner assigned, deadline set)

Post-Mortem Template

```markdown
## Incident: [Title]
**Date**: YYYY-MM-DD
**Duration**: X hours Y minutes
**SLO Breach**: Freshness SLO for `fact_geopolitical_events` (lag exceeded 15min for 3.2 hours)
**Impact**: Risk score dashboard showed stale data; 3 analyst teams affected

### Timeline
- 14:32 — Airflow DAG `event_enrichment` began; upstream API rate-limited
- 14:45 — Lag exceeded 15-minute SLO threshold; alert not fired (bug in alert config)
- 17:12 — On-call noticed stale dashboard during routine check
- 17:18 — Incident declared; pipeline rerun initiated
- 17:41 — Data freshness restored; SLO met

### Root Cause
Upstream news API began rate-limiting at 14:30 due to increased ingestion volume.
Retry logic used exponential backoff but max retries (3) were exhausted silently.

### Contributing Factors
- Alert misconfiguration: `for: 30m` instead of `for: 2m` on freshness alert
- No dead-letter monitoring for failed API calls

### Action Items
| Action | Owner | Due |
|---|---|---|
| Fix alert `for` duration to 2m | @platform-team | 2024-01-20 |
| Add DLQ monitoring for API ingestion | @platform-team | 2024-01-25 |
| Implement adaptive rate limit handling | @data-eng | 2024-02-01 |
```

SLA Tiers for Internal Consumers

Not all data consumers have the same requirements. Define tiers:

| Tier | Freshness | Completeness | Consumers |
|---|---|---|---|
| Platinum | ≤ 5 min | ≥ 99.9% | Executive dashboards, real-time alerts |
| Gold | ≤ 30 min | ≥ 99.5% | Operational dashboards, analyst reports |
| Silver | ≤ 4 hours | ≥ 99% | Ad-hoc analysis, data science exploration |
| Bronze | ≤ 24 hours | ≥ 98% | Historical archives, compliance reporting |

This tiering allows platform teams to allocate reliability investment proportionally — spending disproportionately on the pipelines that feed Platinum consumers.
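One practical consequence of tiering: a shared dataset inherits the strictest tier among its consumers. A sketch of that rule as a lookup a scheduler or catalog might consult (tier values mirror the table above; names are illustrative):

```python
# Tier targets from the table above, keyed for programmatic lookup.
TIERS = {
    "platinum": {"freshness_minutes": 5, "completeness_pct": 99.9},
    "gold": {"freshness_minutes": 30, "completeness_pct": 99.5},
    "silver": {"freshness_minutes": 240, "completeness_pct": 99.0},
    "bronze": {"freshness_minutes": 1440, "completeness_pct": 98.0},
}

def strictest_tier(consumer_tiers: list[str]) -> str:
    """A dataset inherits the strictest tier among its consumers."""
    order = ["platinum", "gold", "silver", "bronze"]
    return min(consumer_tiers, key=order.index)

# A dataset read by both silver and gold consumers must meet gold targets:
print(strictest_tier(["silver", "gold"]))  # gold
```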


Conclusion

Data SLAs done right transform vague complaints about "the data being wrong" into measurable, actionable commitments. The SLI/SLO/error budget framework gives platform teams the language to have honest conversations with stakeholders about the cost of reliability — and the mandate to protect it.

Platforms like Harbinger Explorer depend on this rigour to deliver intelligence that decision-makers can trust. When geopolitical risk scores feed real decisions, "the data was stale" is not an acceptable answer.


Try Harbinger Explorer free for 7 days — intelligence you can rely on, built on SLAs that mean something.

