Designing SLAs for Data Platforms: Reliability Engineering for Data

Tags: sla, slo, data-quality, reliability, platform-engineering, observability
Site Reliability Engineering (SRE) gave us a principled framework for web service reliability: SLIs, SLOs, and SLAs, grounded in error budgets and blameless post-mortems. Data platforms need the same rigour — but the concepts don't map directly. A data pipeline is not an API endpoint. Its "availability" is more nuanced, and its failures are often silent.

This guide adapts reliability engineering for data platforms, providing a concrete framework that platform engineers can implement and enforce.


Why Data Platforms Need SLAs

Data pipelines fail in ways that are uniquely dangerous:

| Failure Mode | Why It's Dangerous |
|---|---|
| Late data | Dashboards show stale metrics; decisions made on yesterday's data |
| Silent data loss | Rows dropped without error; no alert fires |
| Schema drift | Downstream queries break; reports show nulls |
| Duplicate records | Analytics overcounts; financial reports are wrong |
| Data quality degradation | Slowly corrupted data; no single failure event to diagnose |

Traditional infrastructure SLAs (uptime %) don't capture these. You need data-specific SLIs.


The SLI/SLO/SLA Framework for Data

Definitions

SLI (Service Level Indicator)
  ↓ measured by
SLO (Service Level Objective)
  ↓ commitment basis for
SLA (Service Level Agreement)
  ↓ breach triggers
Consequences (credits, escalations, etc.)
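The chain above can be made concrete as configuration. A minimal sketch in Python — the field names and the example consequence are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class SLO:
    sli: str          # what is measured, e.g. freshness lag in minutes
    target: float     # internal objective, e.g. 0.995 (99.5%)
    window_days: int  # measurement window

@dataclass
class SLA:
    slo: SLO          # the objective the agreement commits to
    consequence: str  # what a breach triggers

freshness_slo = SLO(sli="freshness_lag_minutes", target=0.995, window_days=30)
freshness_sla = SLA(slo=freshness_slo, consequence="escalation to platform on-call")
print(freshness_sla.slo.target)  # 0.995
```

Keeping the SLO as a standalone object matters: the internal objective should be stricter than (and independently tunable from) whatever SLA is negotiated on top of it.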

Data SLI Taxonomy

| SLI Category | Example Metric | Measurement Method |
|---|---|---|
| Freshness | Max age of latest row in fact table | `NOW() - MAX(updated_at)` |
| Completeness | % of expected records received | Row count vs. source system |
| Accuracy | % of records passing validation rules | dbt test pass rate |
| Consistency | Referential integrity violations | FK constraint checks |
| Timeliness | Pipeline completion time vs. SLA window | Airflow task duration |
| Availability | % of time tables are queryable | Probe query success rate |
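The availability SLI in the last row can be measured with a scheduled probe query. A minimal sketch, using SQLite as a stand-in so it runs anywhere — a production probe would point at the actual warehouse (Postgres, Snowflake, etc.), and the function name is illustrative:

```python
import sqlite3

def probe_table(db_path: str, table: str) -> bool:
    """Return True if the table answers a trivial query, i.e. is queryable."""
    try:
        with sqlite3.connect(db_path, timeout=5) as conn:
            conn.execute(f"SELECT 1 FROM {table} LIMIT 1").fetchone()
        return True
    except sqlite3.Error:
        return False

# The availability SLI is then: successful probes / total probes
# over the measurement window.
```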

Defining SLOs

SLOs should be set at the business impact level, not the infrastructure level. Work backwards from how consumers use your data:

Example SLO Table

| Table / Dataset | Freshness SLO | Completeness SLO | Accuracy SLO | Measurement Window |
|---|---|---|---|---|
| `fact_geopolitical_events` | ≤ 15 min lag | ≥ 99.5% of source records | ≥ 99.9% pass validation | Rolling 30 days |
| `dim_countries` | ≤ 24 hours | 100% (bounded set) | 100% | Rolling 30 days |
| `agg_risk_scores_hourly` | ≤ 90 min lag | ≥ 99% of hours populated | ≥ 98% within expected range | Rolling 30 days |
| `mart_executive_dashboard` | ≤ 1 hour lag | ≥ 99.9% | ≥ 99.9% | Rolling 30 days |

Error Budgets

An error budget is the allowed amount of SLO violation:

Error Budget = 1 - SLO target

For freshness SLO of 99.5% (measured over 30 days = 43,200 minutes):
Error Budget = 0.5% × 43,200 = 216 minutes of allowed lag violation per month

When the error budget is exhausted:

  • Freeze new feature deployments
  • All engineering effort goes to reliability
  • Post-mortem required before resuming
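The budget arithmetic in the worked example above can be scripted directly (a sketch; function and constant names are illustrative):

```python
# Error budget arithmetic for the worked example: a 99.5% SLO
# measured over a 30-day window.

WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in 30 days

def error_budget_minutes(slo_target: float, window_minutes: int = WINDOW_MINUTES) -> float:
    """Allowed minutes of SLO violation per window: (1 - target) * window."""
    return (1 - slo_target) * window_minutes

print(round(error_budget_minutes(0.995)))  # 216 minutes of allowed violation per month
```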

Implementing Data SLIs with dbt and SQL

Freshness Check

```sql
-- models/monitoring/sli_freshness.sql
SELECT
    table_name,
    MAX(event_timestamp) AS latest_record,
    EXTRACT(EPOCH FROM (NOW() - MAX(event_timestamp))) / 60 AS lag_minutes,
    CASE
        WHEN EXTRACT(EPOCH FROM (NOW() - MAX(event_timestamp))) / 60 <= 15 THEN 'OK'
        WHEN EXTRACT(EPOCH FROM (NOW() - MAX(event_timestamp))) / 60 <= 30 THEN 'WARNING'
        ELSE 'BREACH'
    END AS slo_status
FROM (
    SELECT 'fact_geopolitical_events' AS table_name, event_timestamp
    FROM {{ ref('fact_geopolitical_events') }}
    UNION ALL
    SELECT 'agg_risk_scores_hourly', hour_timestamp
    FROM {{ ref('agg_risk_scores_hourly') }}
) t
GROUP BY table_name;
```

Completeness Check

```sql
-- tests/assert_completeness.sql
-- dbt test: compare row counts between source and target.
-- The test fails when more than `max_missing_rows` source records
-- are absent from the target.
{% set max_missing_rows = 100 %}

SELECT COUNT(*) AS missing_count
FROM (
    SELECT event_id FROM {{ source('raw', 'events') }}
    EXCEPT
    SELECT event_id FROM {{ ref('fact_geopolitical_events') }}
) missing
HAVING COUNT(*) > {{ max_missing_rows }};
```

dbt Generic Tests for Data Quality SLOs

```yaml
# models/schema.yml
version: 2

models:
  - name: fact_geopolitical_events
    description: "Core fact table for geopolitical events"
    meta:
      slo_freshness_minutes: 15
      slo_completeness_pct: 99.5
      slo_owner: "platform-team@harbinger.com"

    # dbt_utils.recency is a model-level test and takes `field` explicitly
    tests:
      - dbt_utils.recency:
          datepart: minute
          field: event_timestamp
          interval: 30  # must have data within the last 30 minutes

    columns:
      - name: event_id
        tests:
          - unique
          - not_null

      - name: country_code
        tests:
          - not_null
          - accepted_values:
              values: "{{ var('valid_country_codes') }}"

      - name: severity_score
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0.0
              max_value: 10.0

      - name: event_timestamp
        tests:
          - not_null
```

Architecture: SLO Monitoring Pipeline


Implementing the SLO Calculator

```python
# slo_calculator.py — runs as a Kubernetes CronJob every 5 minutes

import psycopg2
from datetime import datetime, timedelta, timezone

SLO_CONFIG = {
    "fact_geopolitical_events": {
        "freshness_slo_minutes": 15,
        "completeness_slo_pct": 99.5,
        "window_days": 30,
    },
    "agg_risk_scores_hourly": {
        "freshness_slo_minutes": 90,
        "completeness_slo_pct": 99.0,
        "window_days": 30,
    },
}

def calculate_error_budget(table: str, slo_pct: float, conn, window_days: int = 30) -> dict:
    """Calculate current error budget consumption for a table."""
    window_start = datetime.now(timezone.utc) - timedelta(days=window_days)

    with conn.cursor() as cur:
        cur.execute(
            "SELECT COUNT(*) AS total_measurements, "
            "SUM(CASE WHEN slo_met THEN 1 ELSE 0 END) AS passing_measurements "
            "FROM slo_measurements "
            "WHERE table_name = %s AND measured_at >= %s",
            (table, window_start),
        )
        total, passing = cur.fetchone()

    if not total:
        return {"table": table, "status": "NO_DATA"}

    actual_pct = 100.0 * passing / total
    allowed_failures = total * (1 - slo_pct / 100)
    actual_failures = total - passing

    # Guard against division by zero when the SLO target is 100%
    if allowed_failures == 0:
        budget_remaining_pct = 0.0 if actual_failures else 100.0
    else:
        budget_remaining_pct = max(
            0.0, (allowed_failures - actual_failures) / allowed_failures * 100
        )

    return {
        "table": table,
        "slo_target_pct": slo_pct,
        "actual_pct": round(actual_pct, 4),
        "budget_remaining_pct": budget_remaining_pct,
        "status": "OK" if budget_remaining_pct > 50
                  else "WARNING" if budget_remaining_pct > 0
                  else "EXHAUSTED",
    }
```

Alerting Strategy

Alert Fatigue is the Enemy

Don't alert on every SLI deviation — alert on error budget burn rate:

| Burn Rate | Alert Severity | Action |
|---|---|---|
| > 14.4x (30-day budget gone in ~2 days) | PAGE | Immediate incident response |
| > 6x (budget gone in ~5 days) | URGENT | Wake on-call within 30 min |
| > 3x (budget gone in ~10 days) | WARNING | Investigate within 1 hour |
| > 1x (on track to exhaust within the window) | INFO | Review in daily standup |
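The burn rate itself is just the observed failure rate divided by the rate the error budget allows. A sketch of that calculation and the severity mapping (thresholds follow the table above; names are illustrative):

```python
def burn_rate(bad_fraction: float, slo_target: float) -> float:
    """Observed failure rate divided by the failure rate the SLO allows."""
    allowed = 1 - slo_target
    if allowed <= 0:
        raise ValueError("A 100% SLO leaves no error budget to burn")
    return bad_fraction / allowed

def alert_severity(rate: float) -> str:
    """Map a burn rate to the alert severities in the table above."""
    if rate > 14.4:
        return "PAGE"
    if rate > 6:
        return "URGENT"
    if rate > 3:
        return "WARNING"
    if rate > 1:
        return "INFO"
    return "OK"

# A 99.5% SLO tolerates 0.5% bad measurements; if 7.5% of the last
# hour's freshness checks breached, the budget burns at ~15x:
rate = burn_rate(0.075, 0.995)
print(round(rate, 1), alert_severity(rate))  # 15.0 PAGE
```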
```yaml
# Prometheus alert rules for data SLOs
groups:
  - name: data_platform_slo
    rules:
      - alert: DataFreshnessBreachImmediate
        expr: |
          data_slo_error_budget_burn_rate{table="fact_geopolitical_events"} > 14.4
        for: 2m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Data SLO error budget burning fast: {{ $labels.table }}"
          description: |
            Table {{ $labels.table }} is burning error budget at {{ $value }}x rate.
            At this rate the 30-day budget will be exhausted in roughly two days.
            Runbook: https://wiki.harbinger.com/runbooks/data-slo-breach
```

Organisational Practices

SLO Review Cadence

| Cadence | Review |
|---|---|
| Daily | SLO dashboard review in standup (5 min) |
| Weekly | Error budget report: budget consumed, top incidents |
| Monthly | SLO target review: are targets still calibrated to business needs? |
| Quarterly | SLA negotiation with internal stakeholders |

The Data Incident Process

When an SLO breach occurs:

  1. Detection (automated alert fires)
  2. Triage (on-call assesses impact: which consumers affected, severity)
  3. Mitigation (restore data quality; may include backfill, reruns, or rollback)
  4. Communication (notify data consumers via status page update)
  5. Post-mortem (within 48 hours; timeline + root cause + action items)
  6. Action items (tracked in sprint; owner assigned, deadline set)

Post-Mortem Template

```markdown
## Incident: [Title]
**Date**: YYYY-MM-DD
**Duration**: X hours Y minutes
**SLO Breach**: Freshness SLO for `fact_geopolitical_events` (lag exceeded 15min for 3.2 hours)
**Impact**: Risk score dashboard showed stale data; 3 analyst teams affected

### Timeline
- 14:32 — Airflow DAG `event_enrichment` began; upstream API rate-limited
- 14:45 — Lag exceeded 15-minute SLO threshold; alert not fired (bug in alert config)
- 17:12 — On-call noticed stale dashboard during routine check
- 17:18 — Incident declared; pipeline rerun initiated
- 17:41 — Data freshness restored; SLO met

### Root Cause
Upstream news API began rate-limiting at 14:30 due to increased ingestion volume.
Retry logic used exponential backoff but max retries (3) were exhausted silently.

### Contributing Factors
- Alert misconfiguration: `for: 30m` instead of `for: 2m` on freshness alert
- No dead-letter monitoring for failed API calls

### Action Items
| Action | Owner | Due |
|---|---|---|
| Fix alert `for` duration to 2m | @platform-team | 2024-01-20 |
| Add DLQ monitoring for API ingestion | @platform-team | 2024-01-25 |
| Implement adaptive rate limit handling | @data-eng | 2024-02-01 |
```

SLA Tiers for Internal Consumers

Not all data consumers have the same requirements. Define tiers:

| Tier | Freshness | Completeness | Consumers |
|---|---|---|---|
| Platinum | ≤ 5 min | ≥ 99.9% | Executive dashboards, real-time alerts |
| Gold | ≤ 30 min | ≥ 99.5% | Operational dashboards, analyst reports |
| Silver | ≤ 4 hours | ≥ 99% | Ad-hoc analysis, data science exploration |
| Bronze | ≤ 24 hours | ≥ 98% | Historical archives, compliance reporting |

This tiering allows platform teams to allocate reliability investment proportionally — spending disproportionately on the pipelines that feed Platinum consumers.
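One practical consequence of tiering: a shared dataset inherits the strictest tier among its consumers. A sketch of that rule as a lookup a scheduler or catalog might consult (tier values mirror the table above; names are illustrative):

```python
# Tier targets from the table above, keyed for programmatic lookup.
TIERS = {
    "platinum": {"freshness_minutes": 5, "completeness_pct": 99.9},
    "gold": {"freshness_minutes": 30, "completeness_pct": 99.5},
    "silver": {"freshness_minutes": 240, "completeness_pct": 99.0},
    "bronze": {"freshness_minutes": 1440, "completeness_pct": 98.0},
}

def strictest_tier(consumer_tiers: list[str]) -> str:
    """A dataset inherits the strictest tier among its consumers."""
    order = ["platinum", "gold", "silver", "bronze"]
    return min(consumer_tiers, key=order.index)

# A dataset read by both silver and gold consumers must meet gold targets:
print(strictest_tier(["silver", "gold"]))  # gold
```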


Conclusion

Data SLAs done right transform vague complaints about "the data being wrong" into measurable, actionable commitments. The SLI/SLO/error budget framework gives platform teams the language to have honest conversations with stakeholders about the cost of reliability — and the mandate to protect it.

Platforms like Harbinger Explorer depend on this rigour to deliver intelligence that decision-makers can trust. When geopolitical risk scores feed real decisions, "the data was stale" is not an acceptable answer.


Try Harbinger Explorer free for 7 days — intelligence you can rely on, built on SLAs that mean something.

