Disaster Recovery for Data Platforms: RPO, RTO, and Runbooks That Actually Work

13 min read · Tags: disaster-recovery, data-platform, rpo-rto, multi-region, backup, resilience

Most data platform DR plans exist as a PDF in a shared drive that was last opened during an audit. When a region goes down and an on-call engineer needs to execute a recovery, they discover the runbook references systems that were renamed eight months ago and assumes access to a bastion host that was decommissioned in a cost-cutting initiative.

This guide is about building DR for data platforms that actually works when you need it.


Start with Failure Mode Analysis

Before designing any recovery mechanism, systematically enumerate failure modes. Data platforms fail in ways that differ from transactional systems:

| Component | Failure Mode | Impact | Detection Signal |
|---|---|---|---|
| Object storage (S3/GCS) | Regional outage | Complete data lake unavailability | CloudWatch/Cloud Monitoring alerts |
| Data warehouse | Compute failure | Query unavailability (data intact) | Warehouse health endpoint |
| Streaming brokers (Kafka) | Broker loss | Consumer lag, message loss risk | Lag monitoring, broker count |
| Orchestrator (Airflow) | Metadata DB failure | No new pipeline runs | Heartbeat monitoring |
| ETL compute (Spark/Databricks) | Cluster provisioning failure | Pipeline backlog | Job queue depth |
| Schema registry | Unavailability | Producer/consumer serialization failure | Registry health check |
| Data catalog / lineage | Outage | Loss of discovery (not data loss) | Catalog health endpoint |

Not all failure modes require DR. Some (warehouse compute failure, orchestrator outage) are availability problems, not data recovery problems. Separate these — they have different playbooks.


Defining RPO and RTO by Data Tier

Don't set a single RPO/RTO for your entire platform. Define tiers based on business criticality:

[Diagram: data tier definitions — RPO/RTO targets by business criticality]

Assign every dataset in your catalog to a tier. This assignment drives replication frequency, backup retention, and recovery priority order. Harbinger Explorer can help maintain this classification at scale by tracking dataset criticality alongside operational metadata.
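One way to make tier assignments executable is a small version-controlled registry that drives replication cadence. A minimal sketch — the tier boundaries, dataset names, and RPO/RTO numbers below are illustrative, not prescriptive:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    rpo_minutes: int  # maximum tolerable data loss
    rto_minutes: int  # maximum tolerable recovery time

# Illustrative tiers -- real values come from business requirements
TIERS = {
    "tier1": Tier("tier1", rpo_minutes=15, rto_minutes=60),
    "tier2": Tier("tier2", rpo_minutes=240, rto_minutes=480),
    "tier3": Tier("tier3", rpo_minutes=1440, rto_minutes=2880),
}

# Hypothetical dataset -> tier mapping, kept in version control
DATASET_TIERS = {
    "orders": "tier1",
    "customers": "tier1",
    "web_clickstream": "tier2",
    "ml_feature_snapshots": "tier3",
}

def replication_interval_minutes(dataset: str) -> int:
    """Schedule replication at least twice per RPO window to leave slack."""
    tier = TIERS[DATASET_TIERS[dataset]]
    return max(1, tier.rpo_minutes // 2)
```

With this in place, `replication_interval_minutes("orders")` yields a 7-minute cadence against a 15-minute RPO, while a tier2 dataset replicates every 120 minutes.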


Object Storage Replication

Cross-Region Replication

For AWS S3:

# Terraform: S3 cross-region replication for data lake
resource "aws_s3_bucket" "data_lake_primary" {
  bucket = "company-data-lake-us-east-1"
}

# Versioning is required on both buckets for replication
# (separate resource per AWS provider v4+ syntax)
resource "aws_s3_bucket_versioning" "data_lake_primary" {
  bucket = aws_s3_bucket.data_lake_primary.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket" "data_lake_replica" {
  provider = aws.us_west_2
  bucket   = "company-data-lake-us-west-2"
}

resource "aws_s3_bucket_versioning" "data_lake_replica" {
  provider = aws.us_west_2
  bucket   = aws_s3_bucket.data_lake_replica.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Note: this role also needs an attached policy granting
# s3:GetReplicationConfiguration and s3:ListBucket on the source bucket,
# s3:GetObjectVersionForReplication on source objects, and
# s3:ReplicateObject / s3:ReplicateDelete on the destination.
resource "aws_iam_role" "replication_role" {
  name = "s3-replication-role"
  
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "s3.amazonaws.com" }
    }]
  })
}

resource "aws_s3_bucket_replication_configuration" "data_lake" {
  role   = aws_iam_role.replication_role.arn
  bucket = aws_s3_bucket.data_lake_primary.id

  rule {
    id     = "replicate-tier1-data"
    status = "Enabled"

    filter {
      prefix = "tier1/"
    }

    destination {
      bucket        = aws_s3_bucket.data_lake_replica.arn
      storage_class = "STANDARD"
      
      replication_time {
        status = "Enabled"
        time {
          minutes = 15
        }
      }
      
      metrics {
        status = "Enabled"
        event_threshold {
          minutes = 15
        }
      }
    }

    delete_marker_replication {
      status = "Enabled"
    }
  }

  rule {
    id     = "replicate-tier2-data"
    status = "Enabled"

    filter {
      prefix = "tier2/"
    }

    destination {
      bucket        = aws_s3_bucket.data_lake_replica.arn
      storage_class = "STANDARD_IA"
    }
  }
}

Point-in-Time Recovery with S3 Versioning

Enable versioning on all Tier 1 and Tier 2 buckets, and enforce lifecycle rules to cap version retention costs. With versioning in place, a prefix can be restored to a point in time:

#!/bin/bash
# Restore a specific S3 prefix to a point in time
set -euo pipefail

BUCKET="company-data-lake-us-east-1"
PREFIX="tier1/orders/2024/"
TARGET_TIME="2024-03-15T10:00:00Z"
RESTORE_BUCKET="company-data-lake-restore-us-east-1"

# List all versions at or before the target time, newest first,
# then keep only the most recent version per key.
aws s3api list-object-versions \
  --bucket "$BUCKET" \
  --prefix "$PREFIX" \
  --query "reverse(sort_by(Versions[?LastModified<='${TARGET_TIME}'], &LastModified))[].[Key,VersionId]" \
  --output text \
| awk '!seen[$1]++' \
| while read -r key version_id; do
    # Copy that specific version into the restore bucket
    aws s3api copy-object \
      --copy-source "${BUCKET}/${key}?versionId=${version_id}" \
      --bucket "$RESTORE_BUCKET" \
      --key "$key"
    echo "Restored: $key (version: $version_id)"
done
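The heart of the restore above is "the newest version of each key at or before the target time". That selection logic can be sketched standalone — the version records below are made up for illustration:

```python
from datetime import datetime, timezone

def select_restore_versions(versions, target_time):
    """For each key, pick the newest version at or before target_time."""
    chosen = {}
    for v in versions:
        if v["LastModified"] > target_time:
            continue  # version written after the restore point
        prev = chosen.get(v["Key"])
        if prev is None or v["LastModified"] > prev["LastModified"]:
            chosen[v["Key"]] = v
    return chosen

versions = [
    {"Key": "tier1/orders/a.parquet", "VersionId": "v1",
     "LastModified": datetime(2024, 3, 14, 9, 0, tzinfo=timezone.utc)},
    {"Key": "tier1/orders/a.parquet", "VersionId": "v2",
     "LastModified": datetime(2024, 3, 15, 12, 0, tzinfo=timezone.utc)},  # too new
    {"Key": "tier1/orders/b.parquet", "VersionId": "v3",
     "LastModified": datetime(2024, 3, 15, 8, 0, tzinfo=timezone.utc)},
]
target = datetime(2024, 3, 15, 10, 0, tzinfo=timezone.utc)
picked = select_restore_versions(versions, target)
# a.parquet restores to v1; b.parquet restores to v3
```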

Kafka Disaster Recovery

For DR purposes, Kafka is the highest-risk component in most data platforms: message loss during a broker outage is permanent without proper replication.

MirrorMaker 2 for Cross-Region Replication

# MirrorMaker 2 configuration for active-passive DR
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: mm2-dr-replication
  namespace: kafka
spec:
  version: 3.7.0
  replicas: 3
  connectCluster: dr-region
  
  clusters:
    - alias: primary
      bootstrapServers: kafka.primary-region.internal:9093
      tls:
        trustedCertificates:
          - secretName: primary-cluster-ca-cert
            certificate: ca.crt
      authentication:
        type: tls
        certificateAndKey:
          secretName: mm2-primary-user
          certificate: user.crt
          key: user.key
          
    - alias: dr-region
      bootstrapServers: kafka.dr-region.internal:9093
      tls:
        trustedCertificates:
          - secretName: dr-cluster-ca-cert
            certificate: ca.crt
      authentication:
        type: tls
        certificateAndKey:
          secretName: mm2-dr-user
          certificate: user.crt
          key: user.key

  mirrors:
    - sourceCluster: primary
      targetCluster: dr-region
      sourceConnector:
        tasksMax: 10
        config:
          replication.factor: 3
          offset-syncs.topic.replication.factor: 3
          sync.topic.acls.enabled: "false"
          replication.policy.class: org.apache.kafka.connect.mirror.IdentityReplicationPolicy
      heartbeatConnector:
        config:
          heartbeats.topic.replication.factor: 3
      checkpointConnector:
        config:
          checkpoints.topic.replication.factor: 3
          sync.group.offsets.enabled: "true"
          sync.group.offsets.interval.seconds: "60"
      topicsPattern: "tier1.*|tier2.*"
      groupsPattern: ".*"

The IdentityReplicationPolicy preserves topic names without prefixing — critical for consumer group offset synchronization during failover.
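Offset translation works because MirrorMaker 2's checkpoints map committed upstream offsets to their downstream equivalents. A simplified sketch of that lookup (the checkpoint tuples and topic names are illustrative, not MM2's actual wire format):

```python
def translate_offset(checkpoints, topic, partition, primary_offset):
    """Resume position on the DR cluster: take the downstream offset of the
    closest checkpoint at or below the consumer's committed upstream offset.
    checkpoints: list of (topic, partition, upstream_offset, downstream_offset).
    """
    best = None
    for t, p, up, down in checkpoints:
        if t == topic and p == partition and up <= primary_offset:
            if best is None or up > best[0]:
                best = (up, down)
    if best is None:
        return 0  # no checkpoint yet: replay from the start to avoid loss
    return best[1]

checkpoints = [
    ("tier1.orders", 0, 100, 95),
    ("tier1.orders", 0, 200, 190),
]
# Consumer committed offset 150 on primary -> resume at 95 on DR.
# Checkpoint 200 is past the committed offset, so it is not used;
# a small amount of reprocessing is the price of at-least-once failover.
```

This is also why a stale checkpoint topic is dangerous: the larger the gap between checkpoints, the more the DR consumers reprocess after failover.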

Kafka Failover Runbook

#!/bin/bash
# kafka-failover-runbook.sh
# Prerequisites: kubectl access to DR cluster, mm2 sync lag < retention period

set -euo pipefail

echo "=== KAFKA FAILOVER RUNBOOK ==="
echo "Timestamp: $(date -u)"
echo ""

# Step 1: Verify MirrorMaker2 offset sync is current
echo "Step 1: Checking consumer group offset sync lag..."
kubectl exec -n kafka deploy/kafka-toolbox -- \
  kafka-consumer-groups.sh \
    --bootstrap-server kafka.dr-region.internal:9092 \
    --group data-platform-consumer \
    --describe | grep -E "TOPIC|LAG"

read -p "Is lag acceptable? (y/n): " lag_ok
[[ $lag_ok != "y" ]] && echo "ABORT: Lag too high for safe failover" && exit 1

# Step 2: Stop producers on primary (if accessible)
echo "Step 2: Setting producer circuit breaker flag..."
curl -X PATCH https://config.internal/v1/features/kafka_producer_enabled \
  -d '{"region":"primary","value":false}' || echo "WARNING: Could not reach config service"

# Step 3: Wait for in-flight messages to replicate
echo "Step 3: Waiting 60s for final messages to replicate..."
sleep 60

# Step 4: Translate consumer offsets
echo "Step 4: Translating consumer group offsets..."
kubectl exec -n kafka deploy/kafka-toolbox -- \
  kafka-consumer-groups.sh \
    --bootstrap-server kafka.dr-region.internal:9092 \
    --reset-offsets \
    --group data-platform-consumer \
    --from-file /checkpoints/primary.data-platform-consumer.offsets \
    --execute

# Step 5: Update DNS to point consumers to DR cluster
echo "Step 5: Updating Kafka bootstrap server in Vault..."
vault kv put secret/kafka/bootstrap \
  servers="kafka.dr-region.internal:9093" \
  region="dr" \
  failover_time="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

echo "=== FAILOVER COMPLETE ==="
echo "Monitor consumer lag at: https://monitoring.internal/d/kafka-lag"

Data Warehouse Recovery

For cloud data warehouses (BigQuery, Snowflake, Redshift), recovery of the compute layer is typically automatic. The data recovery concerns are:

  1. Accidental deletion / DROP TABLE — handled by Time Travel
  2. Corruption via bad ETL — handled by snapshots + Time Travel
  3. Regional outage — handled by cross-region backup exports
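Time Travel windows are finite, so a runbook's first check should be whether the pre-incident state is still inside the window. A minimal sketch, assuming retention is known per table (the retention values below are examples, not vendor defaults):

```python
from datetime import datetime, timedelta, timezone

def time_travel_recoverable(incident_time, now, retention_days):
    """True if the pre-incident state is still inside the Time Travel window."""
    return now - incident_time <= timedelta(days=retention_days)

now = datetime(2024, 3, 20, tzinfo=timezone.utc)
incident = datetime(2024, 3, 18, tzinfo=timezone.utc)  # discovered 2 days later

within_window = time_travel_recoverable(incident, now, retention_days=7)   # True
past_window = time_travel_recoverable(incident, now, retention_days=1)     # False
```

When the check fails, the runbook falls back to the next mechanism in the list: restoring from the cross-region backup exports.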

Automated Snapshot Export (BigQuery)

#!/bin/bash
# Daily export of critical tables to DR bucket
set -euo pipefail

TABLES=("orders" "customers" "products" "transactions")
PROJECT="company-prod"
DATASET="analytics"
DR_BUCKET="gs://company-bq-dr-us-west1"
DATE=$(date +%Y/%m/%d)

for TABLE in "${TABLES[@]}"; do
  # Note: field delimiters only apply to CSV exports, not Parquet
  bq extract \
    --destination_format PARQUET \
    --compression SNAPPY \
    "${PROJECT}:${DATASET}.${TABLE}" \
    "${DR_BUCKET}/${TABLE}/${DATE}/${TABLE}_*.parquet"

  echo "Exported ${TABLE} to ${DR_BUCKET}/${TABLE}/${DATE}/"
done

Testing Your DR Plan

An untested DR plan is not a DR plan. Run quarterly DR drills with real escalation paths:

| Test Type | Frequency | Scope | Success Criterion |
|---|---|---|---|
| Tabletop exercise | Monthly | Team reads through runbook | All steps understood, owner per step |
| Component restore test | Quarterly | Restore one non-critical dataset | RTO met, data verified |
| Regional failover drill | Semi-annual | Full DR region activation | RTO met, consumers switched |
| Chaos injection | Quarterly | Inject failure in staging | System self-heals or alerts within SLA |

Document every drill result. A DR plan that consistently meets its RTO in drills has earned stakeholder trust.


Building a DR Dashboard

Operational visibility into DR readiness should be continuous, not just during drills. Monitor:

  • Replication lag (S3, Kafka, DB)
  • Last successful backup timestamp per dataset
  • Time Travel window remaining
  • Cross-region connectivity health
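The core alerting rule behind such a dashboard is simple: compare replication lag against each tier's RPO budget and page before the budget is exhausted. A hedged sketch (function name and thresholds are illustrative):

```python
from datetime import datetime, timedelta, timezone

def rpo_headroom_minutes(last_replicated_at, now, rpo_minutes):
    """Minutes of RPO budget remaining; a negative value means a breach."""
    lag_minutes = (now - last_replicated_at).total_seconds() / 60
    return rpo_minutes - lag_minutes

now = datetime(2024, 3, 15, 10, 0, tzinfo=timezone.utc)

# Replicated 5 minutes ago against a 15-minute RPO: 10 minutes of headroom
healthy = rpo_headroom_minutes(now - timedelta(minutes=5), now, rpo_minutes=15)

# Replicated 20 minutes ago against the same RPO: breached by 5 minutes
breached = rpo_headroom_minutes(now - timedelta(minutes=20), now, rpo_minutes=15)
```

In practice the alert would fire at some fraction of the budget (say, 50% headroom remaining), not at zero, so the team has time to act before the RPO is actually violated.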

Platforms like Harbinger Explorer surface these signals as a unified DR health score, giving platform teams early warning when replication falls behind before it becomes a DR event.


Summary

Data platform disaster recovery is a discipline, not a one-time design. The platforms that recover well are the ones that:

  • Classified their data before designing recovery mechanisms
  • Implemented replication at every layer (storage, streaming, warehouse)
  • Wrote runbooks with actual commands, not prose
  • Tested regularly and documented results

Treat your DR plan as living infrastructure — version-controlled, executable, and continuously validated.


Try Harbinger Explorer free for 7 days — monitor your data platform's DR readiness in real time, track replication health, and get alerted before RPO windows are breached.

