Disaster Recovery for Data Platforms: RPO, RTO, and Runbooks That Actually Work

13 min read · Tags: disaster-recovery, data-platform, rpo-rto, multi-region, backup, resilience

Most data platform DR plans exist as a PDF in a shared drive that was last opened during an audit. When a region goes down and an on-call engineer needs to execute a recovery, they discover the runbook references systems that were renamed eight months ago and assumes access to a bastion host that was decommissioned in a cost-cutting initiative.

This guide is about building DR for data platforms that actually works when you need it.


Start with Failure Mode Analysis

Before designing any recovery mechanism, systematically enumerate failure modes. Data platforms fail in ways that differ from transactional systems:

| Component | Failure Mode | Impact | Detection Signal |
|---|---|---|---|
| Object storage (S3/GCS) | Regional outage | Complete data lake unavailability | CloudWatch/Cloud Monitoring alerts |
| Data warehouse | Compute failure | Query unavailability (data intact) | Warehouse health endpoint |
| Streaming brokers (Kafka) | Broker loss | Consumer lag, message loss risk | Lag monitoring, broker count |
| Orchestrator (Airflow) | Metadata DB failure | No new pipeline runs | Heartbeat monitoring |
| ETL compute (Spark/Databricks) | Cluster provisioning failure | Pipeline backlog | Job queue depth |
| Schema registry | Unavailability | Producer/consumer serialization failure | Registry health check |
| Data catalog / lineage | Outage | Loss of discovery (not data loss) | Catalog health endpoint |

Not all failure modes require DR. Some (warehouse compute failure, orchestrator outage) are availability problems, not data recovery problems. Separate these — they have different playbooks.


Defining RPO and RTO by Data Tier

Don't set a single RPO/RTO for your entire platform. Define tiers based on business criticality:

[Diagram: data tier definitions — RPO/RTO targets by business criticality]

Assign every dataset in your catalog to a tier. This assignment drives replication frequency, backup retention, and recovery priority order. Harbinger Explorer can help maintain this classification at scale by tracking dataset criticality alongside operational metadata.
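One way to make tier assignments executable is a small version-controlled registry that drives replication cadence. A minimal sketch — the tier boundaries, dataset names, and RPO/RTO numbers below are illustrative, not prescriptive:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    rpo_minutes: int  # maximum tolerable data loss
    rto_minutes: int  # maximum tolerable recovery time

# Illustrative tiers -- real values come from business requirements
TIERS = {
    "tier1": Tier("tier1", rpo_minutes=15, rto_minutes=60),
    "tier2": Tier("tier2", rpo_minutes=240, rto_minutes=480),
    "tier3": Tier("tier3", rpo_minutes=1440, rto_minutes=2880),
}

# Hypothetical dataset -> tier mapping, kept in version control
DATASET_TIERS = {
    "orders": "tier1",
    "customers": "tier1",
    "web_clickstream": "tier2",
    "ml_feature_snapshots": "tier3",
}

def replication_interval_minutes(dataset: str) -> int:
    """Schedule replication at least twice per RPO window to leave slack."""
    tier = TIERS[DATASET_TIERS[dataset]]
    return max(1, tier.rpo_minutes // 2)
```

With this in place, `replication_interval_minutes("orders")` yields a 7-minute cadence against a 15-minute RPO, while a tier2 dataset replicates every 120 minutes.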


Object Storage Replication

Cross-Region Replication

For AWS S3:

# Terraform: S3 cross-region replication for data lake
resource "aws_s3_bucket" "data_lake_primary" {
  bucket = "company-data-lake-us-east-1"
}

# Versioning is required on both buckets for replication
# (separate resource per AWS provider v4+ syntax)
resource "aws_s3_bucket_versioning" "data_lake_primary" {
  bucket = aws_s3_bucket.data_lake_primary.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket" "data_lake_replica" {
  provider = aws.us_west_2
  bucket   = "company-data-lake-us-west-2"
}

resource "aws_s3_bucket_versioning" "data_lake_replica" {
  provider = aws.us_west_2
  bucket   = aws_s3_bucket.data_lake_replica.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Note: this role also needs an attached policy granting
# s3:GetReplicationConfiguration and s3:ListBucket on the source bucket,
# s3:GetObjectVersionForReplication on source objects, and
# s3:ReplicateObject / s3:ReplicateDelete on the destination.
resource "aws_iam_role" "replication_role" {
  name = "s3-replication-role"
  
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "s3.amazonaws.com" }
    }]
  })
}

resource "aws_s3_bucket_replication_configuration" "data_lake" {
  role   = aws_iam_role.replication_role.arn
  bucket = aws_s3_bucket.data_lake_primary.id

  rule {
    id     = "replicate-tier1-data"
    status = "Enabled"

    filter {
      prefix = "tier1/"
    }

    destination {
      bucket        = aws_s3_bucket.data_lake_replica.arn
      storage_class = "STANDARD"
      
      replication_time {
        status = "Enabled"
        time {
          minutes = 15
        }
      }
      
      metrics {
        status = "Enabled"
        event_threshold {
          minutes = 15
        }
      }
    }

    delete_marker_replication {
      status = "Enabled"
    }
  }

  rule {
    id     = "replicate-tier2-data"
    status = "Enabled"

    filter {
      prefix = "tier2/"
    }

    destination {
      bucket        = aws_s3_bucket.data_lake_replica.arn
      storage_class = "STANDARD_IA"
    }
  }
}

Point-in-Time Recovery with S3 Versioning

Enable versioning on all Tier 1 and Tier 2 buckets, and enforce lifecycle rules to cap version retention costs. With versioning in place, a prefix can be restored to a point in time:

#!/bin/bash
# Restore a specific S3 prefix to a point in time
set -euo pipefail

BUCKET="company-data-lake-us-east-1"
PREFIX="tier1/orders/2024/"
TARGET_TIME="2024-03-15T10:00:00Z"
RESTORE_BUCKET="company-data-lake-restore-us-east-1"

# List all versions at or before the target time, newest first,
# then keep only the most recent version per key.
aws s3api list-object-versions \
  --bucket "$BUCKET" \
  --prefix "$PREFIX" \
  --query "reverse(sort_by(Versions[?LastModified<='${TARGET_TIME}'], &LastModified))[].[Key,VersionId]" \
  --output text \
| awk '!seen[$1]++' \
| while read -r key version_id; do
    # Copy that specific version into the restore bucket
    aws s3api copy-object \
      --copy-source "${BUCKET}/${key}?versionId=${version_id}" \
      --bucket "$RESTORE_BUCKET" \
      --key "$key"
    echo "Restored: $key (version: $version_id)"
done
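The heart of the restore above is "the newest version of each key at or before the target time". That selection logic can be sketched standalone — the version records below are made up for illustration:

```python
from datetime import datetime, timezone

def select_restore_versions(versions, target_time):
    """For each key, pick the newest version at or before target_time."""
    chosen = {}
    for v in versions:
        if v["LastModified"] > target_time:
            continue  # version written after the restore point
        prev = chosen.get(v["Key"])
        if prev is None or v["LastModified"] > prev["LastModified"]:
            chosen[v["Key"]] = v
    return chosen

versions = [
    {"Key": "tier1/orders/a.parquet", "VersionId": "v1",
     "LastModified": datetime(2024, 3, 14, 9, 0, tzinfo=timezone.utc)},
    {"Key": "tier1/orders/a.parquet", "VersionId": "v2",
     "LastModified": datetime(2024, 3, 15, 12, 0, tzinfo=timezone.utc)},  # too new
    {"Key": "tier1/orders/b.parquet", "VersionId": "v3",
     "LastModified": datetime(2024, 3, 15, 8, 0, tzinfo=timezone.utc)},
]
target = datetime(2024, 3, 15, 10, 0, tzinfo=timezone.utc)
picked = select_restore_versions(versions, target)
# a.parquet restores to v1; b.parquet restores to v3
```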

Kafka Disaster Recovery

For DR purposes, Kafka is the highest-risk component in most data platforms: message loss during a broker outage is permanent without proper replication.

MirrorMaker 2 for Cross-Region Replication

# MirrorMaker 2 configuration for active-passive DR
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: mm2-dr-replication
  namespace: kafka
spec:
  version: 3.7.0
  replicas: 3
  connectCluster: dr-region
  
  clusters:
    - alias: primary
      bootstrapServers: kafka.primary-region.internal:9093
      tls:
        trustedCertificates:
          - secretName: primary-cluster-ca-cert
            certificate: ca.crt
      authentication:
        type: tls
        certificateAndKey:
          secretName: mm2-primary-user
          certificate: user.crt
          key: user.key
          
    - alias: dr-region
      bootstrapServers: kafka.dr-region.internal:9093
      tls:
        trustedCertificates:
          - secretName: dr-cluster-ca-cert
            certificate: ca.crt
      authentication:
        type: tls
        certificateAndKey:
          secretName: mm2-dr-user
          certificate: user.crt
          key: user.key

  mirrors:
    - sourceCluster: primary
      targetCluster: dr-region
      sourceConnector:
        tasksMax: 10
        config:
          replication.factor: 3
          offset-syncs.topic.replication.factor: 3
          sync.topic.acls.enabled: "false"
          replication.policy.class: org.apache.kafka.connect.mirror.IdentityReplicationPolicy
      heartbeatConnector:
        config:
          heartbeats.topic.replication.factor: 3
      checkpointConnector:
        config:
          checkpoints.topic.replication.factor: 3
          sync.group.offsets.enabled: "true"
          sync.group.offsets.interval.seconds: "60"
      topicsPattern: "tier1.*|tier2.*"
      groupsPattern: ".*"

The IdentityReplicationPolicy preserves topic names without prefixing — critical for consumer group offset synchronization during failover.
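Offset translation works because MirrorMaker 2's checkpoints map committed upstream offsets to their downstream equivalents. A simplified sketch of that lookup (the checkpoint tuples and topic names are illustrative, not MM2's actual wire format):

```python
def translate_offset(checkpoints, topic, partition, primary_offset):
    """Resume position on the DR cluster: take the downstream offset of the
    closest checkpoint at or below the consumer's committed upstream offset.
    checkpoints: list of (topic, partition, upstream_offset, downstream_offset).
    """
    best = None
    for t, p, up, down in checkpoints:
        if t == topic and p == partition and up <= primary_offset:
            if best is None or up > best[0]:
                best = (up, down)
    if best is None:
        return 0  # no checkpoint yet: replay from the start to avoid loss
    return best[1]

checkpoints = [
    ("tier1.orders", 0, 100, 95),
    ("tier1.orders", 0, 200, 190),
]
# Consumer committed offset 150 on primary -> resume at 95 on DR.
# Checkpoint 200 is past the committed offset, so it is not used;
# a small amount of reprocessing is the price of at-least-once failover.
```

This is also why a stale checkpoint topic is dangerous: the larger the gap between checkpoints, the more the DR consumers reprocess after failover.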

Kafka Failover Runbook

#!/bin/bash
# kafka-failover-runbook.sh
# Prerequisites: kubectl access to DR cluster, mm2 sync lag < retention period

set -euo pipefail

echo "=== KAFKA FAILOVER RUNBOOK ==="
echo "Timestamp: $(date -u)"
echo ""

# Step 1: Verify MirrorMaker2 offset sync is current
echo "Step 1: Checking consumer group offset sync lag..."
kubectl exec -n kafka deploy/kafka-toolbox -- \
  kafka-consumer-groups.sh \
    --bootstrap-server kafka.dr-region.internal:9092 \
    --group data-platform-consumer \
    --describe | grep -E "TOPIC|LAG"

read -p "Is lag acceptable? (y/n): " lag_ok
[[ $lag_ok != "y" ]] && echo "ABORT: Lag too high for safe failover" && exit 1

# Step 2: Stop producers on primary (if accessible)
echo "Step 2: Setting producer circuit breaker flag..."
curl -X PATCH https://config.internal/v1/features/kafka_producer_enabled \
  -d '{"region":"primary","value":false}' || echo "WARNING: Could not reach config service"

# Step 3: Wait for in-flight messages to replicate
echo "Step 3: Waiting 60s for final messages to replicate..."
sleep 60

# Step 4: Translate consumer offsets
echo "Step 4: Translating consumer group offsets..."
kubectl exec -n kafka deploy/kafka-toolbox -- \
  kafka-consumer-groups.sh \
    --bootstrap-server kafka.dr-region.internal:9092 \
    --reset-offsets \
    --group data-platform-consumer \
    --from-file /checkpoints/primary.data-platform-consumer.offsets \
    --execute

# Step 5: Update DNS to point consumers to DR cluster
echo "Step 5: Updating Kafka bootstrap server in Vault..."
vault kv put secret/kafka/bootstrap \
  servers="kafka.dr-region.internal:9093" \
  region="dr" \
  failover_time="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

echo "=== FAILOVER COMPLETE ==="
echo "Monitor consumer lag at: https://monitoring.internal/d/kafka-lag"

Data Warehouse Recovery

For cloud data warehouses (BigQuery, Snowflake, Redshift), recovery of the compute layer is typically automatic. The data recovery concerns are:

  1. Accidental deletion / DROP TABLE — handled by Time Travel
  2. Corruption via bad ETL — handled by snapshots + Time Travel
  3. Regional outage — handled by cross-region backup exports
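Time Travel windows are finite, so a runbook's first check should be whether the pre-incident state is still inside the window. A minimal sketch, assuming retention is known per table (the retention values below are examples, not vendor defaults):

```python
from datetime import datetime, timedelta, timezone

def time_travel_recoverable(incident_time, now, retention_days):
    """True if the pre-incident state is still inside the Time Travel window."""
    return now - incident_time <= timedelta(days=retention_days)

now = datetime(2024, 3, 20, tzinfo=timezone.utc)
incident = datetime(2024, 3, 18, tzinfo=timezone.utc)  # discovered 2 days later

within_window = time_travel_recoverable(incident, now, retention_days=7)   # True
past_window = time_travel_recoverable(incident, now, retention_days=1)     # False
```

When the check fails, the runbook falls back to the next mechanism in the list: restoring from the cross-region backup exports.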

Automated Snapshot Export (BigQuery)

#!/bin/bash
# Daily export of critical tables to DR bucket
set -euo pipefail

TABLES=("orders" "customers" "products" "transactions")
PROJECT="company-prod"
DATASET="analytics"
DR_BUCKET="gs://company-bq-dr-us-west1"
DATE=$(date +%Y/%m/%d)

for TABLE in "${TABLES[@]}"; do
  # Note: field delimiters only apply to CSV exports, not Parquet
  bq extract \
    --destination_format PARQUET \
    --compression SNAPPY \
    "${PROJECT}:${DATASET}.${TABLE}" \
    "${DR_BUCKET}/${TABLE}/${DATE}/${TABLE}_*.parquet"

  echo "Exported ${TABLE} to ${DR_BUCKET}/${TABLE}/${DATE}/"
done

Testing Your DR Plan

An untested DR plan is not a DR plan. Run quarterly DR drills with real escalation paths:

| Test Type | Frequency | Scope | Success Criterion |
|---|---|---|---|
| Tabletop exercise | Monthly | Team reads through runbook | All steps understood, owner per step |
| Component restore test | Quarterly | Restore one non-critical dataset | RTO met, data verified |
| Regional failover drill | Semi-annual | Full DR region activation | RTO met, consumers switched |
| Chaos injection | Quarterly | Inject failure in staging | System self-heals or alerts within SLA |

Document every drill result. A DR plan that consistently meets its RTO in drills has earned stakeholder trust.


Building a DR Dashboard

Operational visibility into DR readiness should be continuous, not just during drills. Monitor:

  • Replication lag (S3, Kafka, DB)
  • Last successful backup timestamp per dataset
  • Time Travel window remaining
  • Cross-region connectivity health
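The core alerting rule behind such a dashboard is simple: compare replication lag against each tier's RPO budget and page before the budget is exhausted. A hedged sketch (function name and thresholds are illustrative):

```python
from datetime import datetime, timedelta, timezone

def rpo_headroom_minutes(last_replicated_at, now, rpo_minutes):
    """Minutes of RPO budget remaining; a negative value means a breach."""
    lag_minutes = (now - last_replicated_at).total_seconds() / 60
    return rpo_minutes - lag_minutes

now = datetime(2024, 3, 15, 10, 0, tzinfo=timezone.utc)

# Replicated 5 minutes ago against a 15-minute RPO: 10 minutes of headroom
healthy = rpo_headroom_minutes(now - timedelta(minutes=5), now, rpo_minutes=15)

# Replicated 20 minutes ago against the same RPO: breached by 5 minutes
breached = rpo_headroom_minutes(now - timedelta(minutes=20), now, rpo_minutes=15)
```

In practice the alert would fire at some fraction of the budget (say, 50% headroom remaining), not at zero, so the team has time to act before the RPO is actually violated.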

Platforms like Harbinger Explorer surface these signals as a unified DR health score, giving platform teams early warning when replication falls behind before it becomes a DR event.


Summary

Data platform disaster recovery is a discipline, not a one-time design. The platforms that recover well are the ones that:

  • Classified their data before designing recovery mechanisms
  • Implemented replication at every layer (storage, streaming, warehouse)
  • Wrote runbooks with actual commands, not prose
  • Tested regularly and documented results

Treat your DR plan as living infrastructure — version-controlled, executable, and continuously validated.


Try Harbinger Explorer free for 7 days — monitor your data platform's DR readiness in real time, track replication health, and get alerted before RPO windows are breached.

