
Data Strategy for Cloud Migrations: A Platform Engineer's Playbook

12 min read · Tags: cloud-migration, data-strategy, platform-engineering, terraform, data-pipelines

Cloud migration projects fail more often at the data layer than anywhere else. Networking, compute, and IAM get thorough attention — but data is often treated as an afterthought, moved in bulk the night before cutover, and prayed over. This guide exists to change that pattern.

Whether you're lifting a 50TB data warehouse from on-prem Oracle to BigQuery, re-platforming a Kafka estate from bare metal to Amazon MSK, or migrating a fleet of Spark ETL jobs to Databricks on Azure, the underlying data strategy questions remain the same: When do you move what? How do you validate it? And what's your rollback plan?


The Four Phases of a Data Migration

Before writing a single line of Terraform, map your migration to four discrete phases. Skipping phases is how projects end up with phantom data loss at 2 AM.


Phase 1 — Inventory & Classification

Every byte of data your systems produce falls into one of four categories:

| Classification | Description | Migration Risk | Example |
| --- | --- | --- | --- |
| Hot | Actively read/written, latency-sensitive | High | OLTP tables, event streams |
| Warm | Read frequently, written in batch | Medium | Aggregated reports, feature stores |
| Cold | Archived, rarely read | Low | Compliance archives, raw event logs |
| Transient | Cache, temp tables, in-flight state | N/A (rebuild) | Redis caches, Kafka consumer offsets |

Classify before you move anything. Hot data needs a live replication strategy. Cold data can be bulk-copied off-hours. Transient data is rebuilt on the target.

Use a combination of query logs, column-level lineage tools, and manual interviews with data consumers to produce this inventory. Harbinger Explorer can accelerate this by scanning metadata across multi-cloud estates and surfacing dependency graphs automatically.
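
As a sketch of how the inventory step can be scripted, the helper below assigns each table to one of the four buckets from query-log statistics. The thresholds and field names are illustrative assumptions, not a standard — tune them per estate.

```python
# Hypothetical classification helper: thresholds are assumptions to tune.
from dataclasses import dataclass

@dataclass
class TableStats:
    reads_per_day: int        # from query logs
    writes_per_day: int
    p99_read_latency_ms: float
    is_rebuildable: bool      # caches, temp tables, consumer offsets

def classify(stats: TableStats) -> str:
    """Map query-log stats onto the Hot/Warm/Cold/Transient buckets."""
    if stats.is_rebuildable:
        return "Transient"    # rebuild on the target, don't migrate
    if stats.writes_per_day > 1000 and stats.p99_read_latency_ms < 100:
        return "Hot"          # actively read/written, latency-sensitive
    if stats.reads_per_day > 100:
        return "Warm"         # read frequently, written in batch
    return "Cold"             # archives, rarely read

print(classify(TableStats(50_000, 20_000, 12.0, False)))  # Hot
print(classify(TableStats(5, 0, 900.0, False)))           # Cold
```

The output of a pass like this becomes the first column of the classification table above, with the lineage tooling and interviews filling in the dependency context the logs can't see.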

Phase 2 — Dual-Write & Shadow Mode

For hot data, never hard-cutover. Instead, enter a dual-write phase where writes land on both source and target systems simultaneously, and reads continue from the source.

```yaml
# Example: Debezium CDC connector for dual-write shadow replication
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: orders-cdc-shadow
  labels:
    strimzi.io/cluster: migration-connect
spec:
  class: io.debezium.connector.postgresql.PostgresConnector
  tasksMax: 4
  config:
    database.hostname: source-postgres.internal
    database.port: "5432"
    database.user: debezium_reader
    database.password: ${env:DEBEZIUM_PASSWORD}
    database.dbname: orders
    table.include.list: public.orders,public.order_items,public.customers
    slot.name: debezium_shadow_slot
    publication.autocreate.mode: filtered
    # Write to shadow topics for target ingestion
    # (topic.prefix supersedes the legacy database.server.name in Debezium 2.x)
    topic.prefix: shadow.migration
    transforms: Reroute
    transforms.Reroute.type: io.debezium.transforms.ByLogicalTableRouter
    transforms.Reroute.topic.regex: 'shadow\.migration\.public\.(.*)'
    transforms.Reroute.topic.replacement: 'target.ingest.$1'
```

During shadow mode, run reconciliation jobs on a schedule (hourly minimum) comparing row counts, checksums, and sample record comparisons between source and target.
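
One way to implement the checksum half of that scheduled check is sketched below. The row format is an assumption, and row-fetching is left abstract — in practice these iterables would be cursors over the source and target tables.

```python
# Minimal reconciliation sketch: hash a deterministic projection of each
# row and combine the digests order-independently, so source and target
# need not return rows in the same order.
import hashlib

def table_checksum(rows) -> str:
    """Order-independent digest: hash each row, XOR the digests together."""
    acc = 0
    for row in rows:
        h = hashlib.sha256("|".join(map(str, row)).encode()).digest()
        acc ^= int.from_bytes(h, "big")
    return f"{acc:064x}"

def reconcile(source_rows, target_rows) -> dict:
    src, tgt = list(source_rows), list(target_rows)
    return {
        "row_count_match": len(src) == len(tgt),
        "checksum_match": table_checksum(src) == table_checksum(tgt),
    }

rows = [(1, "shipped", 99.90), (2, "pending", 10.00)]
print(reconcile(rows, rows[::-1]))  # order-independent: both checks pass
```

For large tables you would run this per partition or per time window rather than whole-table, so an hourly cadence stays cheap.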

Phase 3 — Cutover & Validation

Cutover is not a moment — it's a window. Define it explicitly in your runbook:

```bash
#!/bin/bash
# migration-cutover.sh — execute in a tmux session with logging

set -euo pipefail
LOG_FILE="/var/log/migration/cutover-$(date +%Y%m%d-%H%M%S).log"

echo "=== CUTOVER START: $(date -u) ===" | tee -a "$LOG_FILE"

# 1. Drain write traffic to source
echo "Step 1: Enabling write-drain flag in feature flag service..." | tee -a "$LOG_FILE"
curl -X PATCH https://flags.internal/v1/flags/db_write_drain \
  -H "Content-Type: application/json" \
  -d '{"enabled": true}' | tee -a "$LOG_FILE"

# 2. Wait for in-flight transactions to settle
echo "Step 2: Waiting 30s for in-flight writes..." | tee -a "$LOG_FILE"
sleep 30

# 3. Final reconciliation check (set -o pipefail aborts the cutover on any diff)
echo "Step 3: Running final reconciliation..." | tee -a "$LOG_FILE"
python3 /opt/migration/reconcile.py \
  --source postgres://source-db \
  --target bigquery://project/dataset \
  --fail-on-diff | tee -a "$LOG_FILE"

# 4. Switch DNS / connection strings
echo "Step 4: Updating connection string secret in Vault..." | tee -a "$LOG_FILE"
vault kv put secret/db/orders \
  connection_string="postgresql://target-db.internal:5432/orders" \
  migrated_at="$(date -u +%Y-%m-%dT%H:%M:%SZ)" | tee -a "$LOG_FILE"

# 5. Enable reads from target
echo "Step 5: Flipping read flag..." | tee -a "$LOG_FILE"
curl -X PATCH https://flags.internal/v1/flags/db_read_source \
  -H "Content-Type: application/json" \
  -d '{"enabled": false}' | tee -a "$LOG_FILE"

echo "=== CUTOVER COMPLETE: $(date -u) ===" | tee -a "$LOG_FILE"
```

Phase 4 — Decommission & Observability

Don't decommission source systems until you have 30 days of clean production data flowing through the target. Set up cross-system observability:

```hcl
# Terraform: CloudWatch metric alarms for post-migration data quality
resource "aws_cloudwatch_metric_alarm" "data_freshness" {
  alarm_name          = "migration-data-freshness-breach"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "MaxAgeMinutes"
  namespace           = "DataPlatform/Migration"
  period              = 300
  statistic           = "Maximum"
  threshold           = 15
  alarm_description   = "Data freshness degraded post-migration — possible pipeline stall"
  alarm_actions       = [aws_sns_topic.oncall.arn]

  dimensions = {
    Dataset = "orders"
    Stage   = "production"
  }
}
```

Schema Evolution Strategy

Schema changes during migration are a complexity multiplier: every schema change becomes three problems — the source schema, the migration mapping, and the target schema.

Use a Schema Registry

Whether you're on Avro, Protobuf, or JSON Schema, run a schema registry on both sides of the migration and enforce backward compatibility:

```properties
# schema-registry.properties snippet (Confluent Schema Registry)
schema.compatibility.level=backward_transitive
```

BACKWARD_TRANSITIVE means consumers on any new schema version can read data written with every earlier version — critical when source and target consumers coexist during shadow mode and the target must ingest records produced under older schemas.
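
Before registering a new version, you can pre-flight it against the registry's REST compatibility endpoint. The sketch below assumes a Confluent Schema Registry at a hypothetical internal URL; the subject name and schema are placeholders.

```python
# Sketch: pre-flight compatibility check against Confluent Schema Registry's
# REST API. REGISTRY_URL and the subject name are hypothetical placeholders.
import json

REGISTRY_URL = "http://schema-registry.internal:8081"  # assumed endpoint

def compatibility_payload(avro_schema: dict) -> str:
    # The registry expects the candidate schema as an escaped JSON string.
    return json.dumps({"schema": json.dumps(avro_schema)})

def check_compatible(subject: str, avro_schema: dict) -> bool:
    import requests  # deferred import; only needed when actually calling out
    resp = requests.post(
        f"{REGISTRY_URL}/compatibility/subjects/{subject}/versions/latest",
        data=compatibility_payload(avro_schema),
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["is_compatible"]

new_schema = {
    "type": "record", "name": "Order",
    "fields": [
        {"name": "id", "type": "long"},
        # A new optional field with a default stays BACKWARD-compatible:
        {"name": "channel", "type": ["null", "string"], "default": None},
    ],
}
# check_compatible("orders-value", new_schema)
```

Wiring this into CI for the migration repo turns incompatible schema changes into failed builds instead of broken shadow-mode consumers.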

Column Mapping Patterns

| Source Pattern | Target Pattern | Migration Tool |
| --- | --- | --- |
| Camel case columns | Snake case | dbt rename macro |
| Implicit nullability | Explicit NOT NULL | Schema migration script |
| NUMERIC(18,4) | DECIMAL(18,4) | Type casting in Spark |
| Timestamp with TZ | UTC-normalized TIMESTAMP | Spark `from_utc_timestamp` |
| Composite PK | Surrogate key + composite index | dbt snapshot |
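
The camel-to-snake rename in the first row can be generated once and fed to whatever tool applies it. The regex below is a sketch of that mapping step — pure string logic, with the Spark application shown only as a comment:

```python
# Sketch: build a camelCase -> snake_case column mapping that a Spark
# select or a dbt rename macro can consume.
import re

def to_snake_case(name: str) -> str:
    """orderId -> order_id, HTTPStatus -> http_status."""
    s = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", s)
    return s.lower()

def rename_mapping(columns):
    return {c: to_snake_case(c) for c in columns}

print(rename_mapping(["orderId", "createdAt", "HTTPStatus"]))
# With PySpark the mapping would be applied as:
#   df.select([col(old).alias(new) for old, new in mapping.items()])
```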

Data Validation Framework

The gold standard is a three-tier validation approach:

  1. Structural — Schema matches, no missing columns, types compatible
  2. Statistical — Row counts, null rates, value distributions within tolerance
  3. Semantic — Business rules hold (e.g., order total = sum of line items)

```python
# Lightweight reconciliation using PySpark (a Great Expectations alternative).
# Note: dbutils is Databricks-only; swap in your own secret store elsewhere.
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, sum as spark_sum

spark = SparkSession.builder.appName("MigrationReconcile").getOrCreate()

source_df = spark.read.format("jdbc").options(
    url="jdbc:postgresql://source-db:5432/orders",
    dbtable="public.orders",
    user="reader",
    password=dbutils.secrets.get("migration", "source-db-password"),
).load()

target_df = spark.read.format("bigquery").option("table", "project.dataset.orders").load()

# Structural check
assert set(source_df.columns) == set(target_df.columns), "Column mismatch!"

# Statistical check
source_stats = source_df.agg(
    count("*").alias("row_count"),
    spark_sum("total_amount").alias("total_amount_sum"),
).collect()[0]

target_stats = target_df.agg(
    count("*").alias("row_count"),
    spark_sum("total_amount").alias("total_amount_sum"),
).collect()[0]

tolerance = 0.001  # 0.1% tolerance
row_diff_pct = abs(source_stats["row_count"] - target_stats["row_count"]) / source_stats["row_count"]
sum_diff_pct = abs(source_stats["total_amount_sum"] - target_stats["total_amount_sum"]) / source_stats["total_amount_sum"]

assert row_diff_pct < tolerance, f"Row count divergence: {row_diff_pct:.4%}"
assert sum_diff_pct < tolerance, f"Sum divergence: {sum_diff_pct:.4%}"

print("✅ Reconciliation passed")
```

Rollback Planning

Every migration phase needs a rollback procedure documented before cutover begins. A rollback that hasn't been rehearsed in a staging environment is not a rollback plan — it's a wish.

| Phase | Rollback Trigger | Rollback Action | RTO |
| --- | --- | --- | --- |
| Shadow mode | Replication lag > 5 min | Disable CDC, fix connector | 10 min |
| Cutover | Error rate > 1% | Revert feature flags | 2 min |
| Post-cutover | Data quality breach | Re-enable source, re-open shadow | 15 min |
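
The shadow-mode trigger in that table can be automated. The sketch below assumes a hypothetical Connect URL and connector name; pausing via `PUT /connectors/{name}/pause` is the standard Kafka Connect REST endpoint.

```python
# Sketch of the shadow-mode rollback trigger: if CDC replication lag
# breaches the threshold, pause the connector via the Connect REST API.
CONNECT_URL = "http://migration-connect.internal:8083"  # assumed placeholder
LAG_THRESHOLD_SECONDS = 300  # "replication lag > 5 min" from the table

def should_rollback(lag_seconds: float, threshold: float = LAG_THRESHOLD_SECONDS) -> bool:
    """Pure decision function, so the trigger logic is unit-testable."""
    return lag_seconds > threshold

def pause_connector(name: str) -> None:
    import requests  # deferred; only needed when the trigger actually fires
    requests.put(f"{CONNECT_URL}/connectors/{name}/pause", timeout=10).raise_for_status()

lag = 412.0  # in practice: from connector metrics / consumer-group lag
if should_rollback(lag):
    print(f"lag {lag}s over threshold -> pausing orders-cdc-shadow")
    # pause_connector("orders-cdc-shadow")
```

Keeping the decision separate from the side effect means the same check can run in a monitor that pages a human instead of acting autonomously, which is the safer default during the first days of shadow mode.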

Observability Stack for Migration Projects

Post-migration, your observability should answer: "Is the new platform delivering data with the same quality and freshness as the old one?"

Instrument three signal types:

  • Pipeline latency — p50/p95/p99 end-to-end job duration
  • Data freshness — max age of the latest record in critical tables
  • Error rate — failed job runs as a percentage of total runs

If you're managing multiple migrated workloads across teams, a platform-level view becomes essential. Tools like Harbinger Explorer give you a unified operational view across cloud data assets without requiring per-team instrumentation overhead.


Conclusion

A cloud migration data strategy isn't a one-time document — it's a living operational practice spanning months of careful, phased execution. The teams that succeed treat data migration as a product delivery: they define acceptance criteria, run automated validation, and plan for failure.

The key takeaways:

  • Classify data before moving any of it
  • Use dual-write shadow mode for hot data; never hard-cutover
  • Automate reconciliation — manual spot checks don't scale
  • Define rollback procedures and rehearse them
  • Stay in observability mode for 30 days post-cutover before decommissioning

Try Harbinger Explorer free for 7 days — get unified visibility across your cloud data estate, track migration progress across teams, and catch data quality issues before they reach production.

