Harbinger Explorer

Hybrid Cloud Data Architecture Patterns

15 min read · Tags: hybrid-cloud, architecture, data-platform, networking, migration, terraform

Pure cloud-native is the goal. Reality is messier. Most enterprises operate a hybrid landscape: mainframes that will outlive everyone in the room, on-premise data centers with regulatory constraints, and cloud workloads growing at pace. Hybrid cloud data architecture is the discipline of making these work together—reliably, securely, and without drowning in operational complexity.

This guide covers the patterns that actually work in production hybrid environments.


Why Hybrid? (And Why It's Harder Than It Looks)

| Driver | Keeps Data On-Premise | Pushes Data to Cloud |
| --- | --- | --- |
| Latency | Real-time OT/IoT | Batch analytics |
| Regulation | GDPR, data residency | Dev/test workloads |
| Cost | Existing capex | Burst compute |
| Integration | Legacy system APIs | Modern SaaS |
| Security | Classified data | Non-sensitive workloads |

The Data Gravity Problem

Data gravity: data accumulates services around itself. A 10 TB on-premise Oracle database has decades of stored procedures, ETL jobs, and BI tools attached to it. You can't "just move it to the cloud." Hybrid architecture respects data gravity while gradually shifting value-add workloads cloud-ward.


Pattern 1: Cloud Bursting for Analytics

Keep your data sources on-premise; run heavy analytics in the cloud using ephemeral compute. Data flows one-way: on-prem → cloud for processing, results flow back.
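The burst trigger itself can be as simple as a backlog check against local capacity. A minimal Python sketch — the `ClusterState` fields and the threshold of 10 queued jobs are illustrative, not taken from any particular scheduler:

```python
from dataclasses import dataclass

@dataclass
class ClusterState:
    queued_jobs: int          # analytics jobs waiting in the scheduler
    on_prem_slots_free: int   # idle executor slots on-premise

def should_burst(state: ClusterState, burst_threshold: int = 10) -> bool:
    """Burst to cloud only when on-prem capacity is exhausted AND the
    backlog is deep enough to justify spinning up ephemeral compute.
    The threshold is a placeholder -- tune it against queue-latency SLOs."""
    return state.on_prem_slots_free == 0 and state.queued_jobs >= burst_threshold

# Saturated on-prem cluster with a deep backlog -> burst
print(should_burst(ClusterState(queued_jobs=25, on_prem_slots_free=0)))  # True
```

In practice this check runs on a schedule; the data path (on-prem → S3 → ephemeral compute) stays one-way regardless of the trigger.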


Terraform: AWS Site-to-Site VPN for Secure Transfer

resource "aws_customer_gateway" "on_prem" {
  bgp_asn    = 65000
  ip_address = var.on_prem_public_ip
  type       = "ipsec.1"

  tags = {
    Name = "on-prem-datacenter"
  }
}

resource "aws_vpn_gateway" "cloud" {
  vpc_id = aws_vpc.data_platform.id

  tags = {
    Name = "data-platform-vpn-gw"
  }
}

resource "aws_vpn_connection" "on_prem_to_cloud" {
  vpn_gateway_id      = aws_vpn_gateway.cloud.id
  customer_gateway_id = aws_customer_gateway.on_prem.id
  type                = "ipsec.1"
  static_routes_only  = false

  tunnel1_ike_versions              = ["ikev2"]
  tunnel1_phase1_encryption_algorithms = ["AES256"]
  tunnel1_phase1_integrity_algorithms  = ["SHA2-256"]
  tunnel1_phase2_encryption_algorithms = ["AES256-GCM-16"]
}

AWS DataSync for Bulk Transfer

# Create DataSync location for on-premise NFS
aws datasync create-location-nfs \
  --server-hostname on-prem-nas.internal \
  --subdirectory /data/exports \
  --on-prem-config AgentArns=arn:aws:datasync:us-east-1:123456789:agent/agent-0abc

# Create S3 destination
aws datasync create-location-s3 \
  --s3-bucket-arn arn:aws:s3:::my-platform-bronze \
  --subdirectory /on-prem-exports \
  --s3-config BucketAccessRoleArn=arn:aws:iam::123456789:role/DataSyncS3Role

# Create and start transfer task
aws datasync create-task \
  --source-location-arn arn:aws:datasync:...:location/loc-source \
  --destination-location-arn arn:aws:datasync:...:location/loc-dest \
  --name "nightly-export-to-s3" \
  --options VerifyMode=ONLY_FILES_TRANSFERRED,TransferMode=CHANGED

aws datasync start-task-execution \
  --task-arn arn:aws:datasync:...:task/task-0abc

Pattern 2: Federated Query

Query data in-place across cloud and on-premise systems without moving it. Best for ad-hoc analytics where data movement is too slow or expensive.


Trino Federated Query Config

# catalog/postgresql.properties (on-prem source)
connector.name=postgresql
connection-url=jdbc:postgresql://on-prem-pg.internal:5432/production
connection-user=${ENV:POSTGRES_USER}
connection-password=${ENV:POSTGRES_PASSWORD}

# catalog/hive.properties (cloud S3)
connector.name=hive
hive.metastore=glue
hive.metastore.glue.region=us-east-1
hive.s3.sse.enabled=true
hive.s3.sse.type=KMS
hive.s3.sse.kms-key-id=alias/data-platform-prod

# catalog/iceberg.properties (cloud lakehouse)
connector.name=iceberg
iceberg.catalog.type=glue
iceberg.file-format=PARQUET

-- Federated join: cloud aggregates + on-prem customer master
SELECT
    c.customer_name,
    c.segment,
    s.total_revenue_30d,
    s.order_count_30d
FROM postgresql.public.customers c
JOIN (
    SELECT
        customer_id,
        SUM(amount) AS total_revenue_30d,
        COUNT(*) AS order_count_30d
    FROM hive.gold.daily_order_summary
    WHERE dt >= CURRENT_DATE - INTERVAL '30' DAY
    GROUP BY 1
) s ON c.id = s.customer_id
WHERE c.region = 'EMEA'
ORDER BY s.total_revenue_30d DESC
LIMIT 100;

Pattern 3: Event-Driven Synchronization

For bi-directional sync between on-premise and cloud, use event streaming as the backbone. Both sides publish and consume events; neither is the master.
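With no single master, both sides will occasionally write the same entity, so the sync needs a deterministic conflict policy. A minimal last-writer-wins sketch — the field names and the tie-break rule (on-prem wins ties as system of record) are illustrative assumptions:

```python
from datetime import datetime, timezone

def merge_events(on_prem: dict, cloud: dict) -> dict:
    """Last-writer-wins merge keyed on the event timestamp. Assumes both
    sides stamp events with a comparable UTC 'updated_at'; ties go to
    on-prem as the system of record (an assumption, not a standard)."""
    return cloud if cloud["updated_at"] > on_prem["updated_at"] else on_prem

a = {"sku": "X1", "qty": 4, "updated_at": datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)}
b = {"sku": "X1", "qty": 7, "updated_at": datetime(2024, 5, 1, 12, 5, tzinfo=timezone.utc)}
print(merge_events(a, b)["qty"])  # 7 -- the later cloud write wins
```

Whatever policy you pick, both sides must apply it identically, or the clusters will silently diverge.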


Kafka MirrorMaker 2 Config

# mm2.properties
clusters = on-prem, cloud
on-prem.bootstrap.servers = kafka-on-prem.internal:9092
cloud.bootstrap.servers = kafka-cloud.us-east-1.amazonaws.com:9094

# Replicate on-prem → cloud
on-prem->cloud.enabled = true
on-prem->cloud.topics = orders\..*,inventory\..*,products\..*
# ('topics.blacklist' on Kafka versions before 3.0)
on-prem->cloud.topics.exclude = .*\.internal

# Replicate cloud → on-prem (predictions only)
cloud->on-prem.enabled = true
cloud->on-prem.topics = predictions\..*,enrichments\..*

# Consumer group offset sync
on-prem->cloud.sync.group.offsets.enabled = true
on-prem->cloud.sync.group.offsets.interval.seconds = 60

# Network compression
compression.type = lz4

# Security (PLAINTEXT is only acceptable on a trusted internal network;
# prefer TLS/mTLS on the on-prem side too)
on-prem.security.protocol = PLAINTEXT
cloud.security.protocol = SASL_SSL
cloud.sasl.mechanism = AWS_MSK_IAM

Pattern 4: Data Mesh with Hybrid Domains

A data mesh distributes ownership across domains. In a hybrid environment, some domains naturally live on-premise (core banking, ERP) while others live in the cloud (analytics, ML).


Domain Ownership in Terraform

# Each domain has its own AWS account + budget
module "finance_domain" {
  source = "./modules/data-domain"

  domain_name  = "finance"
  account_id   = var.finance_account_id
  data_gravity = "on-premise"  # Primary data stays on-prem

  cloud_resources = {
    s3_landing_bucket = true
    glue_catalog      = true
    athena_workgroup  = true
  }

  data_product_topics = [
    "gl_journal_entries",
    "cost_center_hierarchy",
    "currency_exchange_rates"
  ]
}

module "customer_domain" {
  source = "./modules/data-domain"

  domain_name  = "customer"
  account_id   = var.customer_account_id
  data_gravity = "cloud"  # Primary data in cloud

  cloud_resources = {
    s3_landing_bucket = true
    rds_postgres      = true
    redshift_cluster  = true
  }
}

Pattern 5: Progressive Cloud Migration

The "big bang" migration almost always fails. Instead, use a strangler fig pattern: run hybrid while progressively moving workloads.

Migration Phases

| Phase | Duration | What Moves | What Stays |
| --- | --- | --- | --- |
| 1 - Shadow | 1-3 months | Analytics replicas | All writes |
| 2 - Read migration | 3-6 months | BI queries → cloud | Transactional writes |
| 3 - Write migration | 6-12 months | New app writes → cloud | Legacy app writes |
| 4 - Cutover | 1-3 months | All traffic | Archive only |
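During phase 2, reads can shift gradually rather than all at once. A sketch of percentage-based read routing — the `cloud_read_pct` knob and the phase numbering follow the table above but are otherwise illustrative:

```python
import random

def route_read(phase: int, cloud_read_pct: float = 0.25) -> str:
    """Phase-aware read routing for the strangler-fig migration.
    Phase 1 keeps every read on-prem (cloud is shadow-only); phase 2
    shifts a tunable fraction of reads to cloud; phases 3+ read from
    cloud. Ramp cloud_read_pct as confidence in the replica grows."""
    if phase <= 1:
        return "on-prem"
    if phase == 2:
        return "cloud" if random.random() < cloud_read_pct else "on-prem"
    return "cloud"

print(route_read(1))  # on-prem
print(route_read(4))  # cloud
```

The key property is that rollback in phase 2 is a config change (set the fraction to zero), not a redeployment.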

Dual-Write Pattern for Phase 3

# Application config: dual-write to both systems
data_stores:
  primary:
    type: postgresql
    host: on-prem-pg.internal
    database: production
    role: read_write

  secondary:
    type: postgresql
    host: cloud-rds.us-east-1.rds.amazonaws.com
    database: production
    role: read_write
    lag_threshold_ms: 500  # Alert if secondary lags

dual_write:
  enabled: true
  mode: synchronous  # Both must succeed
  fallback: primary_only  # On secondary failure
  reconciliation_job:
    schedule: "0 */6 * * *"
    alert_on_drift: true
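The behavior that config describes — the primary write must succeed, secondary failures fall back to primary-only and are queued for the reconciliation job — can be sketched as follows (class and method names are hypothetical):

```python
class DualWriter:
    """Sketch of the dual-write policy above: a primary failure
    propagates to the caller, while a secondary failure is recorded
    for reconciliation instead of failing the request (the
    'fallback: primary_only' behaviour)."""

    def __init__(self, primary, secondary):
        self.primary = primary      # any object with a .write(record) method
        self.secondary = secondary
        self.drift_log = []         # records the 6-hourly job must reconcile

    def write(self, record):
        self.primary.write(record)  # hard failure propagates
        try:
            self.secondary.write(record)
        except Exception:
            self.drift_log.append(record)

class MemStore:
    """In-memory stand-in for a database, for illustration only."""
    def __init__(self, fail=False):
        self.rows, self.fail = [], fail
    def write(self, record):
        if self.fail:
            raise IOError("secondary unavailable")
        self.rows.append(record)

dw = DualWriter(MemStore(), MemStore(fail=True))
dw.write({"id": 1})
print(len(dw.drift_log))  # 1 -- queued for reconciliation, request still succeeded
```

A truly synchronous mode (both must succeed) would instead re-raise on secondary failure; the fallback shown here trades consistency for availability, which is why the drift reconciliation job is non-optional.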

Network Architecture for Hybrid Data


Bandwidth Planning

| Data Volume | Transfer Frequency | Recommended Link | Estimated Cost |
| --- | --- | --- | --- |
| < 100 GB/day | Nightly batch | Site-to-Site VPN | ~$50/mo |
| 100 GB - 1 TB/day | Hourly | Direct Connect 1G | ~$200/mo |
| 1-10 TB/day | Streaming | Direct Connect 10G | ~$600/mo |
| > 10 TB/day | Near-real-time | Direct Connect 100G + DataSync | ~$2,000/mo |
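To sanity-check a row in this table, work backwards from the transfer window. A small helper — the 30% headroom factor is a rule of thumb for protocol overhead and retries, not an AWS figure:

```python
def required_mbps(gb_per_day: float, window_hours: float, headroom: float = 1.3) -> float:
    """Minimum sustained link throughput (Mbit/s) needed to move a
    daily volume inside a transfer window, with headroom for protocol
    overhead and retries. Uses decimal units (1 GB = 8000 Mbit)."""
    megabits = gb_per_day * 8_000
    return megabits * headroom / (window_hours * 3600)

# 500 GB nightly in a 6-hour window:
print(round(required_mbps(500, 6)))  # 241 Mbit/s -> a 1G Direct Connect fits comfortably
```

Running the same arithmetic on 10 TB/day in a 24-hour window lands around 1.2 Gbit/s sustained, which is why the streaming tiers jump to 10G links.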

Identity Federation in Hybrid Environments

Don't run two identity systems. Federate on-premise AD/LDAP with cloud IAM.

# AWS IAM Identity Center (SSO) instances are enabled in the console;
# Terraform reads them through a data source
data "aws_ssoadmin_instances" "main" {}

resource "aws_identitystore_group" "data_engineers" {
  identity_store_id = tolist(data.aws_ssoadmin_instances.main.identity_store_ids)[0]
  display_name      = "DataEngineers"
  description       = "Data platform engineers - cloud access"
}

# Permission set for data engineers
resource "aws_ssoadmin_permission_set" "data_engineer" {
  instance_arn     = tolist(data.aws_ssoadmin_instances.main.arns)[0]
  name             = "DataEngineerAccess"
  session_duration = "PT8H"
}

resource "aws_ssoadmin_managed_policy_attachment" "data_engineer_s3" {
  instance_arn       = tolist(data.aws_ssoadmin_instances.main.arns)[0]
  permission_set_arn = aws_ssoadmin_permission_set.data_engineer.arn
  managed_policy_arn = "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"
}

Monitoring Hybrid Data Flows

Observability gets harder when data crosses network boundaries. Tools like Harbinger Explorer provide unified visibility across hybrid environments—tracking pipeline health, data freshness, and transfer latency regardless of whether your data lives on-premise or in the cloud.

# Monitor cross-boundary transfer volume (BytesTransferred per task)
aws cloudwatch get-metric-statistics \
  --namespace AWS/DataSync \
  --metric-name BytesTransferred \
  --dimensions Name=TaskId,Value=task-0abc123 \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 300 \
  --statistics Sum
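Freshness SLOs are the other half of hybrid observability: alert when the latest cross-boundary load for a dataset is older than the agreed threshold. A minimal checker — dataset names and the 6-hour SLO are illustrative:

```python
from datetime import datetime, timedelta, timezone

def stale_datasets(last_loaded: dict, slo: timedelta, now=None) -> list:
    """Return datasets whose most recent cross-boundary load breaches
    the freshness SLO. last_loaded maps dataset name -> last load
    timestamp (UTC)."""
    now = now or datetime.now(timezone.utc)
    return sorted(name for name, ts in last_loaded.items() if now - ts > slo)

now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
loads = {
    "orders_bronze": now - timedelta(hours=2),   # fresh
    "gl_journal":    now - timedelta(hours=9),   # breached
}
print(stale_datasets(loads, slo=timedelta(hours=6), now=now))  # ['gl_journal']
```

Feed this from whatever records your load completions (DataSync task executions, pipeline metadata tables) and page only on SLO breaches, not on individual transfer hiccups.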

Summary

Hybrid cloud data architecture isn't a transitional state—for many enterprises, it's the permanent operating model. Design for it deliberately:

  1. Understand your data gravity before deciding what moves
  2. Use event streaming (Kafka + MirrorMaker) for bi-directional sync
  3. Federate queries with Trino/Athena Federation to query in-place
  4. Migrate progressively with shadow mode and dual-write patterns
  5. Federate identity — one IAM to rule them all
  6. Instrument cross-boundary flows with transfer metrics and freshness SLOs

Try Harbinger Explorer free for 7 days and get unified observability across your entire hybrid data landscape—from on-premise RDBMS to cloud lakehouses, in a single pane of glass.

