Harbinger Explorer


Cloud-Agnostic Data Lakehouse: Portable Architectures

11 min read·Tags: cloud-architecture, terraform, delta-lake, iceberg, multi-cloud, data-lakehouse, infrastructure-as-code

TL;DR

Cloud lock-in is the silent tax on your data platform. This article lays out a practical architecture for building a cloud-agnostic data lakehouse using Terraform for infrastructure, Delta Lake or Apache Iceberg for the open table format, and abstracted storage and compute layers. You'll get comparison tables, a decision matrix, and an architecture diagram — everything you need to evaluate whether multi-cloud portability is worth the engineering investment for your organization.


The Real Cost of Cloud Lock-in

Every cloud provider wants you all-in. AWS pushes Glue + Athena + Lake Formation. Azure nudges you toward Synapse + Purview + ADLS. GCP pitches BigQuery + Dataplex + GCS. Each stack works well — until your company acquires a subsidiary running on a different cloud, your board mandates a multi-cloud strategy, or your primary provider announces a 30% price increase with 12 months' notice.

Cloud lock-in isn't just about vendor dependency. It manifests in three concrete ways:

  1. Data gravity — Petabytes of data in proprietary formats that cost a fortune to move
  2. Skill lock-in — Teams trained exclusively on one provider's tooling and mental models
  3. Contract leverage — Zero negotiation power when 100% of your workloads sit in one cloud

A cloud-agnostic data lakehouse doesn't mean running everything everywhere simultaneously. It means designing your architecture so that migrating workloads between clouds is an engineering project, not a rewrite.


Open Table Formats: Delta Lake vs Iceberg vs Hudi

The table format is the foundation of portability. If your data is stored in a proprietary format, nothing else matters — you're locked in at the storage layer.

Feature Comparison

| Dimension | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| Origin | Databricks (2019) | Netflix → Apache (2018) | Uber → Apache (2017) |
| Governance | Databricks-led OSS + proprietary extensions | Apache Foundation, vendor-neutral | Apache Foundation, vendor-neutral |
| Cloud Portability | High (since UniForm) | Highest — designed cloud-agnostic from day one | Medium — strongest on AWS |
| Engine Support | Spark, Trino, Flink, Presto, DuckDB, Polaris | Spark, Trino, Flink, Presto, Dremio, Snowflake, BigQuery | Spark, Trino, Flink, Presto |
| Catalog Interop | Unity Catalog, HMS, Glue, Polaris (via UniForm) | HMS, Glue, Polaris, Nessie, Unity Catalog | HMS, Glue |
| Schema Evolution | Add/rename/reorder columns | Add/rename/reorder/drop columns, partition evolution | Add columns, limited evolution |
| Time Travel | Transaction log-based | Snapshot-based, immutable | Timeline-based |
| Streaming | Structured Streaming native | Flink integration improving rapidly | Strongest streaming story (MoR tables) |
| Adoption Trend (2026) | Dominant in Databricks shops | Fastest-growing, multi-vendor momentum | Niche, primarily AWS/Uber ecosystem |
| UniForm / Interop | Delta UniForm reads as Iceberg | Native | Limited cross-format reads |

When to Choose What

Choose Delta Lake when Databricks is your primary compute engine and you want the tightest integration. UniForm now bridges the gap to Iceberg-compatible readers, giving you a reasonable portability story without leaving the Delta ecosystem.

Choose Apache Iceberg when multi-engine access and vendor neutrality are non-negotiable requirements. Iceberg's catalog-level design and partition evolution make it the strongest choice for organizations running workloads across multiple clouds or engines.

Choose Apache Hudi when your primary use case is streaming ingestion with record-level upserts on AWS. Hudi's Merge-on-Read tables still outperform alternatives for high-frequency CDC pipelines — but the ecosystem support gap is widening. [VERIFY]

My take: For new cloud-agnostic architectures in 2026, Iceberg is the pragmatic default. Delta Lake is the right choice if you're already invested in Databricks. Hudi is increasingly hard to justify for greenfield projects.
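The guidance above can be condensed into a toy selection rule. This is a sketch of the decision logic only, not a real API; the function and its inputs are invented for illustration:

```python
def pick_table_format(primary_engine: str, primary_cloud: str,
                      streaming_upserts: bool, vendor_neutral: bool) -> str:
    """Toy condensation of the 'When to Choose What' guidance above."""
    if vendor_neutral:
        # Multi-engine access and vendor neutrality are non-negotiable
        return "iceberg"
    if streaming_upserts and primary_cloud == "aws":
        # High-frequency CDC with record-level upserts on AWS
        return "hudi"
    if primary_engine == "databricks":
        # Tightest integration; UniForm bridges to Iceberg readers
        return "delta"
    # Pragmatic default for new cloud-agnostic architectures
    return "iceberg"

print(pick_table_format("databricks", "azure", False, False))  # delta
```

Real decisions weigh more factors than four booleans, but the fall-through order mirrors the recommendations: neutrality requirements trump everything, Hudi is a niche answer, and Iceberg is the default.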


Terraform for Multi-Cloud Databricks

Terraform is the lingua franca of cloud-agnostic infrastructure. For a multi-cloud lakehouse, it handles the hardest part: making three fundamentally different cloud environments look similar enough to run the same workloads.

What Terraform Manages

A multi-cloud Databricks deployment with Terraform typically covers:

  • Workspace provisioning — Databricks workspaces on AWS, Azure, and GCP with consistent naming, tags, and network configurations
  • Storage backends — S3 buckets, ADLS containers, or GCS buckets with identical IAM policies (translated per cloud)
  • Unity Catalog — Metastore creation, external locations, and credential management across clouds
  • Cluster policies — Standardized compute configurations enforced across all environments
  • Networking — VPC/VNet peering, private endpoints, and firewall rules per cloud

The Module Pattern

The key architectural pattern is a shared module with cloud-specific implementations:

terraform/
├── modules/
│   ├── lakehouse-core/        # Cloud-agnostic: catalog, schemas, permissions
│   ├── lakehouse-aws/         # AWS-specific: S3, IAM roles, VPC
│   ├── lakehouse-azure/       # Azure-specific: ADLS, service principals, VNet
│   └── lakehouse-gcp/         # GCP-specific: GCS, service accounts, VPC
├── environments/
│   ├── aws-prod/
│   ├── azure-prod/
│   └── gcp-staging/
└── variables/
    └── shared.tfvars          # Cross-cloud defaults

The lakehouse-core module defines the logical architecture. The cloud-specific modules translate that into provider-native resources. Environment directories compose them together.

What Terraform Can't Solve

Terraform handles infrastructure, not application logic. You still need:

  • Data replication strategy — How data moves between clouds (if it needs to)
  • Job orchestration — Airflow, Prefect, or Databricks Workflows running cross-cloud DAGs
  • Secret management — Vault or provider-native secret stores with a unified interface
  • Monitoring — Datadog, Grafana, or a provider-agnostic observability stack

Storage Abstraction: S3 / ADLS / GCS

At the storage layer, all three clouds offer object storage that's functionally equivalent for lakehouse workloads. The differences are in naming, authentication, and performance characteristics.

Storage Layer Comparison

| Dimension | AWS S3 | Azure ADLS Gen2 | GCP GCS |
|---|---|---|---|
| Path Format | s3://bucket/path | abfss://container@account.dfs.core.windows.net/path | gs://bucket/path |
| Auth Model | IAM Roles / Instance Profiles | Service Principals / Managed Identity | Service Accounts / Workload Identity |
| Hierarchical Namespace | Flat (prefix-based) | Native HNS | Flat (prefix-based) |
| Consistency | Strong (since 2020) | Strong | Strong |
| Typical Egress Cost | $0.09/GB [PRICING-CHECK] | $0.087/GB [PRICING-CHECK] | $0.12/GB [PRICING-CHECK] |
| Cross-Region Replication | S3 Replication | GRS / RA-GRS | Dual/multi-region buckets |
| Iceberg Support | Native via Glue/S3 Tables | Native via Unity Catalog | Native via BigLake |

The Abstraction Strategy

Don't build a custom storage abstraction layer. Instead:

  1. Use open table formats — Delta/Iceberg metadata is path-based. The table format abstracts the storage protocol.
  2. Configure storage in the catalog — Unity Catalog external locations or Iceberg catalogs map logical table names to physical cloud paths.
  3. Standardize IAM patterns — Each cloud's authentication is different, but the pattern (service identity → role → storage access) is the same. Terraform modules encode this pattern.

The goal isn't to make S3 and ADLS look identical in code. It's to ensure that switching the storage backend is a Terraform variable change and a data migration, not a rewrite of every pipeline.
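To make that concrete, here is what the per-cloud path mapping looks like in code. A minimal sketch with invented names; in a real deployment this mapping lives in catalog external locations rather than in pipeline code, and the `account` argument only applies to ADLS:

```python
def table_root(cloud: str, container: str, path: str, account: str = "") -> str:
    """Build the object-store root URI for a table.

    The URI formats match the storage comparison table above.
    """
    if cloud == "aws":
        return f"s3://{container}/{path}"
    if cloud == "azure":
        return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"
    if cloud == "gcp":
        return f"gs://{container}/{path}"
    raise ValueError(f"unknown cloud: {cloud}")

# Switching backends is one argument change; the logical table names
# registered in the catalog stay exactly the same.
print(table_root("aws", "lakehouse-prod", "silver/orders"))
# s3://lakehouse-prod/silver/orders
```

Only the root URI changes per cloud; everything downstream of the catalog is untouched, which is exactly the property the abstraction strategy is after.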


Compute Layer: Databricks vs EMR vs Dataproc

Compute is where cloud-agnostic gets expensive. Running the same engine across all three clouds is the simplest path, but it comes with trade-offs.

Compute Comparison

| Dimension | Databricks (Multi-Cloud) | AWS EMR | GCP Dataproc |
|---|---|---|---|
| Available On | AWS, Azure, GCP | AWS only | GCP only |
| Engine | Photon (optimized Spark) | Spark, Hive, Presto, Flink | Spark, Hive, Presto, Flink |
| Managed Delta/Iceberg | Native Delta + Iceberg via UniForm | Iceberg native, Delta via OSS | Iceberg via BigLake, Delta via OSS |
| Unity Catalog | Yes (cross-cloud) | No | No |
| Auto-Scaling | Photon-optimized | YARN-based | YARN-based |
| Serverless Option | Serverless SQL + Jobs | EMR Serverless | Dataproc Serverless |
| Compute Cost | $0.07–0.55/DBU [PRICING-CHECK] | $0.015–0.27/hr per instance [PRICING-CHECK] | $0.01–0.20/hr per vCPU [PRICING-CHECK] |
| Portability | High (same API across clouds) | None (AWS-only) | None (GCP-only) |

The Pragmatic Choice

Databricks as the cross-cloud compute layer is the most common pattern for organizations that need genuine multi-cloud. The same notebooks, jobs, and SQL warehouses work identically on AWS, Azure, and GCP. Unity Catalog provides a single governance plane across all three.

The trade-off: Databricks carries a pricing premium over provider-native options. For a 100-node Spark workload, you might pay 20–40% more than running EMR or Dataproc directly. [PRICING-CHECK]

When provider-native compute makes sense: If you're running 80%+ of workloads on one cloud with occasional burst to another, use the native compute service for your primary cloud and accept the migration cost for the rare case.


Catalog Layer: Unity Catalog vs Glue vs Polaris

The catalog is the control plane of your lakehouse. It determines who can see what data, where it lives, and how engines discover it.

Catalog Comparison

| Dimension | Unity Catalog | AWS Glue Data Catalog | Apache Polaris (Snowflake) |
|---|---|---|---|
| Multi-Cloud | Yes (AWS, Azure, GCP) | AWS only | Cloud-agnostic (open source) |
| Table Formats | Delta, Iceberg (via UniForm) | Iceberg, Hudi, Delta (limited) | Iceberg only |
| Governance | RBAC + ABAC, column masking, row filters | IAM-based, Lake Formation | REST catalog spec, engine-level auth |
| Data Lineage | Built-in | None (use third-party) | None |
| Data Sharing | Delta Sharing (open protocol) | Lake Formation cross-account | Iceberg REST catalog protocol |
| Vendor Lock-in Risk | Medium — Databricks-specific but open protocol | High — AWS-only | Low — open source Apache project |
| Maturity (2026) | Production-grade, widely adopted | Production-grade, AWS-dominant | Early production, growing fast [VERIFY] |

Catalog Strategy for Multi-Cloud

For genuine cloud-agnostic architectures, two viable patterns exist:

  1. Unity Catalog as the universal plane — Works if Databricks is your primary compute. UC spans all three clouds, provides lineage, and supports Delta Sharing for external consumers. The risk: you're betting on Databricks' continued multi-cloud investment.

  2. Polaris as the open alternative — Apache Polaris implements the Iceberg REST Catalog specification. Any Iceberg-compatible engine can discover and query tables. No vendor dependency, but you lose built-in lineage and governance features. Pair it with OpenMetadata or DataHub for the governance layer.
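The Polaris pattern works because the Iceberg REST Catalog is just an HTTP contract: any engine that can issue the spec's routes can discover and load tables. A sketch of route construction under that spec (the base URL and prefix are placeholders, nothing is actually requested, and the spec's separator encoding for multi-level namespaces is omitted here):

```python
from urllib.parse import quote

# Routes per the Iceberg REST Catalog OpenAPI spec:
#   GET /v1/{prefix}/namespaces/{ns}/tables          -> list tables
#   GET /v1/{prefix}/namespaces/{ns}/tables/{table}  -> load table metadata
def catalog_route(base: str, prefix: str, namespace: str, table: str = "") -> str:
    root = (f"{base}/v1/{quote(prefix, safe='')}"
            f"/namespaces/{quote(namespace, safe='')}/tables")
    return f"{root}/{quote(table, safe='')}" if table else root

print(catalog_route("https://polaris.example.com/api/catalog",
                    "lakehouse", "silver", "orders"))
# https://polaris.example.com/api/catalog/v1/lakehouse/namespaces/silver/tables/orders
```

Because the routes, not a vendor SDK, are the interface, swapping Polaris for any other REST-catalog implementation changes only the base URL.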


Decision Matrix: Cloud-Agnostic vs All-In

Not every organization needs a multi-cloud lakehouse. Here's a structured framework for the decision.

| Factor | Favors Cloud-Agnostic | Favors All-In Single Cloud |
|---|---|---|
| Regulatory requirements | Multi-region/multi-jurisdiction mandates | Single-country data residency |
| M&A activity | Frequent acquisitions with mixed cloud estates | Stable organizational structure |
| Vendor negotiation | Need pricing leverage against providers | Strong existing enterprise agreement |
| Team size | 10+ data engineers who can handle complexity | Small team, need simplicity |
| Data volume | < 50 TB (migration feasible) | > 500 TB (data gravity too strong) |
| Workload diversity | Multiple engines needed (Spark, Trino, Flink) | Single engine sufficient |
| Time to market | Can invest 3–6 months in platform | Need production in weeks |
| Annual cloud spend | > $500K (lock-in risk is material) | < $100K (portability cost exceeds lock-in cost) |
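One way to apply the matrix is to record which column each factor points to and compare totals. A deliberately simple, unweighted sketch (the factor keys mirror the table rows; the scoring scheme itself is illustrative, and ties default to the simpler single-cloud option):

```python
def score_decision(answers: dict) -> str:
    """Count which column each factor favors; majority wins."""
    agnostic = sum(1 for v in answers.values() if v == "agnostic")
    single = sum(1 for v in answers.values() if v == "single")
    return "cloud-agnostic" if agnostic > single else "all-in single cloud"

answers = {
    "regulatory": "agnostic",    # multi-jurisdiction mandates
    "m_and_a": "agnostic",       # mixed cloud estates post-acquisition
    "negotiation": "agnostic",   # need pricing leverage
    "team_size": "single",       # small team, need simplicity
    "data_volume": "single",     # > 500 TB gravity
    "workloads": "agnostic",     # Spark + Trino + Flink
    "time_to_market": "single",  # need production in weeks
    "spend": "agnostic",         # > $500K/year
}
print(score_decision(answers))  # cloud-agnostic (5 factors vs 3)
```

In practice you would weight the factors (regulatory mandates and data gravity usually dominate), but even a raw tally forces the conversation the matrix is meant to start.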

The Honest Assessment

Cloud-agnostic architecture adds 15–30% engineering overhead in the design and build phase. It pays off when:

  • You operate across multiple clouds today (post-M&A is the #1 driver)
  • Your annual cloud spend exceeds $500K and you need negotiation leverage
  • Regulatory requirements mandate geographic or provider diversification

It's often not worth it when:

  • You're a single-cloud shop with < $200K annual spend
  • Your team is small and needs to ship fast
  • Your data gravity (500+ TB in one cloud) makes migration impractical regardless

Architecture Diagram

[Architecture diagram: Terraform-provisioned infrastructure across AWS, Azure, and GCP, with open-format storage, a shared catalog layer, and multi-engine compute serving consumers]

How it flows:

  1. Terraform provisions identical infrastructure patterns across AWS, Azure, and GCP — storage buckets, IAM roles, networking, and compute clusters.

  2. Cloud storage (S3, ADLS, GCS) holds data in open table formats (Iceberg or Delta with UniForm), ensuring any engine can read the data regardless of where it's stored.

  3. The catalog layer (Unity Catalog or Polaris) provides a single logical namespace across all clouds, handling discovery, governance, and access control.

  4. Compute engines — Databricks for managed workloads, open-source engines (Trino, Flink, Spark) for specialized or cost-sensitive workloads — query data through the catalog.

  5. Consumers access data through the compute layer, unaware of which cloud the data physically resides in.



Cost Considerations

Building cloud-agnostic isn't free. Budget for these cost categories:

| Cost Category | Single-Cloud Baseline | Multi-Cloud Premium | Notes |
|---|---|---|---|
| Infrastructure (Terraform) | 1x | 1.5–2x | Maintaining modules for 3 providers [PRICING-CHECK] |
| Compute (Databricks) | $0.07–0.55/DBU | Same per-cloud, but higher total | Running in multiple regions increases cost [PRICING-CHECK] |
| Data Egress | Minimal | $0.08–0.12/GB cross-cloud | The silent killer — plan data locality carefully [PRICING-CHECK] |
| Engineering Overhead | 1x | 1.2–1.4x | Abstraction layers, testing across clouds |
| Catalog (Unity Catalog) | Included with Databricks | Included with Databricks | Polaris is free (OSS) but requires self-hosting |
| Observability | Provider-native (lower cost) | Cross-cloud tooling (Datadog, Grafana) | Provider-native monitoring doesn't span clouds [PRICING-CHECK] |

Rule of thumb: A well-designed cloud-agnostic lakehouse costs 20–35% more than an equivalent single-cloud deployment. The premium decreases as workload volume increases because the portability layer is mostly fixed cost. [PRICING-CHECK]
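The egress row deserves a worked number, because cross-cloud traffic is where badly planned multi-cloud designs bleed money. A back-of-envelope sketch; the rates are the same placeholder figures flagged [PRICING-CHECK] above:

```python
def egress_cost_usd(terabytes: float, rate_per_gb: float = 0.09) -> float:
    """Cross-cloud transfer cost at a placeholder per-GB rate (pending [PRICING-CHECK])."""
    return terabytes * 1024 * rate_per_gb

def multi_cloud_budget(single_cloud_annual: float, premium: float = 0.30) -> float:
    """Apply the 20-35% rule-of-thumb premium, using 0.30 as a midpoint placeholder."""
    return single_cloud_annual * (1 + premium)

# Replicating 10 TB across clouds once costs roughly:
print(round(egress_cost_usd(10)))          # 922
# A $500K/year single-cloud platform, rebudgeted for multi-cloud:
print(round(multi_cloud_budget(500_000)))  # 650000
```

A single full replication of a modest lake costs about as much as a month of small-team compute, which is why the architecture above keeps data local and moves only metadata and query results across clouds.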


Where Harbinger Explorer Fits

If you're exploring data across multiple cloud environments during the design phase of a lakehouse architecture, Harbinger Explorer can help. Its browser-based DuckDB engine lets you query CSV exports, API responses, and uploaded datasets from any cloud — no infrastructure setup required. Useful for quick cross-cloud data profiling before committing to a table format or catalog strategy.


Key Takeaways

The cloud-agnostic data lakehouse is a real architecture pattern, not a vendor fantasy — but it demands disciplined engineering. Start with open table formats (Iceberg or Delta with UniForm) as the non-negotiable foundation. Use Terraform to encode infrastructure patterns that translate across clouds. Pick your catalog based on whether you're willing to depend on Databricks (Unity Catalog) or want full independence (Polaris). And be honest about whether your organization actually needs multi-cloud portability — for many teams, the 20–35% cost premium isn't justified.

The best time to design for portability is before you have 500 TB locked into a proprietary format. The second-best time is now.



Markers

[VERIFY]:

  • Hudi MoR performance claim for high-frequency CDC vs Delta/Iceberg
  • Apache Polaris maturity status in 2026

[PRICING-CHECK]:

  • AWS S3 egress: $0.09/GB
  • Azure ADLS egress: $0.087/GB
  • GCP GCS egress: $0.12/GB
  • Databricks DBU pricing: $0.07–0.55/DBU
  • EMR pricing: $0.015–0.27/hr per instance
  • Dataproc pricing: $0.01–0.20/hr per vCPU
  • Databricks premium over native: 20–40%
  • Multi-cloud Terraform maintenance multiplier: 1.5–2x
  • Cross-cloud data egress: $0.08–0.12/GB
  • Cross-cloud observability tooling cost
  • Overall cloud-agnostic premium: 20–35%
