Harbinger Explorer


Cloud-Agnostic Data Lakehouse: Portable Architectures

11 min read·Tags: cloud-architecture, terraform, delta-lake, iceberg, multi-cloud, data-lakehouse, infrastructure-as-code

TL;DR

Cloud lock-in is the silent tax on your data platform. This article lays out a practical architecture for building a cloud-agnostic data lakehouse using Terraform for infrastructure, Delta Lake or Apache Iceberg for the open table format, and abstracted storage and compute layers. You'll get comparison tables, a decision matrix, and an architecture diagram — everything you need to evaluate whether multi-cloud portability is worth the engineering investment for your organization.


The Real Cost of Cloud Lock-in

Every cloud provider wants you all-in. AWS pushes Glue + Athena + Lake Formation. Azure nudges you toward Synapse + Purview + ADLS. GCP pitches BigQuery + Dataplex + GCS. Each stack works well — until your company acquires a subsidiary running on a different cloud, your board mandates a multi-cloud strategy, or your primary provider announces a 30% price increase with 12 months' notice.

Cloud lock-in isn't just about vendor dependency. It manifests in three concrete ways:

  1. Data gravity — Petabytes of data in proprietary formats that cost a fortune to move
  2. Skill lock-in — Teams trained exclusively on one provider's tooling and mental models
  3. Contract leverage — Zero negotiation power when 100% of your workloads sit in one cloud

A cloud-agnostic data lakehouse doesn't mean running everything everywhere simultaneously. It means designing your architecture so that migrating workloads between clouds is an engineering project, not a rewrite.


Open Table Formats: Delta Lake vs Iceberg vs Hudi

The table format is the foundation of portability. If your data is stored in a proprietary format, nothing else matters — you're locked in at the storage layer.

Feature Comparison

| Dimension | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| Origin | Databricks (2019) | Netflix → Apache (2018) | Uber → Apache (2017) |
| Governance | Databricks-led OSS + proprietary extensions | Apache Foundation, vendor-neutral | Apache Foundation, vendor-neutral |
| Cloud Portability | High (since UniForm) | Highest — designed cloud-agnostic from day one | Medium — strongest on AWS |
| Engine Support | Spark, Trino, Flink, Presto, DuckDB, Polaris | Spark, Trino, Flink, Presto, Dremio, Snowflake, BigQuery | Spark, Trino, Flink, Presto |
| Catalog Interop | Unity Catalog, HMS, Glue, Polaris (via UniForm) | HMS, Glue, Polaris, Nessie, Unity Catalog | HMS, Glue |
| Schema Evolution | Add/rename/reorder columns | Add/rename/reorder/drop columns, partition evolution | Add columns, limited evolution |
| Time Travel | Transaction log-based | Snapshot-based, immutable | Timeline-based |
| Streaming | Structured Streaming native | Flink integration improving rapidly | Strongest streaming story (MoR tables) |
| Adoption Trend (2026) | Dominant in Databricks shops | Fastest-growing, multi-vendor momentum | Niche, primarily AWS/Uber ecosystem |
| UniForm / Interop | Delta UniForm reads as Iceberg | Native | Limited cross-format reads |

When to Choose What

Choose Delta Lake when Databricks is your primary compute engine and you want the tightest integration. UniForm now bridges the gap to Iceberg-compatible readers, giving you a reasonable portability story without leaving the Delta ecosystem.

Choose Apache Iceberg when multi-engine access and vendor neutrality are non-negotiable requirements. Iceberg's catalog-level design and partition evolution make it the strongest choice for organizations running workloads across multiple clouds or engines.

Choose Apache Hudi when your primary use case is streaming ingestion with record-level upserts on AWS. Hudi's Merge-on-Read tables still outperform alternatives for high-frequency CDC pipelines — but the ecosystem support gap is widening. [VERIFY]

My take: For new cloud-agnostic architectures in 2026, Iceberg is the pragmatic default. Delta Lake is the right choice if you're already invested in Databricks. Hudi is increasingly hard to justify for greenfield projects.
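The guidance above can be condensed into a toy selection rule. This is a sketch of the decision logic only, not a real API; the function and its inputs are invented for illustration:

```python
def pick_table_format(primary_engine: str, primary_cloud: str,
                      streaming_upserts: bool, vendor_neutral: bool) -> str:
    """Toy condensation of the 'When to Choose What' guidance above."""
    if vendor_neutral:
        # Multi-engine access and vendor neutrality are non-negotiable
        return "iceberg"
    if streaming_upserts and primary_cloud == "aws":
        # High-frequency CDC with record-level upserts on AWS
        return "hudi"
    if primary_engine == "databricks":
        # Tightest integration; UniForm bridges to Iceberg readers
        return "delta"
    # Pragmatic default for new cloud-agnostic architectures
    return "iceberg"

print(pick_table_format("databricks", "azure", False, False))  # delta
```

Real decisions weigh more factors than four booleans, but the fall-through order mirrors the recommendations: neutrality requirements trump everything, Hudi is a niche answer, and Iceberg is the default.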


Terraform for Multi-Cloud Databricks

Terraform is the lingua franca of cloud-agnostic infrastructure. For a multi-cloud lakehouse, it handles the hardest part: making three fundamentally different cloud environments look similar enough to run the same workloads.

What Terraform Manages

A multi-cloud Databricks deployment with Terraform typically covers:

  • Workspace provisioning — Databricks workspaces on AWS, Azure, and GCP with consistent naming, tags, and network configurations
  • Storage backends — S3 buckets, ADLS containers, or GCS buckets with identical IAM policies (translated per cloud)
  • Unity Catalog — Metastore creation, external locations, and credential management across clouds
  • Cluster policies — Standardized compute configurations enforced across all environments
  • Networking — VPC/VNet peering, private endpoints, and firewall rules per cloud

The Module Pattern

The key architectural pattern is a shared module with cloud-specific implementations:

terraform/
├── modules/
│   ├── lakehouse-core/        # Cloud-agnostic: catalog, schemas, permissions
│   ├── lakehouse-aws/         # AWS-specific: S3, IAM roles, VPC
│   ├── lakehouse-azure/       # Azure-specific: ADLS, service principals, VNet
│   └── lakehouse-gcp/         # GCP-specific: GCS, service accounts, VPC
├── environments/
│   ├── aws-prod/
│   ├── azure-prod/
│   └── gcp-staging/
└── variables/
    └── shared.tfvars          # Cross-cloud defaults

The lakehouse-core module defines the logical architecture. The cloud-specific modules translate that into provider-native resources. Environment directories compose them together.

What Terraform Can't Solve

Terraform handles infrastructure, not application logic. You still need:

  • Data replication strategy — How data moves between clouds (if it needs to)
  • Job orchestration — Airflow, Prefect, or Databricks Workflows running cross-cloud DAGs
  • Secret management — Vault or provider-native secret stores with a unified interface
  • Monitoring — Datadog, Grafana, or a provider-agnostic observability stack

Storage Abstraction: S3 / ADLS / GCS

At the storage layer, all three clouds offer object storage that's functionally equivalent for lakehouse workloads. The differences are in naming, authentication, and performance characteristics.

Storage Layer Comparison

| Dimension | AWS S3 | Azure ADLS Gen2 | GCP GCS |
|---|---|---|---|
| Path Format | s3://bucket/path | abfss://container@account.dfs.core.windows.net/path | gs://bucket/path |
| Auth Model | IAM Roles / Instance Profiles | Service Principals / Managed Identity | Service Accounts / Workload Identity |
| Hierarchical Namespace | Flat (prefix-based) | Native HNS | Flat (prefix-based) |
| Consistency | Strong (since 2020) | Strong | Strong |
| Typical Egress Cost | $0.09/GB [PRICING-CHECK] | $0.087/GB [PRICING-CHECK] | $0.12/GB [PRICING-CHECK] |
| Cross-Region Replication | S3 Replication | GRS / RA-GRS | Dual/multi-region buckets |
| Iceberg Support | Native via Glue/S3 Tables | Native via Unity Catalog | Native via BigLake |

The Abstraction Strategy

Don't build a custom storage abstraction layer. Instead:

  1. Use open table formats — Delta/Iceberg metadata is path-based. The table format abstracts the storage protocol.
  2. Configure storage in the catalog — Unity Catalog external locations or Iceberg catalogs map logical table names to physical cloud paths.
  3. Standardize IAM patterns — Each cloud's authentication is different, but the pattern (service identity → role → storage access) is the same. Terraform modules encode this pattern.

The goal isn't to make S3 and ADLS look identical in code. It's to ensure that switching the storage backend is a Terraform variable change and a data migration, not a rewrite of every pipeline.
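To make that concrete, here is what the per-cloud path mapping looks like in code. A minimal sketch with invented names; in a real deployment this mapping lives in catalog external locations rather than in pipeline code, and the `account` argument only applies to ADLS:

```python
def table_root(cloud: str, container: str, path: str, account: str = "") -> str:
    """Build the object-store root URI for a table.

    The URI formats match the storage comparison table above.
    """
    if cloud == "aws":
        return f"s3://{container}/{path}"
    if cloud == "azure":
        return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"
    if cloud == "gcp":
        return f"gs://{container}/{path}"
    raise ValueError(f"unknown cloud: {cloud}")

# Switching backends is one argument change; the logical table names
# registered in the catalog stay exactly the same.
print(table_root("aws", "lakehouse-prod", "silver/orders"))
# s3://lakehouse-prod/silver/orders
```

Only the root URI changes per cloud; everything downstream of the catalog is untouched, which is exactly the property the abstraction strategy is after.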


Compute Layer: Databricks vs EMR vs Dataproc

Compute is where cloud-agnostic gets expensive. Running the same engine across all three clouds is the simplest path, but it comes with trade-offs.

Compute Comparison

| Dimension | Databricks (Multi-Cloud) | AWS EMR | GCP Dataproc |
|---|---|---|---|
| Available On | AWS, Azure, GCP | AWS only | GCP only |
| Engine | Photon (optimized Spark) | Spark, Hive, Presto, Flink | Spark, Hive, Presto, Flink |
| Managed Delta/Iceberg | Native Delta + Iceberg via UniForm | Iceberg native, Delta via OSS | Iceberg via BigLake, Delta via OSS |
| Unity Catalog | Yes (cross-cloud) | No | No |
| Auto-Scaling | Photon-optimized | YARN-based | YARN-based |
| Serverless Option | Serverless SQL + Jobs | EMR Serverless | Dataproc Serverless |
| Compute Cost | $0.07–0.55/DBU [PRICING-CHECK] | $0.015–0.27/hr per instance [PRICING-CHECK] | $0.01–0.20/hr per vCPU [PRICING-CHECK] |
| Portability | High (same API across clouds) | None (AWS-only) | None (GCP-only) |

The Pragmatic Choice

Databricks as the cross-cloud compute layer is the most common pattern for organizations that need genuine multi-cloud. The same notebooks, jobs, and SQL warehouses work identically on AWS, Azure, and GCP. Unity Catalog provides a single governance plane across all three.

The trade-off: Databricks carries a pricing premium over provider-native options. For a 100-node Spark workload, you might pay 20–40% more than running EMR or Dataproc directly. [PRICING-CHECK]

When provider-native compute makes sense: If you're running 80%+ of workloads on one cloud with occasional burst to another, use the native compute service for your primary cloud and accept the migration cost for the rare case.


Catalog Layer: Unity Catalog vs Glue vs Polaris

The catalog is the control plane of your lakehouse. It determines who can see what data, where it lives, and how engines discover it.

Catalog Comparison

| Dimension | Unity Catalog | AWS Glue Data Catalog | Apache Polaris (Snowflake) |
|---|---|---|---|
| Multi-Cloud | Yes (AWS, Azure, GCP) | AWS only | Cloud-agnostic (open source) |
| Table Formats | Delta, Iceberg (via UniForm) | Iceberg, Hudi, Delta (limited) | Iceberg only |
| Governance | RBAC + ABAC, column masking, row filters | IAM-based, Lake Formation | REST catalog spec, engine-level auth |
| Data Lineage | Built-in | None (use third-party) | None |
| Data Sharing | Delta Sharing (open protocol) | Lake Formation cross-account | Iceberg REST catalog protocol |
| Vendor Lock-in Risk | Medium — Databricks-specific but open protocol | High — AWS-only | Low — open source Apache project |
| Maturity (2026) | Production-grade, widely adopted | Production-grade, AWS-dominant | Early production, growing fast [VERIFY] |

Catalog Strategy for Multi-Cloud

For genuine cloud-agnostic architectures, two viable patterns exist:

  1. Unity Catalog as the universal plane — Works if Databricks is your primary compute. UC spans all three clouds, provides lineage, and supports Delta Sharing for external consumers. The risk: you're betting on Databricks' continued multi-cloud investment.

  2. Polaris as the open alternative — Apache Polaris implements the Iceberg REST Catalog specification. Any Iceberg-compatible engine can discover and query tables. No vendor dependency, but you lose built-in lineage and governance features. Pair it with OpenMetadata or DataHub for the governance layer.
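The Polaris pattern works because the Iceberg REST Catalog is just an HTTP contract: any engine that can issue the spec's routes can discover and load tables. A sketch of route construction under that spec (the base URL and prefix are placeholders, nothing is actually requested, and the spec's separator encoding for multi-level namespaces is omitted here):

```python
from urllib.parse import quote

# Routes per the Iceberg REST Catalog OpenAPI spec:
#   GET /v1/{prefix}/namespaces/{ns}/tables          -> list tables
#   GET /v1/{prefix}/namespaces/{ns}/tables/{table}  -> load table metadata
def catalog_route(base: str, prefix: str, namespace: str, table: str = "") -> str:
    root = (f"{base}/v1/{quote(prefix, safe='')}"
            f"/namespaces/{quote(namespace, safe='')}/tables")
    return f"{root}/{quote(table, safe='')}" if table else root

print(catalog_route("https://polaris.example.com/api/catalog",
                    "lakehouse", "silver", "orders"))
# https://polaris.example.com/api/catalog/v1/lakehouse/namespaces/silver/tables/orders
```

Because the routes, not a vendor SDK, are the interface, swapping Polaris for any other REST-catalog implementation changes only the base URL.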


Decision Matrix: Cloud-Agnostic vs All-In

Not every organization needs a multi-cloud lakehouse. Here's a structured framework for the decision.

| Factor | Favors Cloud-Agnostic | Favors All-In Single Cloud |
|---|---|---|
| Regulatory requirements | Multi-region/multi-jurisdiction mandates | Single-country data residency |
| M&A activity | Frequent acquisitions with mixed cloud estates | Stable organizational structure |
| Vendor negotiation | Need pricing leverage against providers | Strong existing enterprise agreement |
| Team size | 10+ data engineers who can handle complexity | Small team, need simplicity |
| Data volume | < 50 TB (migration feasible) | > 500 TB (data gravity too strong) |
| Workload diversity | Multiple engines needed (Spark, Trino, Flink) | Single engine sufficient |
| Time to market | Can invest 3–6 months in platform | Need production in weeks |
| Annual cloud spend | > $500K (lock-in risk is material) | < $100K (portability cost exceeds lock-in cost) |
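One way to apply the matrix is to record which column each factor points to and compare totals. A deliberately simple, unweighted sketch (the factor keys mirror the table rows; the scoring scheme itself is illustrative, and ties default to the simpler single-cloud option):

```python
def score_decision(answers: dict) -> str:
    """Count which column each factor favors; majority wins."""
    agnostic = sum(1 for v in answers.values() if v == "agnostic")
    single = sum(1 for v in answers.values() if v == "single")
    return "cloud-agnostic" if agnostic > single else "all-in single cloud"

answers = {
    "regulatory": "agnostic",    # multi-jurisdiction mandates
    "m_and_a": "agnostic",       # mixed cloud estates post-acquisition
    "negotiation": "agnostic",   # need pricing leverage
    "team_size": "single",       # small team, need simplicity
    "data_volume": "single",     # > 500 TB gravity
    "workloads": "agnostic",     # Spark + Trino + Flink
    "time_to_market": "single",  # need production in weeks
    "spend": "agnostic",         # > $500K/year
}
print(score_decision(answers))  # cloud-agnostic (5 factors vs 3)
```

In practice you would weight the factors (regulatory mandates and data gravity usually dominate), but even a raw tally forces the conversation the matrix is meant to start.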

The Honest Assessment

Cloud-agnostic architecture adds 15–30% engineering overhead in the design and build phase. It pays off when:

  • You operate across multiple clouds today (post-M&A is the #1 driver)
  • Your annual cloud spend exceeds $500K and you need negotiation leverage
  • Regulatory requirements mandate geographic or provider diversification

It's often not worth it when:

  • You're a single-cloud shop with < $200K annual spend
  • Your team is small and needs to ship fast
  • Your data gravity (500+ TB in one cloud) makes migration impractical regardless

Architecture Diagram

[Architecture diagram: Terraform-provisioned infrastructure across AWS, Azure, and GCP, with open-format storage, a shared catalog layer, and multi-engine compute serving consumers]

How it flows:

  1. Terraform provisions identical infrastructure patterns across AWS, Azure, and GCP — storage buckets, IAM roles, networking, and compute clusters.

  2. Cloud storage (S3, ADLS, GCS) holds data in open table formats (Iceberg or Delta with UniForm), ensuring any engine can read the data regardless of where it's stored.

  3. The catalog layer (Unity Catalog or Polaris) provides a single logical namespace across all clouds, handling discovery, governance, and access control.

  4. Compute engines — Databricks for managed workloads, open-source engines (Trino, Flink, Spark) for specialized or cost-sensitive workloads — query data through the catalog.

  5. Consumers access data through the compute layer, unaware of which cloud the data physically resides in.



Cost Considerations

Building cloud-agnostic isn't free. Budget for these cost categories:

| Cost Category | Single-Cloud Baseline | Multi-Cloud Premium | Notes |
|---|---|---|---|
| Infrastructure (Terraform) | 1x | 1.5–2x | Maintaining modules for 3 providers [PRICING-CHECK] |
| Compute (Databricks) | $0.07–0.55/DBU | Same per-cloud, but higher total | Running in multiple regions increases cost [PRICING-CHECK] |
| Data Egress | Minimal | $0.08–0.12/GB cross-cloud | The silent killer — plan data locality carefully [PRICING-CHECK] |
| Engineering Overhead | 1x | 1.2–1.4x | Abstraction layers, testing across clouds |
| Catalog (Unity Catalog) | Included with Databricks | Included with Databricks | Polaris is free (OSS) but requires self-hosting |
| Observability | Provider-native (lower cost) | Cross-cloud tooling (Datadog, Grafana) | Provider-native monitoring doesn't span clouds [PRICING-CHECK] |

Rule of thumb: A well-designed cloud-agnostic lakehouse costs 20–35% more than an equivalent single-cloud deployment. The premium decreases as workload volume increases because the portability layer is mostly fixed cost. [PRICING-CHECK]
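The egress row deserves a worked number, because cross-cloud traffic is where badly planned multi-cloud designs bleed money. A back-of-envelope sketch; the rates are the same placeholder figures flagged [PRICING-CHECK] above:

```python
def egress_cost_usd(terabytes: float, rate_per_gb: float = 0.09) -> float:
    """Cross-cloud transfer cost at a placeholder per-GB rate (pending [PRICING-CHECK])."""
    return terabytes * 1024 * rate_per_gb

def multi_cloud_budget(single_cloud_annual: float, premium: float = 0.30) -> float:
    """Apply the 20-35% rule-of-thumb premium, using 0.30 as a midpoint placeholder."""
    return single_cloud_annual * (1 + premium)

# Replicating 10 TB across clouds once costs roughly:
print(round(egress_cost_usd(10)))          # 922
# A $500K/year single-cloud platform, rebudgeted for multi-cloud:
print(round(multi_cloud_budget(500_000)))  # 650000
```

A single full replication of a modest lake costs about as much as a month of small-team compute, which is why the architecture above keeps data local and moves only metadata and query results across clouds.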


Where Harbinger Explorer Fits

If you're exploring data across multiple cloud environments during the design phase of a lakehouse architecture, Harbinger Explorer can help. Its browser-based DuckDB engine lets you query CSV exports, API responses, and uploaded datasets from any cloud — no infrastructure setup required. Useful for quick cross-cloud data profiling before committing to a table format or catalog strategy.


Key Takeaways

The cloud-agnostic data lakehouse is a real architecture pattern, not a vendor fantasy — but it demands disciplined engineering. Start with open table formats (Iceberg or Delta with UniForm) as the non-negotiable foundation. Use Terraform to encode infrastructure patterns that translate across clouds. Pick your catalog based on whether you're willing to depend on Databricks (Unity Catalog) or want full independence (Polaris). And be honest about whether your organization actually needs multi-cloud portability — for many teams, the 20–35% cost premium isn't justified.

The best time to design for portability is before you have 500 TB locked into a proprietary format. The second-best time is now.



Markers

[VERIFY]:

  • Hudi MoR performance claim for high-frequency CDC vs Delta/Iceberg
  • Apache Polaris maturity status in 2026

[PRICING-CHECK]:

  • AWS S3 egress: $0.09/GB
  • Azure ADLS egress: $0.087/GB
  • GCP GCS egress: $0.12/GB
  • Databricks DBU pricing: $0.07–0.55/DBU
  • EMR pricing: $0.015–0.27/hr per instance
  • Dataproc pricing: $0.01–0.20/hr per vCPU
  • Databricks premium over native: 20–40%
  • Multi-cloud Terraform maintenance multiplier: 1.5–2x
  • Cross-cloud data egress: $0.08–0.12/GB
  • Cross-cloud observability tooling cost
  • Overall cloud-agnostic premium: 20–35%
