Cloud-Agnostic Data Lakehouse: Portable Architectures
TL;DR
Cloud lock-in is the silent tax on your data platform. This article lays out a practical architecture for building a cloud-agnostic data lakehouse using Terraform for infrastructure, Delta Lake or Apache Iceberg for the open table format, and abstracted storage and compute layers. You'll get comparison tables, a decision matrix, and an architecture diagram — everything you need to evaluate whether multi-cloud portability is worth the engineering investment for your organization.
The Real Cost of Cloud Lock-in
Every cloud provider wants you all-in. AWS pushes Glue + Athena + Lake Formation. Azure nudges you toward Synapse + Purview + ADLS. GCP pitches BigQuery + Dataplex + GCS. Each stack works well — until your company acquires a subsidiary running on a different cloud, your board mandates a multi-cloud strategy, or your primary provider announces a 30% price increase with 12 months' notice.
Cloud lock-in isn't just about vendor dependency. It manifests in three concrete ways:
- Data gravity — Petabytes of data in proprietary formats that cost a fortune to move
- Skill lock-in — Teams trained exclusively on one provider's tooling and mental models
- Contract leverage — Zero negotiation power when 100% of your workloads sit in one cloud
A cloud-agnostic data lakehouse doesn't mean running everything everywhere simultaneously. It means designing your architecture so that migrating workloads between clouds is an engineering project, not a rewrite.
Open Table Formats: Delta Lake vs Iceberg vs Hudi
The table format is the foundation of portability. If your data is stored in a proprietary format, nothing else matters — you're locked in at the storage layer.
Feature Comparison
| Dimension | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| Origin | Databricks (2019) | Netflix → Apache (2018) | Uber → Apache (2017) |
| Governance | Databricks-led OSS + proprietary extensions | Apache Foundation, vendor-neutral | Apache Foundation, vendor-neutral |
| Cloud Portability | High (via UniForm) | Highest — designed cloud-agnostic from day one | Medium — strongest on AWS |
| Engine Support | Spark, Trino, Flink, Presto, DuckDB | Spark, Trino, Flink, Presto, Dremio, Snowflake, BigQuery | Spark, Trino, Flink, Presto |
| Catalog Interop | Unity Catalog, HMS, Glue, Polaris (via UniForm) | HMS, Glue, Polaris, Nessie, Unity Catalog | HMS, Glue |
| Schema Evolution | Add/rename/reorder columns | Add/rename/reorder/drop columns, partition evolution | Add columns, limited evolution |
| Time Travel | Transaction log-based | Snapshot-based, immutable | Timeline-based |
| Streaming | Structured Streaming native | Flink integration improving rapidly | Strongest streaming story (MoR tables) |
| Adoption Trend (2026) | Dominant in Databricks shops | Fastest-growing, multi-vendor momentum | Niche, primarily AWS/Uber ecosystem |
| UniForm / Interop | Delta UniForm reads as Iceberg | Native | Limited cross-format reads |
When to Choose What
Choose Delta Lake when Databricks is your primary compute engine and you want the tightest integration. UniForm now bridges the gap to Iceberg-compatible readers, giving you a reasonable portability story without leaving the Delta ecosystem.
Choose Apache Iceberg when multi-engine access and vendor neutrality are non-negotiable requirements. Iceberg's catalog-level design and partition evolution make it the strongest choice for organizations running workloads across multiple clouds or engines.
Choose Apache Hudi when your primary use case is streaming ingestion with record-level upserts on AWS. Hudi's Merge-on-Read tables still outperform alternatives for high-frequency CDC pipelines — but the ecosystem support gap is widening. [VERIFY]
My take: For new cloud-agnostic architectures in 2026, Iceberg is the pragmatic default. Delta Lake is the right choice if you're already invested in Databricks. Hudi is increasingly hard to justify for greenfield projects.
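The heuristic above can be sketched as a small decision function. This is a toy encoding of this article's rules of thumb, not official guidance from any of the projects; the parameter names are illustrative:

```python
def pick_table_format(primary_engine: str,
                      multi_engine: bool,
                      streaming_upserts_on_aws: bool) -> str:
    """Toy encoding of the table-format heuristic discussed above."""
    # Hudi: streaming CDC upserts on AWS, single-engine shop
    if streaming_upserts_on_aws and not multi_engine:
        return "hudi"
    # Delta: Databricks-first, no hard multi-engine requirement
    if primary_engine == "databricks" and not multi_engine:
        return "delta"
    # Iceberg: the pragmatic default for cloud-agnostic designs
    return "iceberg"
```

Any real decision also weighs team skills, existing estate, and vendor contracts; the point is that the choice reduces to a handful of explicit inputs.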
Terraform for Multi-Cloud Databricks
Terraform is the lingua franca of cloud-agnostic infrastructure. For a multi-cloud lakehouse, it handles the hardest part: making three fundamentally different cloud environments look similar enough to run the same workloads.
What Terraform Manages
A multi-cloud Databricks deployment with Terraform typically covers:
- Workspace provisioning — Databricks workspaces on AWS, Azure, and GCP with consistent naming, tags, and network configurations
- Storage backends — S3 buckets, ADLS containers, or GCS buckets with identical IAM policies (translated per cloud)
- Unity Catalog — Metastore creation, external locations, and credential management across clouds
- Cluster policies — Standardized compute configurations enforced across all environments
- Networking — VPC/VNet peering, private endpoints, and firewall rules per cloud
The Module Pattern
The key architectural pattern is a shared module with cloud-specific implementations:
```
terraform/
├── modules/
│   ├── lakehouse-core/    # Cloud-agnostic: catalog, schemas, permissions
│   ├── lakehouse-aws/     # AWS-specific: S3, IAM roles, VPC
│   ├── lakehouse-azure/   # Azure-specific: ADLS, service principals, VNet
│   └── lakehouse-gcp/     # GCP-specific: GCS, service accounts, VPC
├── environments/
│   ├── aws-prod/
│   ├── azure-prod/
│   └── gcp-staging/
└── variables/
    └── shared.tfvars      # Cross-cloud defaults
```
The lakehouse-core module defines the logical architecture. The cloud-specific modules translate that into provider-native resources. Environment directories compose them together.
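The composition itself is written in HCL, but the mapping each environment directory encodes can be illustrated in a few lines of Python. Module and identity names below mirror the tree above; everything else is a placeholder:

```python
# Illustrative sketch only: shows the core + cloud-specific pairing
# that each environment directory composes in Terraform.
CLOUD_MODULES = {
    "aws":   {"storage": "lakehouse-aws",   "identity": "iam_role"},
    "azure": {"storage": "lakehouse-azure", "identity": "service_principal"},
    "gcp":   {"storage": "lakehouse-gcp",   "identity": "service_account"},
}

def compose_environment(cloud: str, env: str) -> dict:
    """Pair the cloud-agnostic core module with a provider-specific one."""
    specific = CLOUD_MODULES[cloud]
    return {
        "core_module": "lakehouse-core",      # catalog, schemas, permissions
        "cloud_module": specific["storage"],  # provider-native resources
        "identity_kind": specific["identity"],
        "workspace_name": f"lakehouse-{cloud}-{env}",
    }
```

The value of the pattern is that the logical architecture (the core module) never changes; only the right-hand column of `CLOUD_MODULES` does.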
What Terraform Can't Solve
Terraform handles infrastructure, not application logic. You still need:
- Data replication strategy — How data moves between clouds (if it needs to)
- Job orchestration — Airflow, Prefect, or Databricks Workflows running cross-cloud DAGs
- Secret management — Vault or provider-native secret stores with a unified interface
- Monitoring — Datadog, Grafana, or a provider-agnostic observability stack
Storage Abstraction: S3 / ADLS / GCS
At the storage layer, all three clouds offer object storage that's functionally equivalent for lakehouse workloads. The differences are in naming, authentication, and performance characteristics.
Storage Layer Comparison
| Dimension | AWS S3 | Azure ADLS Gen2 | GCP GCS |
|---|---|---|---|
| Path Format | s3://bucket/path | abfss://container@account.dfs.core.windows.net/path | gs://bucket/path |
| Auth Model | IAM Roles / Instance Profiles | Service Principals / Managed Identity | Service Accounts / Workload Identity |
| Hierarchical Namespace | Flat (prefix-based) | Native HNS | Flat (prefix-based) |
| Consistency | Strong (since 2020) | Strong | Strong |
| Typical Egress Cost | $0.09/GB [PRICING-CHECK] | $0.087/GB [PRICING-CHECK] | $0.12/GB [PRICING-CHECK] |
| Cross-Region Replication | S3 Replication | GRS / RA-GRS | Dual/Multi-region buckets |
| Iceberg Support | Native via Glue/S3 Tables | Native via Unity Catalog | Native via BigLake |
The Abstraction Strategy
Don't build a custom storage abstraction layer. Instead:
- Use open table formats — Delta/Iceberg metadata is path-based. The table format abstracts the storage protocol.
- Configure storage in the catalog — Unity Catalog external locations or Iceberg catalogs map logical table names to physical cloud paths.
- Standardize IAM patterns — Each cloud's authentication is different, but the pattern (service identity → role → storage access) is the same. Terraform modules encode this pattern.
The goal isn't to make S3 and ADLS look identical in code. It's to ensure that switching the storage backend is a Terraform variable change and a data migration, not a rewrite of every pipeline.
Compute Layer: Databricks vs EMR vs Dataproc
Compute is where cloud-agnostic gets expensive. Running the same engine across all three clouds is the simplest path, but it comes with trade-offs.
Compute Comparison
| Dimension | Databricks (Multi-Cloud) | AWS EMR | GCP Dataproc |
|---|---|---|---|
| Available On | AWS, Azure, GCP | AWS only | GCP only |
| Engine | Photon (optimized Spark) | Spark, Hive, Presto, Flink | Spark, Hive, Presto, Flink |
| Managed Delta/Iceberg | Native Delta + Iceberg via UniForm | Iceberg native, Delta via OSS | Iceberg via BigLake, Delta via OSS |
| Unity Catalog | Yes (cross-cloud) | No | No |
| Auto-Scaling | Native cluster autoscaling | YARN-based | YARN-based |
| Serverless Option | Serverless SQL + Jobs | EMR Serverless | Dataproc Serverless |
| DBU/Compute Cost | $0.07–0.55/DBU [PRICING-CHECK] | $0.015–0.27/hr per instance [PRICING-CHECK] | $0.01–0.20/hr per vCPU [PRICING-CHECK] |
| Portability | High (same API across clouds) | None (AWS-only) | None (GCP-only) |
The Pragmatic Choice
Databricks as the cross-cloud compute layer is the most common pattern for organizations that need genuine multi-cloud. The same notebooks, jobs, and SQL warehouses work identically on AWS, Azure, and GCP. Unity Catalog provides a single governance plane across all three.
The trade-off: Databricks pricing premiums over provider-native options. For a 100-node Spark workload, you might pay 20–40% more than running EMR or Dataproc directly. [PRICING-CHECK]
When provider-native compute makes sense: If you're running 80%+ of workloads on one cloud with occasional burst to another, use the native compute service for your primary cloud and accept the migration cost for the rare case.
Catalog Layer: Unity Catalog vs Glue vs Polaris
The catalog is the control plane of your lakehouse. It determines who can see what data, where it lives, and how engines discover it.
Catalog Comparison
| Dimension | Unity Catalog | AWS Glue Data Catalog | Apache Polaris (Snowflake) |
|---|---|---|---|
| Multi-Cloud | Yes (AWS, Azure, GCP) | AWS only | Cloud-agnostic (open source) |
| Table Formats | Delta, Iceberg (via UniForm) | Iceberg, Hudi, Delta (limited) | Iceberg only |
| Governance | RBAC + ABAC, column masking, row filters | IAM-based, Lake Formation | REST catalog spec, engine-level auth |
| Data Lineage | Built-in | None (use third-party) | None |
| Data Sharing | Delta Sharing (open protocol) | Lake Formation cross-account | Iceberg REST catalog protocol |
| Vendor Lock-in Risk | Medium — Databricks-specific but open protocol | High — AWS-only | Low — open source Apache project |
| Maturity (2026) | Production-grade, widely adopted | Production-grade, AWS-dominant | Early production, growing fast [VERIFY] |
Catalog Strategy for Multi-Cloud
For genuine cloud-agnostic architectures, two viable patterns exist:
1. Unity Catalog as the universal plane — Works if Databricks is your primary compute. UC spans all three clouds, provides lineage, and supports Delta Sharing for external consumers. The risk: you're betting on Databricks' continued multi-cloud investment.
2. Polaris as the open alternative — Apache Polaris implements the Iceberg REST Catalog specification. Any Iceberg-compatible engine can discover and query tables. No vendor dependency, but you lose built-in lineage and governance features. Pair it with OpenMetadata or DataHub for the governance layer.
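The Iceberg REST Catalog protocol that Polaris implements is just HTTP with well-defined routes. A sketch of the URL construction (the base URL is a placeholder; real deployments also prepend a server-supplied prefix from the config endpoint, omitted here for brevity):

```python
from urllib.parse import quote

def list_namespaces_url(base: str) -> str:
    """GET route for listing namespaces, per the Iceberg REST catalog spec."""
    return f"{base.rstrip('/')}/v1/namespaces"

def load_table_url(base: str, namespace: list[str], table: str) -> str:
    """GET route for loading a table's metadata. Multi-level namespaces
    are joined with the 0x1F unit separator and percent-encoded (%1F),
    as the spec requires."""
    ns = quote("\x1f".join(namespace), safe="")
    return f"{base.rstrip('/')}/v1/namespaces/{ns}/tables/{quote(table, safe='')}"
```

Because any engine that speaks these routes can use the catalog, swapping Polaris for another REST-catalog implementation is a connection-string change, not a migration.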
Decision Matrix: Cloud-Agnostic vs All-In
Not every organization needs a multi-cloud lakehouse. Here's a structured framework for the decision.
| Factor | Favors Cloud-Agnostic | Favors All-In Single Cloud |
|---|---|---|
| Regulatory requirements | Multi-region/multi-jurisdiction mandates | Single-country data residency |
| M&A activity | Frequent acquisitions with mixed cloud estates | Stable organizational structure |
| Vendor negotiation | Need pricing leverage against providers | Strong existing enterprise agreement |
| Team size | 10+ data engineers who can handle complexity | Small team, need simplicity |
| Data volume | < 50 TB (migration feasible) | > 500 TB (data gravity too strong) |
| Workload diversity | Multiple engines needed (Spark, Trino, Flink) | Single engine sufficient |
| Time to market | Can invest 3–6 months in platform | Need production in weeks |
| Annual cloud spend | > $500K (lock-in risk is material) | < $100K (portability cost exceeds lock-in cost) |
The Honest Assessment
Cloud-agnostic architecture adds 15–30% engineering overhead in the design and build phase. It pays off when:
- You operate across multiple clouds today (post-M&A is the #1 driver)
- Your annual cloud spend exceeds $500K and you need negotiation leverage
- Regulatory requirements mandate geographic or provider diversification
It's often not worth it when:
- You're a single-cloud shop with < $200K annual spend
- Your team is small and needs to ship fast
- Your data gravity (500+ TB in one cloud) makes migration impractical regardless
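The matrix and the assessment above can be condensed into a toy scoring function. The thresholds are this article's rules of thumb, not a validated model; treat it as a conversation starter, not an answer:

```python
def lean_cloud_agnostic(annual_spend_usd: float, data_tb: float,
                        team_size: int, multi_cloud_today: bool) -> bool:
    """Toy scoring of the decision matrix; thresholds come from the
    article's tables and are rough heuristics."""
    score = 0
    score += 1 if annual_spend_usd > 500_000 else -1   # negotiation leverage
    # Data gravity: <50 TB is feasible to move, >500 TB rarely is
    score += 1 if data_tb < 50 else (-1 if data_tb > 500 else 0)
    score += 1 if team_size >= 10 else -1              # complexity budget
    score += 2 if multi_cloud_today else 0             # e.g. post-M&A estates
    return score > 0
```

A post-M&A shop with large spend and a sizable team scores clearly positive; a small single-cloud team with heavy data gravity scores clearly negative, which matches the honest assessment above.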
Architecture Diagram
[Architecture diagram: Terraform-provisioned infrastructure across AWS, Azure, and GCP; open table formats in object storage; a shared catalog layer; compute engines querying through the catalog]
How it flows:
① Terraform provisions identical infrastructure patterns across AWS, Azure, and GCP — storage buckets, IAM roles, networking, and compute clusters.
② Cloud storage (S3, ADLS, GCS) holds data in open table formats (Iceberg or Delta with UniForm), ensuring any engine can read the data regardless of where it's stored.
③ The catalog layer (Unity Catalog or Polaris) provides a single logical namespace across all clouds, handling discovery, governance, and access control.
④ Compute engines — Databricks for managed workloads, open-source engines (Trino, Flink, Spark) for specialized or cost-sensitive workloads — query data through the catalog.
⑤ Consumers access data through the compute layer, unaware of which cloud the data physically resides on.
Cost Considerations
Building cloud-agnostic isn't free. Budget for these cost categories:
| Cost Category | Single-Cloud Baseline | Multi-Cloud Premium | Notes |
|---|---|---|---|
| Infrastructure (Terraform) | 1x | 1.5–2x | Maintaining modules for 3 providers [PRICING-CHECK] |
| Compute (Databricks) | $0.07–0.55/DBU | Same per-cloud, but higher total | Running in multiple regions increases cost [PRICING-CHECK] |
| Data Egress | Minimal | $0.08–0.12/GB cross-cloud | The silent killer — plan data locality carefully [PRICING-CHECK] |
| Engineering Overhead | 1x | 1.2–1.4x | Abstraction layers, testing across clouds |
| Catalog (Unity Catalog) | Included with Databricks | Included with Databricks | Polaris is free (OSS) but requires self-hosting |
| Observability | Provider-native (lower cost) | Cross-cloud tooling (Datadog, Grafana) | Provider-native monitoring doesn't span clouds [PRICING-CHECK] |
Rule of thumb: A well-designed cloud-agnostic lakehouse costs 20–35% more than an equivalent single-cloud deployment. The premium decreases as workload volume increases because the portability layer is mostly fixed cost. [PRICING-CHECK]
Where Harbinger Explorer Fits
If you're exploring data across multiple cloud environments during the design phase of a lakehouse architecture, Harbinger Explorer can help. Its browser-based DuckDB engine lets you query CSV exports, API responses, and uploaded datasets from any cloud — no infrastructure setup required. Useful for quick cross-cloud data profiling before committing to a table format or catalog strategy.
Key Takeaways
The cloud-agnostic data lakehouse is a real architecture pattern, not a vendor fantasy — but it demands disciplined engineering. Start with open table formats (Iceberg or Delta with UniForm) as the non-negotiable foundation. Use Terraform to encode infrastructure patterns that translate across clouds. Pick your catalog based on whether you're willing to depend on Databricks (Unity Catalog) or want full independence (Polaris). And be honest about whether your organization actually needs multi-cloud portability — for many teams, the 20–35% cost premium isn't justified.
The best time to design for portability is before you have 500 TB locked into a proprietary format. The second-best time is now.
Continue Reading
- Data Lakehouse Architecture: Delta Lake vs Apache Iceberg in 2026
- Terraform for Data Engineers: Infrastructure as Code Patterns
- Unity Catalog Deep Dive: Governance Across Clouds
Markers
[VERIFY]:
- Hudi MoR performance claim for high-frequency CDC vs Delta/Iceberg
- Apache Polaris maturity status in 2026
[PRICING-CHECK]:
- AWS S3 egress: $0.09/GB
- Azure ADLS egress: $0.087/GB
- GCP GCS egress: $0.12/GB
- Databricks DBU pricing: $0.07–0.55/DBU
- EMR pricing: $0.015–0.27/hr per instance
- Dataproc pricing: $0.01–0.20/hr per vCPU
- Databricks premium over native: 20–40%
- Multi-cloud Terraform maintenance multiplier: 1.5–2x
- Cross-cloud data egress: $0.08–0.12/GB
- Cross-cloud observability tooling cost
- Overall cloud-agnostic premium: 20–35%