
Cloud-Agnostic Data Lakehouse: Portable Architectures with Terraform, Delta, and Iceberg

TL;DR

Cloud lock-in is the silent tax on your data platform. This article walks through a practical architecture for a cloud-agnostic data lakehouse built on Terraform, Delta Lake or Apache Iceberg, and abstracted storage and compute layers. It includes comparison tables, a decision matrix, and an architecture diagram: everything you need to evaluate whether multi-cloud portability is worth the engineering investment for your org.

The Real Cost of Cloud Lock-in

Every cloud provider wants you all-in. AWS pushes Glue + Athena + Lake Formation. Azure steers you toward Synapse + Purview + ADLS. GCP pitches BigQuery + Dataplex + GCS. Each stack works, until your company acquires a subsidiary that runs on a different cloud, your board demands a multi-cloud strategy, or your primary provider announces a 30% price increase with 12 months' notice.

Cloud lock-in is more than vendor dependency. It shows up in three concrete ways:

Data gravity: petabytes of data in proprietary formats that cost a fortune to move
Skill lock-in: teams trained exclusively on one provider's tooling
Contract leverage: zero negotiating power when 100% of your workloads sit in one cloud

A cloud-agnostic data lakehouse does not mean running everything everywhere at once. It means designing your architecture so that moving a workload between clouds is an engineering project, not a rewrite.

Open Table Formats: Delta Lake vs. Iceberg vs. Hudi

The table format is the foundation of portability. If your data is stored in a proprietary format, nothing else matters: you are locked in at the storage layer.

Feature Comparison

Dimension	Delta Lake	Apache Iceberg	Apache Hudi
Origin	Databricks (open-sourced 2019)	Netflix → Apache (2018)	Uber (2017) → Apache (2019)
Governance	Linux Foundation project, Databricks-led, plus proprietary extensions on Databricks	Apache Software Foundation, vendor-neutral	Apache Software Foundation, vendor-neutral
Cloud portability	High (since UniForm)	Highest: cloud-agnostic from day one	Medium: strongest on AWS
Engine support	Spark, Trino, Flink, Presto, DuckDB	Spark, Trino, Flink, Presto, Dremio, Snowflake, BigQuery	Spark, Trino, Flink, Presto
Catalog interop	Unity Catalog, HMS, Glue, Polaris (via UniForm)	HMS, Glue, Polaris, Nessie, Unity Catalog	HMS, Glue
Schema evolution	Add/rename/reorder columns	Add/rename/reorder/drop columns, partition evolution	Add columns, limited evolution
Time travel	Transaction-log-based	Snapshot-based, immutable	Timeline-based
Streaming	Native Structured Streaming	Flink integration improving rapidly	Strongest streaming (MoR tables)
Adoption trend (2026)	Dominant in Databricks shops	Fastest-growing, multi-vendor momentum	Niche, primarily AWS/Uber
UniForm / interop	UniForm exposes Delta tables to Iceberg readers	Native	Limited cross-format reads

When to Choose What

Choose Delta Lake if Databricks is your primary compute engine and you want the tightest integration. UniForm now bridges to Iceberg readers, which gives you reasonable portability without leaving the Delta ecosystem.

Choose Apache Iceberg if multi-engine access and vendor neutrality are non-negotiable. Iceberg's catalog-level design and partition evolution make it the strongest choice for multi-cloud, multi-engine setups.

Choose Apache Hudi if your primary use case is streaming ingestion with record-level upserts on AWS. Hudi's merge-on-read (MoR) tables outperform the alternatives for high-frequency CDC pipelines, but the ecosystem gap is widening.

Recommendation: for new cloud-agnostic architectures in 2026, Iceberg is the pragmatic default. Delta Lake is the right call if you are already invested in Databricks. Hudi is increasingly hard to justify for greenfield projects.

Terraform for Multi-Cloud Databricks

Terraform is the lingua franca of cloud-agnostic infrastructure. For a multi-cloud lakehouse it handles the hardest part: making three fundamentally different cloud environments look similar enough to run the same workloads.

What Terraform Manages

A multi-cloud Databricks deployment with Terraform typically covers:

Workspace provisioning: Databricks workspaces on AWS, Azure, and GCP with consistent naming, tags, and network configs
Storage backends: S3 buckets, ADLS containers, or GCS buckets with equivalent access policies (translated per cloud)
Unity Catalog: metastore creation, external locations, and credential management across clouds
Cluster policies: standardized compute configs enforced across all environments
Networking: VPC/VNet peering, private endpoints, and firewall rules per cloud

The Module Pattern

The key architectural pattern is shared modules with cloud-specific implementations:

terraform/
├── modules/
│   ├── lakehouse-core/        # Cloud-agnostic: catalog, schemas, permissions
│   ├── lakehouse-aws/         # AWS-specific: S3, IAM roles, VPC
│   ├── lakehouse-azure/       # Azure-specific: ADLS, service principals, VNet
│   └── lakehouse-gcp/         # GCP-specific: GCS, service accounts, VPC
├── environments/
│   ├── aws-prod/
│   ├── azure-prod/
│   └── gcp-staging/
└── variables/
    └── shared.tfvars          # Cross-cloud defaults

The lakehouse-core module defines the logical architecture. The cloud-specific modules translate it into provider-native resources. The environment directories compose them.
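
A minimal Python sketch of how an environment directory could pair the modules with its variables. The module names come from the tree above; everything else (the `SHARED_DEFAULTS` keys, the Azure and GCP regions, the helper names) is a hypothetical illustration, not part of any Terraform API. Terraform can load variable files written as JSON (`*.tfvars.json`), which is the format `render_tfvars` produces.

```python
import json

# Cross-cloud defaults, playing the role of variables/shared.tfvars.
# Keys and values are hypothetical.
SHARED_DEFAULTS = {
    "catalog_name": "lakehouse",
    "tags": {"owner": "data-platform"},
}

# One entry per directory under environments/. Regions are illustrative.
ENVIRONMENTS = {
    "aws-prod": {"cloud": "aws", "region": "eu-central-1"},
    "azure-prod": {"cloud": "azure", "region": "westeurope"},
    "gcp-staging": {"cloud": "gcp", "region": "europe-west3"},
}


def plan_environment(env_name: str) -> dict:
    """Pair lakehouse-core with the cloud-specific module and merge variables."""
    env = ENVIRONMENTS[env_name]
    return {
        # Informational only: module sources are declared in HCL, not in tfvars.
        "modules": ["lakehouse-core", f"lakehouse-{env['cloud']}"],
        "tfvars": {
            **SHARED_DEFAULTS,
            "environment": env_name,
            "region": env["region"],
        },
    }


def render_tfvars(env_name: str) -> str:
    """Serialize the merged variables as the content of a *.tfvars.json file."""
    return json.dumps(plan_environment(env_name)["tfvars"], indent=2, sort_keys=True)


if __name__ == "__main__":
    print(plan_environment("azure-prod")["modules"])
    print(render_tfvars("azure-prod"))
```

The shared defaults stay in one place, and each environment only states what differs. That is the property that keeps three clouds maintainable.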

What Terraform Does Not Solve

Terraform handles infrastructure, not application logic. You still need:

Data replication strategy: how data moves between clouds (if it has to)
Job orchestration: Airflow, Prefect, or Databricks Workflows with cross-cloud DAGs
Secret management: Vault or provider-native secret stores behind a unified interface
Monitoring: Datadog, Grafana, or another provider-agnostic observability stack

Storage Abstraction: S3 / ADLS / GCS

At the storage layer, all three clouds offer object storage that is functionally equivalent for lakehouse workloads. The differences are in naming, auth, and performance.

Storage Layer Comparison

Dimension	AWS S3	Azure ADLS Gen2	GCP GCS
Path format	`s3://bucket/path`	`abfss://container@account.dfs.core.windows.net/path`	`gs://bucket/path`
Auth model	IAM roles / instance profiles	Service principals / managed identity	Service accounts / workload identity
Hierarchical namespace	Flat (prefix-based)	Native HNS	Flat (prefix-based)
Consistency	Strong (since 2020)	Strong	Strong
Typical egress cost	0.09 USD/GB	0.087 USD/GB	0.12 USD/GB
Cross-region replication	S3 Replication	GRS / RA-GRS	Dual-/multi-region buckets
Iceberg support	Native via Glue / S3 Tables	Native via Unity Catalog	Native via BigLake

Abstraction Strategy

Don't build a custom storage abstraction layer. Instead:

Use open table formats: Delta and Iceberg metadata is path-based, so the table format abstracts the storage protocol.
Configure storage in the catalog: Unity Catalog external locations or Iceberg catalogs map logical table names to physical cloud paths.
Standardize IAM patterns: every cloud's auth is different, but the pattern (service identity → role → storage access) is the same. Terraform modules encode it.

The goal is not to make S3 and ADLS look identical in code. The goal is that switching the storage backend takes a Terraform variable change plus a data migration, not a rewrite of every pipeline.
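
To make the catalog mapping concrete, here is a small sketch that resolves a logical location to the provider-specific URI formats from the comparison table. `StorageLocation` and `to_uri` are hypothetical names. In a real setup this mapping lives in Unity Catalog external locations or an Iceberg catalog, not in pipeline code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageLocation:
    """Logical storage location; field names are hypothetical."""
    cloud: str          # "aws", "azure", or "gcp"
    container: str      # bucket (S3/GCS) or container (ADLS)
    path: str
    account: str = ""   # ADLS storage account; unused for S3 and GCS


def to_uri(loc: StorageLocation) -> str:
    """Translate a logical location into the cloud's native path format."""
    path = loc.path.strip("/")
    if loc.cloud == "aws":
        return f"s3://{loc.container}/{path}"
    if loc.cloud == "azure":
        if not loc.account:
            raise ValueError("ADLS Gen2 needs a storage account name")
        return f"abfss://{loc.container}@{loc.account}.dfs.core.windows.net/{path}"
    if loc.cloud == "gcp":
        return f"gs://{loc.container}/{path}"
    raise ValueError(f"unsupported cloud: {loc.cloud}")


if __name__ == "__main__":
    for loc in (
        StorageLocation("aws", "lake", "bronze/orders"),
        StorageLocation("azure", "lake", "bronze/orders", account="acmedata"),
        StorageLocation("gcp", "lake", "bronze/orders"),
    ):
        print(to_uri(loc))
```

Pipelines refer to the table by its logical name. Only this one mapping changes when the backend does.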

Compute Layer: Databricks vs. EMR vs. Dataproc

Compute is where cloud-agnostic gets expensive. Running the same engine across all clouds is the simplest path, but it comes with trade-offs.

Compute Comparison

Dimension	Databricks (multi-cloud)	AWS EMR	GCP Dataproc
Available on	AWS, Azure, GCP	AWS only	GCP only
Engine	Spark with Photon (vectorized engine)	Spark, Hive, Presto, Flink	Spark, Hive, Presto, Flink
Managed Delta/Iceberg	Native Delta, Iceberg via UniForm	Iceberg native, Delta via OSS	Iceberg via BigLake, Delta via OSS
Unity Catalog	Yes (cross-cloud)	No	No
Auto-scaling	Platform-managed cluster autoscaling	YARN-based	YARN-based
Serverless option	Serverless SQL + Jobs	EMR Serverless	Dataproc Serverless
DBU/compute cost	0.07-0.55 USD/DBU	0.015-0.27 USD/h per instance	0.01-0.20 USD/h per vCPU
Portability	High (same API across clouds)	None (AWS only)	None (GCP only)

The Pragmatic Choice

Databricks as a cross-cloud compute layer is the most common pattern for orgs with a genuine multi-cloud need. The same notebooks, jobs, and SQL warehouses work the same way on AWS, Azure, and GCP. Unity Catalog provides a single governance plane.

The trade-off: Databricks charges a premium over provider-native options. For a 100-node Spark workload, expect to pay 20-40% more than running EMR or Dataproc directly.

When provider-native compute makes sense: if you run 80%+ of your workloads on one cloud with only an occasional burst to another, use native compute on your primary cloud and accept the migration cost for the rare cases.
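
A worked example of how such a premium can arise. The sketch assumes classic (non-serverless) clusters, where you pay the cloud provider for the VMs and the platform a per-node fee on top. All rates are hypothetical values picked from within the ranges in the comparison table, not quotes.

```python
def hourly_cluster_cost(nodes: int, vm_usd_per_hour: float,
                        platform_usd_per_node_hour: float) -> float:
    """Hourly cost of a cluster: VM price plus platform fee, per node."""
    return nodes * (vm_usd_per_hour + platform_usd_per_node_hour)


def premium(candidate: float, baseline: float) -> float:
    """Relative extra cost of candidate over baseline (0.2 means 20% more)."""
    return candidate / baseline - 1.0


NODES = 100
VM_RATE = 0.40            # hypothetical USD per node-hour for the VM itself
DBU_PRICE = 0.20          # hypothetical USD/DBU (table range: 0.07-0.55)
DBU_PER_NODE_HOUR = 1.0   # hypothetical DBU consumption per node-hour
EMR_FEE = 0.10            # hypothetical USD per instance-hour (table range: 0.015-0.27)

databricks = hourly_cluster_cost(NODES, VM_RATE, DBU_PRICE * DBU_PER_NODE_HOUR)
emr = hourly_cluster_cost(NODES, VM_RATE, EMR_FEE)

if __name__ == "__main__":
    print(f"Databricks: {databricks:.2f} USD/h, EMR: {emr:.2f} USD/h")
    print(f"Premium: {premium(databricks, emr):.0%}")
```

With these inputs the premium comes out at about 20%, the low end of the range above. The result is sensitive to how many DBUs a node consumes per hour, so plug in your own instance types before drawing conclusions.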

Catalog Layer: Unity Catalog vs. Glue vs. Polaris

The catalog is the control plane of your lakehouse. It determines who sees what, where data lives, and how engines discover it.

Catalog Comparison

Dimension	Unity Catalog	AWS Glue Data Catalog	Apache Polaris (originated at Snowflake)
Multi-cloud	Yes (AWS, Azure, GCP)	AWS only	Cloud-agnostic (OSS)
Table formats	Delta, Iceberg (via UniForm)	Iceberg, Hudi, Delta (limited)	Iceberg only
Governance	RBAC + ABAC, column masking, row filters	IAM-based, Lake Formation	REST catalog spec, engine-level auth
Data lineage	Built in	None (third-party)	None
Data sharing	Delta Sharing (open protocol)	Lake Formation cross-account	Iceberg REST catalog protocol
Vendor lock-in risk	Medium: Databricks-specific, but open protocol	High: AWS only	Low: Apache OSS
Maturity (2026)	Production-grade, widely adopted	Production-grade, AWS-dominant	Early production, fast-growing

Catalog Strategy for Multi-Cloud

Two patterns are viable:

Unity Catalog as the universal plane: works when Databricks is your primary compute. UC spans all three clouds, provides lineage, and supports Delta Sharing for external consumers. The risk: you are betting on Databricks continuing to invest in multi-cloud.
Polaris as the open alternative: Apache Polaris implements the Iceberg REST catalog spec, so any Iceberg-compatible engine can discover and query tables. There is no vendor dependency, but you lose built-in lineage and governance. Pair it with OpenMetadata or DataHub.
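
A sketch of what engine-side configuration against an Iceberg REST catalog can look like. The property names (`type`, `uri`, `warehouse`, `credential`) follow the style of PyIceberg's REST catalog settings as an assumption; check them against the client you actually use. The endpoint URLs and the warehouse name are hypothetical placeholders.

```python
def rest_catalog_properties(uri: str, warehouse: str, credential: str = "") -> dict:
    """Build connection properties for an Iceberg REST catalog client."""
    if not uri.startswith(("http://", "https://")):
        raise ValueError("REST catalog URI must be an HTTP(S) endpoint")
    props = {"type": "rest", "uri": uri, "warehouse": warehouse}
    if credential:
        # Never hard-code this; load it from your secret store.
        props["credential"] = credential
    return props


# Hypothetical endpoints: one catalog deployment per cloud. A single shared
# endpoint reachable from every cloud would work the same way.
CATALOGS = {
    "aws": rest_catalog_properties("https://catalog.aws.example.com", "lakehouse"),
    "gcp": rest_catalog_properties("https://catalog.gcp.example.com", "lakehouse"),
}

if __name__ == "__main__":
    for cloud, props in CATALOGS.items():
        print(cloud, props)
```

Every engine gets the same three facts: protocol, endpoint, warehouse. The catalog, not the engine, decides where tables physically live.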

Decision Matrix: Cloud-Agnostic vs. All-In

Not every org needs a multi-cloud lakehouse. Here is a structured framework for the decision.

Factor	Favors cloud-agnostic	Favors all-in single cloud
Regulatory requirements	Multi-region / multi-jurisdiction mandates	Single-country data residency
M&A activity	Frequent acquisitions with mixed clouds	Stable org structure
Vendor negotiation	You need pricing leverage	Strong existing enterprise agreement
Team size	10+ data engineers to absorb the complexity	Small team that needs simplicity
Data volume	< 50 TB (migration feasible)	> 500 TB (data gravity too strong)
Workload diversity	Multiple engines needed (Spark, Trino, Flink)	Single engine is enough
Time to market	Can invest 3-6 months in the platform	Production in weeks
Annual cloud spend	> 500k USD (lock-in risk is material)	< 100k USD (portability costs more than lock-in)

Honest Assessment

A cloud-agnostic architecture adds 15-30% engineering overhead in the design and build phase. It is worth it when:

You already operate across multiple clouds today (post-M&A is the #1 driver)
Your annual cloud spend exceeds 500k USD and you need negotiating leverage
Regulatory requirements demand geographic or provider diversification

It is often not worth it when:

You are a single-cloud shop spending < 200k USD/year
You have a small team that needs to ship fast
Data gravity (500+ TB in one cloud) makes migration impractical
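
The two lists above can be turned into a quick checklist. This is a rough sketch, not a scoring model: the thresholds come straight from the lists and the decision matrix (the 10-engineer cutoff is taken from the matrix), while `OrgProfile`, its field names, and `assess` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OrgProfile:
    """Inputs for the assessment; field names are hypothetical."""
    annual_cloud_spend_usd: float
    data_volume_tb: float
    data_engineers: int
    already_multi_cloud: bool
    needs_provider_diversification: bool


def assess(profile: OrgProfile) -> tuple[list[str], list[str]]:
    """Return (reasons for, reasons against) a cloud-agnostic design."""
    pro: list[str] = []
    contra: list[str] = []

    if profile.already_multi_cloud:
        pro.append("already operating across multiple clouds")
    if profile.annual_cloud_spend_usd > 500_000:
        pro.append("spend above 500k USD makes negotiating leverage valuable")
    if profile.needs_provider_diversification:
        pro.append("regulation demands geographic or provider diversification")

    if not profile.already_multi_cloud and profile.annual_cloud_spend_usd < 200_000:
        contra.append("single-cloud shop below 200k USD/year")
    if profile.data_engineers < 10:
        contra.append("small team that needs to ship fast")
    if profile.data_volume_tb >= 500:
        contra.append("data gravity makes migration impractical")

    return pro, contra


if __name__ == "__main__":
    pro, contra = assess(OrgProfile(800_000, 40, 12, True, False))
    print("For:", pro)
    print("Against:", contra)
```

The function deliberately returns reasons instead of a verdict. The weighting between them is a judgment call for your org.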

Architecture Diagram

graph TB
    subgraph "Infrastructure Layer"
        TF[Terraform Modules]:::blue
    end

    subgraph "Cloud Providers"
        AWS[AWS<br/>S3 + EMR]:::amber
        AZ[Azure<br/>ADLS + Databricks]:::blue
        GCP[GCP<br/>GCS + Dataproc]:::green
    end

    subgraph "Open Table Format"
        ICE[Apache Iceberg /<br/>Delta UniForm]:::purple
    end

    subgraph "Catalog Layer"
        UC[Unity Catalog /<br/>Apache Polaris]:::rose
    end

    subgraph "Compute Layer"
        DBX[Databricks<br/>Multi-Cloud]:::amber
        OSS[Trino / Flink /<br/>Spark OSS]:::green
    end

    subgraph "Consumers"
        BI[BI Tools]:::default
        DS[Data Science]:::default
        APP[Applications]:::default
    end

    TF -->|provisions| AWS
    TF -->|provisions| AZ
    TF -->|provisions| GCP

    AWS --> ICE
    AZ --> ICE
    GCP --> ICE

    ICE --> UC

    UC --> DBX
    UC --> OSS

    DBX --> BI
    DBX --> DS
    OSS --> APP

    classDef blue fill:#F0F4FF,stroke:#1e3a8a,stroke-width:2px,color:#1e1e2e,font-weight:bold
    classDef green fill:#F0FFF4,stroke:#14532d,stroke-width:2px,color:#1e1e2e,font-weight:bold
    classDef amber fill:#FFFBEB,stroke:#92400e,stroke-width:2px,color:#1e1e2e,font-weight:bold
    classDef rose fill:#FFF0F0,stroke:#7f1d1d,stroke-width:2px,color:#1e1e2e,font-weight:bold
    classDef purple fill:#FAF0FF,stroke:#4c1d95,stroke-width:2px,color:#1e1e2e,font-weight:bold
    classDef default fill:#FAFAF8,stroke:#1e1e2e,stroke-width:2px,color:#1e1e2e

How it flows:

Terraform provisions identical infrastructure patterns across AWS, Azure, and GCP: storage buckets, IAM roles, networking, and compute clusters.
Cloud storage (S3, ADLS, GCS) holds data in open table formats (Iceberg, or Delta with UniForm) so that any engine with Iceberg support can read it.
The catalog layer (Unity Catalog or Polaris) provides a single logical namespace across clouds for discovery, governance, and access control.
Compute engines query data through the catalog: Databricks for managed workloads, OSS engines (Trino, Flink, Spark) for specialized or cost-sensitive ones.
Consumers access data through the compute layer without needing to know where it physically lives.

Cost Considerations

Building cloud-agnostic is not free. Budget for these cost categories:

Cost category	Single-cloud baseline	Multi-cloud premium	Notes
Infrastructure (Terraform)	1x	1.5-2x	Maintaining modules for 3 providers
Compute (Databricks)	0.07-0.55 USD/DBU	Same per cloud, higher in total	Multiple regions increase cost
Data egress	Minimal	0.08-0.12 USD/GB cross-cloud	The silent killer: plan for data locality
Engineering overhead	1x	1.2-1.4x	Abstraction layers, cross-cloud testing
Catalog (Unity Catalog)	Included with Databricks	Included with Databricks	Polaris is free (OSS) but needs self-hosting
Observability	Provider-native (cheaper)	Cross-cloud tooling (Datadog, Grafana)	Provider-native tools do not span clouds

Rule of thumb: a well-designed cloud-agnostic lakehouse costs 20-35% more than an equivalent single-cloud setup. The premium shrinks as workload volume grows, because the portability layer is mostly fixed cost.
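
Why the premium shrinks with volume can be shown with a toy model: a fixed portability cost plus a small share that scales with spend. The 150k USD fixed cost and the 5% variable share are hypothetical numbers, chosen only so that mid-size budgets land inside the 20-35% band from the rule of thumb.

```python
def multi_cloud_premium(baseline_usd: float, fixed_portability_usd: float,
                        variable_share: float) -> float:
    """Extra cost of the portable setup as a fraction of the single-cloud baseline."""
    if baseline_usd <= 0:
        raise ValueError("baseline must be positive")
    extra = fixed_portability_usd + variable_share * baseline_usd
    return extra / baseline_usd


FIXED_USD = 150_000     # hypothetical: module upkeep, cross-cloud testing, tooling
VARIABLE_SHARE = 0.05   # hypothetical: egress and duplicated services

if __name__ == "__main__":
    for baseline in (500_000, 1_000_000, 2_000_000):
        share = multi_cloud_premium(baseline, FIXED_USD, VARIABLE_SHARE)
        print(f"{baseline:>9,} USD baseline -> {share:.1%} premium")
```

In this model the premium is 35% at a 500k USD baseline and 20% at 1M USD, and it keeps falling toward the variable share as spend grows.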

Where Harbinger Explorer Fits

If you are exploring data across multiple cloud environments during the design phase, Harbinger Explorer can help. Its browser-based DuckDB engine lets you query CSV exports, API responses, and datasets from any cloud with no infrastructure setup. That is useful for quick cross-cloud profiling before you commit to a table format or catalog strategy.

FAQ

When does a multi-cloud lakehouse really make sense for DACH companies? When GDPR requirements call for multi-region (e.g. eu-central-1 + eu-west-1) or provider diversification for sovereignty reasons. Otherwise, a single cloud in eu-central-1 is enough for most DACH use cases.

Iceberg or Delta for a new lakehouse in 2026? Iceberg is the pragmatic default for vendor neutrality. Pick Delta if Databricks is your primary engine and you use UniForm for Iceberg compatibility.

How much does multi-cloud egress cost in EUR? Cross-cloud egress runs at roughly 0.08-0.12 USD/GB (about 0.07-0.11 EUR/GB). At 10 TB of cross-cloud traffic per month, that is roughly 800-1,200 USD (around 1,000 EUR) for egress alone.

Is Apache Polaris worth it for solo founders or small teams? No, the self-hosting overhead is too high. Stick with Unity Catalog (if you are on Databricks) or Glue (if you are AWS-only). Polaris gets interesting at 10+ data engineers with a genuine multi-cloud need.
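
The egress figure from the FAQ is simple arithmetic and worth redoing with your own traffic numbers. The sketch assumes 1 TB = 1,000 GB and an exchange rate of 0.9 EUR per USD; both are assumptions, so adjust them to your provider's billing unit and the current rate.

```python
GB_PER_TB = 1000      # assumption: decimal terabytes
EUR_PER_USD = 0.9     # assumption: rough exchange rate


def monthly_egress_cost(tb_per_month: float, usd_per_gb: float,
                        eur_per_usd: float = EUR_PER_USD) -> tuple[float, float]:
    """Return the monthly egress cost as (USD, EUR)."""
    usd = tb_per_month * GB_PER_TB * usd_per_gb
    return usd, usd * eur_per_usd


if __name__ == "__main__":
    for rate in (0.08, 0.12):
        usd, eur = monthly_egress_cost(10, rate)
        print(f"10 TB at {rate:.2f} USD/GB: {usd:,.0f} USD (about {eur:,.0f} EUR)")
```

At 10 TB per month the range is 800-1,200 USD, which brackets the roughly 1,000 EUR quoted above.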

Key Takeaways

The cloud-agnostic data lakehouse is a real pattern, not a vendor fantasy, but it demands disciplined engineering. Start with open table formats (Iceberg, or Delta with UniForm) as the non-negotiable foundation. Use Terraform for infrastructure patterns. Choose your catalog based on Databricks dependency (Unity Catalog) or full independence (Polaris). And be honest about whether your org actually needs multi-cloud portability: for many teams the 20-35% cost premium is not justified.

The best time to design for portability is before you have 500 TB locked in a proprietary format. The second-best time is now.


As of May 14, 2026. Cloud providers adjust their prices, so verify critical assumptions directly with AWS, Azure, and GCP.

Written by

Harbinger Team

Cloud, data, and AI engineer in the DACH region. Has been writing about infrastructure-critical tech decisions since 2018: no marketing slides, just real trade-offs from production workloads.

More about Marc: hello@harbingerexplorer.com
