Data Lakehouse Architecture Explained
Data lakehouse architecture promises the best of both worlds: the cheap, scalable storage of a data lake combined with the reliability and performance of a data warehouse. But does it actually deliver, and when should you think twice before adopting one?
This article breaks down how data lakehouse architecture works, compares it to traditional approaches, and walks through the practical trade-offs data engineers face when building one.
What Is a Data Lakehouse?
A data lakehouse is a data management architecture that combines the low-cost object storage of data lakes with the transactional guarantees and schema enforcement of data warehouses. The key enabler: open table formats like Delta Lake, Apache Iceberg, and Apache Hudi that add ACID transactions, time travel, and schema evolution on top of Parquet files stored in cloud object storage.
Instead of maintaining separate systems for raw data (lake) and curated data (warehouse), a lakehouse collapses them into a single architecture with distinct layers.
Lakehouse Architecture Layers
The typical data lakehouse architecture follows a medallion pattern — bronze, silver, and gold layers — built on top of cloud object storage.
① Cloud Object Storage — All data lives on cheap, scalable blob storage (S3, ADLS, GCS). No proprietary formats.
② Bronze Layer — Raw data lands here unchanged. Append-only ingestion from APIs, CDC streams, or batch loads. Schema-on-read.
③ Silver Layer — Data is cleaned, deduplicated, and conformed to standard schemas. This is where data quality enforcement happens.
④ Gold Layer — Business-ready aggregates, dimensions, and fact tables optimized for query performance.
⑤ Consumers — BI dashboards, ML training pipelines, and ad-hoc analytics all query the same gold-layer tables.
Delta Lake Architecture: The Most Common Implementation
When people say "data lakehouse architecture," they often mean Delta Lake — the open table format originally built by Databricks. The Databricks lakehouse platform has made Delta Lake the de facto standard for lakehouse implementations, though Apache Iceberg is gaining ground fast.
Delta Lake adds a transaction log (_delta_log/) on top of Parquet files. Every write — insert, update, delete, or schema change — creates a new JSON entry in this log, enabling:
- ACID transactions across concurrent reads and writes
- Time travel via versioned snapshots
- Schema enforcement and evolution without rewriting data
- Z-ordering and data skipping for query performance
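To see how the log enables time travel, here's a minimal pure-Python sketch that replays two toy commits written in the style of Delta's `_delta_log` JSON actions (`add` registers a data file, `remove` tombstones one). The file names and commits are illustrative, not a real table:

```python
import json

# Two toy commits, each a newline-delimited list of JSON actions,
# mimicking files like _delta_log/00000000000000000000.json
commits = [
    '{"add": {"path": "part-0000.parquet", "dataChange": true}}\n'
    '{"add": {"path": "part-0001.parquet", "dataChange": true}}',
    '{"remove": {"path": "part-0000.parquet"}}\n'
    '{"add": {"path": "part-0002.parquet", "dataChange": true}}',
]

def active_files(commits, version):
    """Replay the log up to `version` to reconstruct that snapshot."""
    files = set()
    for entry in commits[: version + 1]:
        for line in entry.splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return sorted(files)

print(active_files(commits, 0))  # → ['part-0000.parquet', 'part-0001.parquet']
print(active_files(commits, 1))  # → ['part-0001.parquet', 'part-0002.parquet']
```

Reading `VERSION AS OF 0` is exactly this: replaying the log to an earlier commit instead of the latest one. Real Delta logs also carry `commitInfo`, `metaData`, and `protocol` actions, plus periodic Parquet checkpoints so readers don't replay every commit.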
Here's what a typical Delta Lake write looks like in PySpark:
```python
# PySpark — Writing to a Delta Lake silver-layer table
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp, sha2, concat_ws

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Read bronze-layer raw events
bronze_df = spark.read.format("delta").load("s3://my-lake/bronze/events/")

# Clean and deduplicate for the silver layer
silver_df = (
    bronze_df
    .dropDuplicates(["event_id"])
    .filter(col("event_timestamp").isNotNull())
    .withColumn("processed_at", current_timestamp())
    .withColumn("row_hash", sha2(concat_ws("||", *bronze_df.columns), 256))
)

# Write with schema merging to handle evolution
(
    silver_df.write
    .format("delta")
    .mode("overwrite")
    .option("mergeSchema", "true")
    .save("s3://my-lake/silver/events/")
)
```
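Overwriting works for full refreshes, but most silver-layer pipelines are incremental. Delta supports upserts via `MERGE INTO`; a sketch in Spark SQL, assuming a registered `updates` view with the same schema as the target:

```sql
-- Upsert late-arriving or corrected events into the silver table
MERGE INTO delta.`s3://my-lake/silver/events/` AS target
USING updates AS source
ON target.event_id = source.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

Each merge is a single atomic commit in the transaction log, so concurrent readers never see a half-applied batch.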
Lakehouse vs Data Warehouse vs Data Lake
The lakehouse vs data warehouse debate isn't about one being universally better. Each architecture makes different trade-offs:
| Dimension | Data Lake | Data Warehouse | Data Lakehouse |
|---|---|---|---|
| Storage format | Open (Parquet, ORC, JSON) | Proprietary | Open (Delta, Iceberg, Hudi) |
| ACID transactions | ❌ No | ✅ Yes | ✅ Yes |
| Schema enforcement | Schema-on-read | Schema-on-write | Both supported |
| Query performance | Slow without optimization | Fast (optimized engine) | Fast with tuning (Z-order, compaction) |
| Cost at scale | Low (object storage) | High (compute-storage coupled) | Low-medium (object storage + compute) |
| ML/Data Science support | ✅ Direct file access | ❌ Export required | ✅ Direct file access |
| Real-time ingestion | ✅ Streaming native | ⚠️ Micro-batch typical | ✅ Streaming native |
| Governance & lineage | ❌ Manual | ✅ Built-in | ✅ Via Unity Catalog / Polaris |
| Vendor lock-in | Low | High | Low-medium (format-dependent) |
| Maturity | 10+ years | 30+ years | 3-5 years |
When to choose a data warehouse: Your workload is primarily SQL analytics, your team is SQL-first, data volumes are moderate (< 10 TB), and you value simplicity over flexibility.
When to choose a data lake: You need cheap storage for massive volumes of unstructured data (images, logs, video) and your consumers are data scientists working in Python/Spark.
When to choose a data lakehouse: You want a single architecture serving both BI and ML workloads, you're dealing with large-scale structured and semi-structured data, and your team can handle the added operational complexity.
Querying Gold-Layer Lakehouse Tables
Once your lakehouse pipeline produces gold-layer tables, querying them with Spark SQL is straightforward:
```sql
-- Spark SQL — Query gold-layer aggregates with time travel
SELECT
    region,
    product_category,
    SUM(revenue) AS total_revenue,
    COUNT(DISTINCT customer_id) AS unique_customers,
    AVG(order_value) AS avg_order_value
FROM delta.`s3://my-lake/gold/order_summary/`
VERSION AS OF 42  -- time travel to a specific version
WHERE order_date >= '2026-01-01'
GROUP BY region, product_category
ORDER BY total_revenue DESC;
```
This is also where the lakehouse shines compared to a traditional data lake: you get warehouse-style SQL semantics with data lake economics.
If you're working with gold-layer tables exported as CSV or exposed through APIs, tools like Harbinger Explorer let you query them directly in the browser using DuckDB WASM — useful for quick ad-hoc exploration without spinning up a Spark cluster.
Common Pitfalls
Building a data lakehouse architecture isn't just "put Delta Lake on S3 and call it done." These are the mistakes that burn teams:
1. Skipping compaction
Small files kill query performance. If your ingestion produces thousands of tiny Parquet files (common with streaming), you need regular OPTIMIZE runs. This isn't optional — it's operational hygiene.
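On Delta Lake, compaction and cleanup are a pair of SQL commands, typically run on a schedule (the path here is illustrative):

```sql
-- Compact small files into larger ones
OPTIMIZE delta.`s3://my-lake/silver/events/`;

-- Remove data files no longer referenced by the transaction log.
-- Default retention is 7 days; don't lower it casually, or you
-- break time travel and in-flight readers.
VACUUM delta.`s3://my-lake/silver/events/`;
```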
2. Ignoring partition strategy
Over-partitioning on high-cardinality columns (like user_id) creates the small-file problem in disguise. Under-partitioning means full table scans. Spend time modeling your access patterns before choosing partition columns.
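As a sketch, a sensible default is partitioning on a low-cardinality time column; the table name, columns, and path below are hypothetical:

```sql
CREATE TABLE events_silver (
    event_id        STRING,
    event_timestamp TIMESTAMP,
    event_date      DATE
)
USING DELTA
PARTITIONED BY (event_date)   -- one directory per day: bounded cardinality
LOCATION 's3://my-lake/silver/events/';

-- Anti-pattern: PARTITIONED BY (user_id) would create millions of
-- tiny partitions and recreate the small-file problem.
```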
3. No data quality enforcement
A lakehouse without data quality checks is just a data lake with extra steps. Use expectations (Great Expectations, Soda, or Delta Live Tables expectations) at the bronze-to-silver boundary.
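The idea can be shown with a minimal hand-rolled check in plain Python. This is illustrative only, not a Great Expectations or Soda API; in production, use one of the frameworks above:

```python
# Illustrative only: a hand-rolled expectation check for the
# bronze-to-silver boundary. Rows failing any named predicate
# are routed to a quarantine list instead of silently dropped.

def apply_expectations(rows, expectations):
    """Split rows into (passed, quarantined-with-failure-names)."""
    passed, quarantined = [], []
    for row in rows:
        failures = [name for name, check in expectations.items()
                    if not check(row)]
        if failures:
            quarantined.append({"row": row, "failed": failures})
        else:
            passed.append(row)
    return passed, quarantined

expectations = {
    "event_id_not_null": lambda r: r.get("event_id") is not None,
    "timestamp_not_null": lambda r: r.get("event_timestamp") is not None,
}

rows = [
    {"event_id": "a1", "event_timestamp": "2026-01-01T00:00:00Z"},
    {"event_id": None, "event_timestamp": "2026-01-01T00:05:00Z"},
]
clean, bad = apply_expectations(rows, expectations)
# clean holds 1 row; bad holds 1 row tagged ["event_id_not_null"]
```

Quarantining with the failure reasons attached is the key design choice: it keeps the silver layer clean while leaving a debuggable trail for the rows you rejected.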
4. Treating the gold layer as a dump
The gold layer should be a curated, well-documented set of business-ready tables — not a copy of silver with some aggregations. Define clear ownership, SLAs, and naming conventions.
5. Underestimating governance
Open table formats give you the building blocks, but you still need a catalog (Unity Catalog, AWS Glue, Project Nessie) to manage access control, lineage, and discoverability. This is especially true with the Databricks lakehouse stack, where Unity Catalog is increasingly the expected governance layer.
6. Premature optimization
Z-ordering, liquid clustering, bloom filters — these are powerful, but applying them before understanding your query patterns is wasted effort. Profile your workloads first, then tune.
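When profiling does show most queries filtering on a few columns, Z-ordering is one clause away from the compaction command you're already running (path and columns are illustrative):

```sql
-- Co-locate data for the filter columns your profiling surfaced
OPTIMIZE delta.`s3://my-lake/gold/order_summary/`
ZORDER BY (region, order_date);
```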
When NOT to Use a Lakehouse
A data lakehouse architecture is not always the right answer. Skip it when:
- Your data fits in PostgreSQL. If you're dealing with < 100 GB of structured data and your queries are simple, a relational database with dbt is simpler, cheaper, and easier to operate. Don't over-architect.
- Your team is pure SQL, zero Spark. Lakehouses assume some comfort with Spark or similar distributed compute frameworks. If your whole team is SQL-only, a managed warehouse (Snowflake, BigQuery) will be more productive.
- You have no ML or data science workload. The main advantage of a lakehouse over a warehouse is unified access for both BI and ML. If you're only doing dashboards and reports, a warehouse is simpler.
- You can't invest in operations. Lakehouses require ongoing maintenance: compaction, vacuum, monitoring partition sizes, managing catalog metadata. If you don't have the engineering bandwidth, the operational overhead will eat your productivity gains.
- Vendor lock-in is a non-concern. If you've already committed to a cloud warehouse and it's working, migrating to a lakehouse for theoretical benefits is rarely worth the disruption.
The Open Table Format Landscape
The choice of open table format shapes your lakehouse implementation. As of early 2026:
- Delta Lake — Strongest ecosystem (Databricks), best Spark integration, most mature. Default choice for Databricks lakehouse deployments.
- Apache Iceberg — Best multi-engine support (Spark, Trino, Flink, Snowflake), strongest partition evolution. Growing fast.
- Apache Hudi — Best for incremental/upsert-heavy workloads, strong CDC support. Smaller community.
My take: if you're on Databricks, Delta Lake is the pragmatic choice. If you want maximum engine flexibility, Iceberg is where the industry is converging. Hudi is solid but increasingly niche.
Key Takeaways
Data lakehouse architecture solves a real problem — the cost and complexity of maintaining separate lake and warehouse systems. With open table formats providing ACID transactions on cheap object storage, the technical foundation is solid.
But it's not magic. You need the right team, the right workload profile (both BI and ML), and the operational discipline to manage compaction, governance, and data quality. Start with a single use case, prove value with one gold-layer table, then expand. Don't boil the ocean.
Continue Reading
dbt vs Spark SQL: How to Choose
dbt or Spark SQL for your transformation layer? A side-by-side comparison of features, pricing, and use cases — with code examples for both and honest trade-offs for analytics engineers.
Delta Live Tables vs Classic ETL: Which Fits Your Pipeline?
DLT vs classic ETL compared honestly: declarative expectations, streaming, debugging, testing, and pricing. Includes DLT code example with expectations syntax.
Medallion Architecture Explained
Medallion architecture (Bronze → Silver → Gold) explained for data engineers. Includes PySpark examples, layer comparison table, common pitfalls, and when not to use it.