Data Lakehouse Architecture Explained
Data lakehouse architecture promises the best of both worlds: the cheap, scalable storage of a data lake combined with the reliability and performance of a data warehouse. But does it actually deliver, and when should you think twice before adopting one?
This article breaks down how data lakehouse architecture works, compares it to traditional approaches, and walks through the practical trade-offs data engineers face when building one.
What Is a Data Lakehouse?
A data lakehouse is a data management architecture that combines the low-cost object storage of data lakes with the transactional guarantees and schema enforcement of data warehouses. The key enabler: open table formats like Delta Lake, Apache Iceberg, and Apache Hudi that add ACID transactions, time travel, and schema evolution on top of Parquet files stored in cloud object storage.
Instead of maintaining separate systems for raw data (lake) and curated data (warehouse), a lakehouse collapses them into a single architecture with distinct layers.
Lakehouse Architecture Layers
The typical data lakehouse architecture follows a medallion pattern — bronze, silver, and gold layers — built on top of cloud object storage.
① Cloud Object Storage — All data lives on cheap, scalable blob storage (S3, ADLS, GCS). No proprietary formats.
② Bronze Layer — Raw data lands here unchanged. Append-only ingestion from APIs, CDC streams, or batch loads. Schema-on-read.
③ Silver Layer — Data is cleaned, deduplicated, and conformed to standard schemas. This is where data quality enforcement happens.
④ Gold Layer — Business-ready aggregates, dimensions, and fact tables optimized for query performance.
⑤ Consumers — BI dashboards, ML training pipelines, and ad-hoc analytics all query the same gold-layer tables.
Delta Lake Architecture: The Most Common Implementation
When people say "data lakehouse architecture," they often mean Delta Lake — the open table format originally built by Databricks. The Databricks lakehouse platform has made Delta Lake the de facto standard for lakehouse implementations, though Apache Iceberg is gaining ground fast.
Delta Lake adds a transaction log (_delta_log/) on top of Parquet files. Every write — insert, update, delete, or schema change — creates a new JSON entry in this log, enabling:
- ACID transactions across concurrent reads and writes
- Time travel via versioned snapshots
- Schema enforcement and evolution without rewriting data
- Z-ordering and data skipping for query performance
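To see how the log enables time travel, here's a minimal pure-Python sketch that replays two toy commits written in the style of Delta's `_delta_log` JSON actions (`add` registers a data file, `remove` tombstones one). The file names and commits are illustrative, not a real table:

```python
import json

# Two toy commits, each a newline-delimited list of JSON actions,
# mimicking files like _delta_log/00000000000000000000.json
commits = [
    '{"add": {"path": "part-0000.parquet", "dataChange": true}}\n'
    '{"add": {"path": "part-0001.parquet", "dataChange": true}}',
    '{"remove": {"path": "part-0000.parquet"}}\n'
    '{"add": {"path": "part-0002.parquet", "dataChange": true}}',
]

def active_files(commits, version):
    """Replay the log up to `version` to reconstruct that snapshot."""
    files = set()
    for entry in commits[: version + 1]:
        for line in entry.splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return sorted(files)

print(active_files(commits, 0))  # → ['part-0000.parquet', 'part-0001.parquet']
print(active_files(commits, 1))  # → ['part-0001.parquet', 'part-0002.parquet']
```

Reading `VERSION AS OF 0` is exactly this: replaying the log to an earlier commit instead of the latest one. Real Delta logs also carry `commitInfo`, `metaData`, and `protocol` actions, plus periodic Parquet checkpoints so readers don't replay every commit.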
Here's what a typical Delta Lake write looks like in PySpark:
```python
# PySpark — Writing to a Delta Lake silver-layer table
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp, sha2, concat_ws

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Read bronze-layer raw events
bronze_df = spark.read.format("delta").load("s3://my-lake/bronze/events/")

# Clean and deduplicate for the silver layer
silver_df = (
    bronze_df
    .dropDuplicates(["event_id"])
    .filter(col("event_timestamp").isNotNull())
    .withColumn("processed_at", current_timestamp())
    .withColumn("row_hash", sha2(concat_ws("||", *bronze_df.columns), 256))
)

# Write with schema merging to handle evolution
(
    silver_df.write
    .format("delta")
    .mode("overwrite")
    .option("mergeSchema", "true")
    .save("s3://my-lake/silver/events/")
)
```
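Overwriting works for full refreshes, but most silver-layer pipelines are incremental. Delta supports upserts via `MERGE INTO`; a sketch in Spark SQL, assuming a registered `updates` view with the same schema as the target:

```sql
-- Upsert late-arriving or corrected events into the silver table
MERGE INTO delta.`s3://my-lake/silver/events/` AS target
USING updates AS source
ON target.event_id = source.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

Each merge is a single atomic commit in the transaction log, so concurrent readers never see a half-applied batch.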
Lakehouse vs Data Warehouse vs Data Lake
The lakehouse vs data warehouse debate isn't about one being universally better. Each architecture makes different trade-offs:
| Dimension | Data Lake | Data Warehouse | Data Lakehouse |
|---|---|---|---|
| Storage format | Open (Parquet, ORC, JSON) | Proprietary | Open (Delta, Iceberg, Hudi) |
| ACID transactions | ❌ No | ✅ Yes | ✅ Yes |
| Schema enforcement | Schema-on-read | Schema-on-write | Both supported |
| Query performance | Slow without optimization | Fast (optimized engine) | Fast with tuning (Z-order, compaction) |
| Cost at scale | Low (object storage) | High (compute-storage coupled) | Low-medium (object storage + compute) |
| ML/Data Science support | ✅ Direct file access | ❌ Export required | ✅ Direct file access |
| Real-time ingestion | ✅ Streaming native | ⚠️ Micro-batch typical | ✅ Streaming native |
| Governance & lineage | ❌ Manual | ✅ Built-in | ✅ Via Unity Catalog / Polaris |
| Vendor lock-in | Low | High | Low-medium (format-dependent) |
| Maturity | 10+ years | 30+ years | 3-5 years |
When to choose a data warehouse: Your workload is primarily SQL analytics, your team is SQL-first, data volumes are moderate (< 10 TB), and you value simplicity over flexibility.
When to choose a data lake: You need cheap storage for massive volumes of unstructured data (images, logs, video) and your consumers are data scientists working in Python/Spark.
When to choose a data lakehouse: You want a single architecture serving both BI and ML workloads, you're dealing with large-scale structured and semi-structured data, and your team can handle the added operational complexity.
Querying Gold-Layer Lakehouse Tables
Once your lakehouse pipeline produces gold-layer tables, querying them with Spark SQL is straightforward:
```sql
-- Spark SQL — Query gold-layer aggregates with time travel
SELECT
    region,
    product_category,
    SUM(revenue) AS total_revenue,
    COUNT(DISTINCT customer_id) AS unique_customers,
    AVG(order_value) AS avg_order_value
FROM delta.`s3://my-lake/gold/order_summary/`
VERSION AS OF 42  -- time travel to a specific version
WHERE order_date >= '2026-01-01'
GROUP BY region, product_category
ORDER BY total_revenue DESC;
```
This is also where the lakehouse shines compared to a traditional data lake: you get warehouse-style SQL semantics with data lake economics.
If you're working with gold-layer tables exported as CSV or exposed through APIs, tools like Harbinger Explorer let you query them directly in the browser using DuckDB WASM — useful for quick ad-hoc exploration without spinning up a Spark cluster.
Common Pitfalls
Building a data lakehouse architecture isn't just "put Delta Lake on S3 and call it done." These are the mistakes that burn teams:
1. Skipping compaction
Small files kill query performance. If your ingestion produces thousands of tiny Parquet files (common with streaming), you need regular OPTIMIZE runs. This isn't optional — it's operational hygiene.
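On Delta Lake, compaction and cleanup are a pair of SQL commands, typically run on a schedule (the path here is illustrative):

```sql
-- Compact small files into larger ones
OPTIMIZE delta.`s3://my-lake/silver/events/`;

-- Remove data files no longer referenced by the transaction log.
-- Default retention is 7 days; don't lower it casually, or you
-- break time travel and in-flight readers.
VACUUM delta.`s3://my-lake/silver/events/`;
```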
2. Ignoring partition strategy
Over-partitioning on high-cardinality columns (like user_id) creates the small-file problem in disguise. Under-partitioning means full table scans. Spend time modeling your access patterns before choosing partition columns.
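As a sketch, a sensible default is partitioning on a low-cardinality time column; the table name, columns, and path below are hypothetical:

```sql
CREATE TABLE events_silver (
    event_id        STRING,
    event_timestamp TIMESTAMP,
    event_date      DATE
)
USING DELTA
PARTITIONED BY (event_date)   -- one directory per day: bounded cardinality
LOCATION 's3://my-lake/silver/events/';

-- Anti-pattern: PARTITIONED BY (user_id) would create millions of
-- tiny partitions and recreate the small-file problem.
```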
3. No data quality enforcement
A lakehouse without data quality checks is just a data lake with extra steps. Use expectations (Great Expectations, Soda, or Delta Live Tables expectations) at the bronze-to-silver boundary.
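The idea can be shown with a minimal hand-rolled check in plain Python. This is illustrative only, not a Great Expectations or Soda API; in production, use one of the frameworks above:

```python
# Illustrative only: a hand-rolled expectation check for the
# bronze-to-silver boundary. Rows failing any named predicate
# are routed to a quarantine list instead of silently dropped.

def apply_expectations(rows, expectations):
    """Split rows into (passed, quarantined-with-failure-names)."""
    passed, quarantined = [], []
    for row in rows:
        failures = [name for name, check in expectations.items()
                    if not check(row)]
        if failures:
            quarantined.append({"row": row, "failed": failures})
        else:
            passed.append(row)
    return passed, quarantined

expectations = {
    "event_id_not_null": lambda r: r.get("event_id") is not None,
    "timestamp_not_null": lambda r: r.get("event_timestamp") is not None,
}

rows = [
    {"event_id": "a1", "event_timestamp": "2026-01-01T00:00:00Z"},
    {"event_id": None, "event_timestamp": "2026-01-01T00:05:00Z"},
]
clean, bad = apply_expectations(rows, expectations)
# clean holds 1 row; bad holds 1 row tagged ["event_id_not_null"]
```

Quarantining with the failure reasons attached is the key design choice: it keeps the silver layer clean while leaving a debuggable trail for the rows you rejected.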
4. Treating the gold layer as a dump
The gold layer should be a curated, well-documented set of business-ready tables — not a copy of silver with some aggregations. Define clear ownership, SLAs, and naming conventions.
5. Underestimating governance
Open table formats give you the building blocks, but you still need a catalog (Unity Catalog, AWS Glue, Project Nessie) to manage access control, lineage, and discoverability. This is especially true with the Databricks lakehouse stack, where Unity Catalog is increasingly the expected governance layer.
6. Premature optimization
Z-ordering, liquid clustering, bloom filters — these are powerful, but applying them before understanding your query patterns is wasted effort. Profile your workloads first, then tune.
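When profiling does show most queries filtering on a few columns, Z-ordering is one clause away from the compaction command you're already running (path and columns are illustrative):

```sql
-- Co-locate data for the filter columns your profiling surfaced
OPTIMIZE delta.`s3://my-lake/gold/order_summary/`
ZORDER BY (region, order_date);
```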
When NOT to Use a Lakehouse
A data lakehouse architecture is not always the right answer. Skip it when:
- Your data fits in PostgreSQL. If you're dealing with < 100 GB of structured data and your queries are simple, a relational database with dbt is simpler, cheaper, and easier to operate. Don't over-architect.
- Your team is pure SQL, zero Spark. Lakehouses assume some comfort with Spark or similar distributed compute frameworks. If your whole team is SQL-only, a managed warehouse (Snowflake, BigQuery) will be more productive.
- You have no ML or data science workload. The main advantage of a lakehouse over a warehouse is unified access for both BI and ML. If you're only doing dashboards and reports, a warehouse is simpler.
- You can't invest in operations. Lakehouses require ongoing maintenance: compaction, vacuum, monitoring partition sizes, managing catalog metadata. If you don't have the engineering bandwidth, the operational overhead will eat your productivity gains.
- Vendor lock-in is a non-concern. If you've already committed to a cloud warehouse and it's working, migrating to a lakehouse for theoretical benefits is rarely worth the disruption.
The Open Table Format Landscape
The choice of open table format shapes your lakehouse implementation. As of early 2026:
- Delta Lake — Strongest ecosystem (Databricks), best Spark integration, most mature. Default choice for Databricks lakehouse deployments.
- Apache Iceberg — Best multi-engine support (Spark, Trino, Flink, Snowflake), strongest partition evolution. Growing fast.
- Apache Hudi — Best for incremental/upsert-heavy workloads, strong CDC support. Smaller community.
My take: if you're on Databricks, Delta Lake is the pragmatic choice. If you want maximum engine flexibility, Iceberg is where the industry is converging. Hudi is solid but increasingly niche.
Key Takeaways
Data lakehouse architecture solves a real problem — the cost and complexity of maintaining separate lake and warehouse systems. With open table formats providing ACID transactions on cheap object storage, the technical foundation is solid.
But it's not magic. You need the right team, the right workload profile (both BI and ML), and the operational discipline to manage compaction, governance, and data quality. Start with a single use case, prove value with one gold-layer table, then expand. Don't boil the ocean.
Continue Reading
dbt vs Spark SQL: How to Choose
dbt or Spark SQL for your transformation layer? A side-by-side comparison of features, pricing, and use cases — with code examples for both and honest trade-offs for analytics engineers.
Delta Live Tables vs Classic ETL: Which Fits Your Pipeline?
DLT vs classic ETL compared honestly: declarative expectations, streaming, debugging, testing, and pricing. Includes DLT code example with expectations syntax.
Medallion Architecture Explained
Medallion architecture (Bronze → Silver → Gold) explained for data engineers. Includes PySpark examples, layer comparison table, common pitfalls, and when not to use it.