Medallion vs Data Vault vs Star Schema: A Decision Framework
Your data platform runs on three layers of abstraction: ingestion, integration, and presentation. The real question isn't which modeling approach is "best" — it's which combination fits your architecture. Medallion, Data Vault, and Star Schema aren't competitors. They solve different problems at different layers, and the teams that understand this build platforms that actually scale.
TL;DR for Busy Architects
| Approach | What It Is | Best Layer | Core Strength | Core Weakness |
|---|---|---|---|---|
| Medallion | Pipeline processing pattern (Bronze → Silver → Gold) | End-to-end data flow | Simple mental model, fast to implement | Not a modeling methodology — says nothing about how to model |
| Data Vault 2.0 | Integration modeling methodology (Hubs, Links, Satellites) | Silver / Integration | Full audit trail, schema flexibility, parallel loading | Query complexity, steep learning curve, verbose schema |
| Star Schema | Presentation modeling (Facts, Dimensions) | Gold / Analytics | Fastest BI query performance, intuitive for analysts | Rigid schema, poor at tracking history without SCD patterns |
The punchline: most production lakehouse platforms in 2026 use a hybrid — Medallion as the pipeline container, Data Vault principles in Silver for integration, and Star Schema in Gold for consumption. The question is whether your use case justifies the complexity of all three.
Medallion Architecture: The Pipeline Container
Medallion architecture, popularized by Databricks, organizes data into three progressive quality layers:
- Bronze — Raw ingestion. Append-only, schema-on-read, full fidelity copy of source data.
- Silver — Cleaned, deduplicated, conformed. Business keys resolved, data types enforced.
- Gold — Aggregated, business-ready. Optimized for specific consumption patterns (dashboards, ML features, APIs).
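The three layers can be sketched in miniature with plain Python. This is a toy illustration of the quality progression, not a real pipeline; the records and field names (user_id, amount, ts) are purely illustrative:

```python
# Minimal sketch of Medallion layer transitions: Bronze keeps everything,
# Silver deduplicates and enforces types, Gold aggregates for consumption.
from collections import defaultdict

bronze = [  # raw ingestion: full fidelity, duplicates and string types included
    {"user_id": "u1", "amount": "10.50", "ts": "2026-01-05"},
    {"user_id": "u1", "amount": "10.50", "ts": "2026-01-05"},  # duplicate from source
    {"user_id": "u2", "amount": "7.25",  "ts": "2026-01-06"},
]

# Silver: deduplicate on the full record and enforce numeric types
seen, silver = set(), []
for row in bronze:
    key = (row["user_id"], row["amount"], row["ts"])
    if key not in seen:
        seen.add(key)
        silver.append({**row, "amount": float(row["amount"])})

# Gold: aggregate per business entity for a dashboard-style consumer
gold = defaultdict(float)
for row in silver:
    gold[row["user_id"]] += row["amount"]

print(dict(gold))  # {'u1': 10.5, 'u2': 7.25}
```

In a real lakehouse each step would be a materialized table (Delta Lake, Iceberg), but the shape of the flow is the same: append-only in, progressively refined out.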
What Medallion Is — and Isn't
Medallion is a data processing pattern, not a data modeling methodology. It tells you where data lives in its lifecycle but says nothing about how to structure it. You can put flat CSVs in Bronze, Data Vault models in Silver, and Star Schemas in Gold. Or you can put wide denormalized tables everywhere. Medallion doesn't care.
This is both its greatest strength and its most common source of confusion. Teams adopt Medallion thinking they've "chosen a modeling approach" when they've actually chosen a pipeline organization strategy.
When Medallion Alone Is Enough
For small-to-medium platforms with fewer than 20 source systems and a single analytics team, pure Medallion with simple cleaned tables in Silver and aggregated tables in Gold works well. The overhead of formal modeling methodologies isn't justified when your data estate fits in one person's head.
Decision signal: If your Silver layer is just "cleaned Bronze" and your Gold layer is just "grouped Silver," you're probably fine without Data Vault or formal dimensional modeling.
Data Vault 2.0: The Integration Engine
Data Vault is a modeling methodology designed for the integration layer. It breaks data into three entity types:
- Hubs — Business keys (e.g., customer_id, order_number). One row per unique business entity.
- Links — Relationships between Hubs (e.g., customer_placed_order). Many-to-many by design.
- Satellites — Descriptive attributes attached to Hubs or Links, with full history tracking via load timestamps.
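A minimal sketch of the three entity types in SQLite. All table and column names (hub_customer, sat_customer_details, the hk hash-key values) are illustrative, not a prescribed Data Vault 2.0 naming standard:

```python
# Hub holds only the business key; the satellite versions the attributes
# via load timestamps, so history is preserved by insert-only loads.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hub_customer (
    customer_hk TEXT PRIMARY KEY, customer_id TEXT,
    load_ts TEXT, record_source TEXT);
CREATE TABLE sat_customer_details (
    customer_hk TEXT, load_ts TEXT, name TEXT, email TEXT,
    PRIMARY KEY (customer_hk, load_ts));
""")
con.execute("INSERT INTO hub_customer VALUES ('hk1', 'C-1001', '2026-01-05', 'crm')")
# Two satellite versions of the same customer: attribute change, no update
con.execute("INSERT INTO sat_customer_details VALUES ('hk1', '2026-01-05', 'Ada', 'ada@old.example.com')")
con.execute("INSERT INTO sat_customer_details VALUES ('hk1', '2026-03-01', 'Ada', 'ada@new.example.com')")

# Current state = latest satellite row per hub; older rows remain as history
row = con.execute("""
    SELECT email FROM sat_customer_details
    WHERE customer_hk = 'hk1'
    ORDER BY load_ts DESC LIMIT 1
""").fetchone()
print(row[0])  # ada@new.example.com
```

Note that nothing is ever updated or deleted: both loads are inserts, which is what makes parallel, order-independent loading possible.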
Why Data Vault Exists
Data Vault solves a specific problem: integrating data from many source systems while preserving full history and enabling parallel development. When you have 50+ source systems, each with their own definition of "customer," Data Vault gives you a structured way to:
- Load sources independently (parallel ETL, no cross-dependencies)
- Track every change with timestamps (full auditability)
- Add new sources without restructuring existing models
- Handle late-arriving data and corrections
The Real Cost of Data Vault
Data Vault is verbose. A single business concept like "customer order" that would be one or two tables in a Star Schema becomes five or more tables in Data Vault (Hub_Customer, Hub_Order, Link_Customer_Order, Sat_Customer_Details, Sat_Order_Details). Query complexity increases proportionally.
Figure: Data Vault entity types. Hubs store only business keys and never change once created; Links capture relationships and are also insert-only, enabling full relationship history; Satellites hold all descriptive data with temporal versioning.
This verbosity is the trade-off for auditability and flexibility. Teams that adopt Data Vault without needing those properties pay the complexity cost without the benefit.
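The join fan-out is easy to see in miniature: reconstructing a single customer-order view touches all five tables. A SQLite sketch with illustrative names and data:

```python
# One business question ("which orders did Ada place, and for how much?")
# requires joining two hubs, one link, and two satellites.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hub_customer (customer_hk TEXT PRIMARY KEY, customer_id TEXT);
CREATE TABLE hub_order    (order_hk TEXT PRIMARY KEY, order_number TEXT);
CREATE TABLE link_customer_order (customer_hk TEXT, order_hk TEXT);
CREATE TABLE sat_customer_details (customer_hk TEXT, load_ts TEXT, name TEXT);
CREATE TABLE sat_order_details    (order_hk TEXT, load_ts TEXT, total REAL);
INSERT INTO hub_customer VALUES ('c1', 'C-1001');
INSERT INTO hub_order    VALUES ('o1', 'SO-42');
INSERT INTO link_customer_order VALUES ('c1', 'o1');
INSERT INTO sat_customer_details VALUES ('c1', '2026-01-05', 'Ada');
INSERT INTO sat_order_details    VALUES ('o1', '2026-01-05', 99.0);
""")
row = con.execute("""
    SELECT sc.name, ho.order_number, so.total
    FROM hub_customer hc
    JOIN link_customer_order l   ON l.customer_hk  = hc.customer_hk
    JOIN hub_order ho            ON ho.order_hk    = l.order_hk
    JOIN sat_customer_details sc ON sc.customer_hk = hc.customer_hk
    JOIN sat_order_details so    ON so.order_hk    = ho.order_hk
""").fetchone()
print(row)  # ('Ada', 'SO-42', 99.0)
```

In a Star Schema the same question is a single fact-to-dimension join, which is exactly why vault structures are usually hidden behind simpler Gold-layer views.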
When Data Vault Is Worth It
- Regulated industries (finance, healthcare, pharma) where full audit trails are non-negotiable
- Many source systems (30+) with overlapping business concepts
- Frequently changing source schemas where rigid models break
- Multiple development teams that need to work on different sources in parallel without merge conflicts
- Compliance requirements for data lineage and point-in-time reconstruction
When Data Vault Is Overkill
- Single-cloud analytics platforms with fewer than 10 sources
- Startups where speed of delivery matters more than audit compliance
- Teams without dedicated data modelers (Data Vault requires modeling expertise)
- Use cases where nobody will ever ask "what did this record look like on March 15th?"
Star Schema: The Analytics Accelerator
Star Schema (and its normalized cousin, Snowflake Schema) has been the standard for analytical modeling since Ralph Kimball's original work in the 1990s. It structures data into:
- Fact Tables — Measurable events (orders, clicks, transactions). Contains foreign keys and metrics.
- Dimension Tables — Descriptive context (customers, products, dates). Contains attributes for filtering and grouping.
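A minimal star schema in SQLite, with illustrative names and data, showing the canonical BI pattern: scan the fact table, join dimensions, group by their attributes:

```python
# One fact table (orders) surrounded by two dimensions (customer, date).
# Surrogate keys link facts to context; metrics live only in the fact.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT);
CREATE TABLE fact_orders  (customer_key INTEGER, date_key INTEGER, amount REAL);
INSERT INTO dim_customer VALUES (1, 'Ada', 'EU'), (2, 'Lin', 'APAC');
INSERT INTO dim_date VALUES (20260105, '2026-01-05', '2026-01'),
                            (20260210, '2026-02-10', '2026-02');
INSERT INTO fact_orders VALUES (1, 20260105, 100.0),
                               (1, 20260210, 50.0),
                               (2, 20260105, 75.0);
""")
# "Revenue by region and month": the business question maps directly to SQL
rows = con.execute("""
    SELECT c.region, d.month, SUM(f.amount)
    FROM fact_orders f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    JOIN dim_date d     ON d.date_key     = f.date_key
    GROUP BY c.region, d.month
    ORDER BY c.region, d.month
""").fetchall()
print(rows)  # [('APAC', '2026-01', 75.0), ('EU', '2026-01', 100.0), ('EU', '2026-02', 50.0)]
```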
Why Star Schema Still Dominates the Gold Layer
Despite being 30+ years old, Star Schema remains the most efficient structure for BI query patterns. The reasons are practical:
- Query engines optimize for it. Columnar engines (Databricks SQL, BigQuery, Redshift, Snowflake) all optimize join patterns typical of Star Schema queries.
- BI tools expect it. Power BI, Tableau, Looker, and virtually every BI tool generate optimal SQL against Star Schema structures.
- Analysts understand it. The mental model of "events surrounded by context" maps directly to business questions.
- Performance is predictable. Fact table scans with dimension lookups have well-understood performance characteristics.
Star Schema Limitations
- Schema rigidity. Adding a new attribute to a dimension requires DDL changes and potentially backfilling.
- History handling is bolted on. Slowly Changing Dimensions (SCD Type 2) add complexity that Star Schema wasn't originally designed for.
- Poor fit for integration. Star Schema assumes data is already clean and conformed — it's a presentation model, not an integration model.
- Denormalization trade-offs. Wide dimension tables duplicate data by design, which can cause update anomalies if not managed.
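The SCD Type 2 pattern mentioned above can be sketched as "expire the current row, insert a new version." This is a simplified illustration; production implementations typically use MERGE statements and hash-based change detection rather than row-at-a-time updates:

```python
# SCD Type 2 sketch: each dimension row carries a validity window and a
# current-row flag, so old fact rows keep pointing at the historical version.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_id  TEXT, region TEXT,
    valid_from   TEXT, valid_to TEXT, is_current INTEGER)""")
con.execute("INSERT INTO dim_customer (customer_id, region, valid_from, valid_to, is_current) "
            "VALUES ('C-1001', 'EU', '2026-01-01', '9999-12-31', 1)")

def apply_scd2(con, customer_id, new_region, change_date):
    """Expire the current row and insert a new version if the attribute changed."""
    cur = con.execute("SELECT region FROM dim_customer WHERE customer_id=? AND is_current=1",
                      (customer_id,)).fetchone()
    if cur and cur[0] != new_region:
        con.execute("UPDATE dim_customer SET valid_to=?, is_current=0 "
                    "WHERE customer_id=? AND is_current=1", (change_date, customer_id))
        con.execute("INSERT INTO dim_customer (customer_id, region, valid_from, valid_to, is_current) "
                    "VALUES (?, ?, ?, '9999-12-31', 1)", (customer_id, new_region, change_date))

apply_scd2(con, 'C-1001', 'APAC', '2026-03-15')
rows = con.execute("SELECT region, is_current FROM dim_customer ORDER BY customer_key").fetchall()
print(rows)  # [('EU', 0), ('APAC', 1)]
```

The extra machinery (validity columns, flags, expiry logic) is exactly the "bolted on" complexity: history was never part of the star schema's core design.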
The Hybrid Pattern: Medallion + Data Vault + Star Schema
The most mature data platforms in 2026 use all three approaches at their appropriate layer. This isn't complexity for complexity's sake — it's placing each methodology where it adds the most value.
Figure: The hybrid flow. Bronze receives raw data from all sources (append-only, full fidelity); Silver uses Data Vault Hubs, Links, and Satellites to integrate, deduplicate, and track history across sources; Gold materializes Star Schema Facts and Dimensions optimized for specific analytics use cases.
How the Layers Interact
| Layer | Methodology | Purpose | Schema Flexibility | Query Performance |
|---|---|---|---|---|
| Bronze | None (raw append) | Preserve source fidelity | Maximum — raw schema | Not queried directly |
| Silver | Data Vault 2.0 | Integrate, deduplicate, historize | High — add Satellites freely | Moderate — complex joins |
| Gold | Star Schema | Serve analytics and BI | Low — structured for consumption | Maximum — BI-optimized |
When the Full Hybrid Is Justified
The full three-layer hybrid is appropriate when:
- You have 15+ heterogeneous source systems feeding a shared data platform
- Multiple consuming teams need different views of the same data (finance, operations, marketing)
- Regulatory compliance requires point-in-time data reconstruction
- Source schemas change frequently (SaaS APIs, third-party feeds)
- You have dedicated data engineering capacity to maintain the modeling layers
When to Skip the Middle
If your platform is primarily serving a single analytics team from fewer than 10 well-structured sources, skip Data Vault in Silver. Use Medallion with cleaned/conformed tables in Silver and Star Schema in Gold. This configuration serves the large majority of mid-market data platforms.
Decision Framework: Choosing Your Architecture
Step 1: Count Your Sources and Teams
| Scenario | Recommended Approach |
|---|---|
| < 10 sources, 1 analytics team | Medallion only (cleaned tables in Silver, aggregates in Gold) |
| 10–30 sources, 2–3 consuming teams | Medallion + Star Schema in Gold |
| 30+ sources, multiple teams, compliance needs | Full hybrid: Medallion + Data Vault in Silver + Star Schema in Gold |
| Single-source operational analytics | Star Schema directly — skip the pipeline layers |
Step 2: Assess Your Compliance Requirements
If you need to answer "What did the data look like on date X?" with certainty, you need either:
- Data Vault with satellite history tracking (structured approach)
- Delta Lake / Iceberg time travel (infrastructure approach)
For most teams in 2026, table format time travel (Delta Lake's VERSION AS OF, Iceberg's snapshot queries) provides sufficient point-in-time capability without the full Data Vault overhead. Data Vault becomes necessary when you need cross-source history integration — not just single-table versioning.
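The satellite-based answer to "what did the data look like on date X?" is a simple temporal query: the state as of a date is the latest satellite row loaded on or before it. A SQLite sketch with illustrative names:

```python
# Point-in-time reconstruction from an insert-only satellite: because rows
# are never updated, every historical state remains queryable by load_ts.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sat_customer_details (customer_hk TEXT, load_ts TEXT, email TEXT)")
con.executemany("INSERT INTO sat_customer_details VALUES (?, ?, ?)", [
    ("hk1", "2026-01-05", "ada@old.example.com"),
    ("hk1", "2026-04-01", "ada@new.example.com"),
])

as_of = "2026-03-15"  # "what did this record look like on March 15th?"
row = con.execute("""
    SELECT email FROM sat_customer_details
    WHERE customer_hk = 'hk1' AND load_ts <= ?
    ORDER BY load_ts DESC LIMIT 1
""", (as_of,)).fetchone()
print(row[0])  # ada@old.example.com
```

Table-format time travel gives you the same answer per table; the vault's advantage is that the query works uniformly across all integrated sources, joined through stable hub keys.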
Step 3: Evaluate Team Skills
This is the factor most architecture diagrams ignore. Data Vault requires modeling expertise that many teams don't have. An incorrectly implemented Data Vault is worse than no Data Vault — it adds complexity without delivering the auditability and flexibility benefits.
| Team Profile | Modeling Recommendation |
|---|---|
| Junior/mid data engineers, no modeler | Medallion + wide Silver tables + Gold aggregates |
| Experienced engineers, 1+ data modeler | Medallion + Star Schema (Silver conformed, Gold dimensional) |
| Dedicated modeling team, compliance needs | Full hybrid with Data Vault |
Step 4: Consider Your Cloud Platform
Different platforms have different strengths that affect the modeling decision:
- Databricks — Strong Medallion support via Delta Live Tables. Data Vault works well with Unity Catalog's lineage tracking. Star Schema benefits from Photon engine optimizations.
- Snowflake — Dynamic tables can automate Star Schema materialization. Data Vault patterns benefit from zero-copy cloning for testing.
- BigQuery — Materialized views handle Gold-layer aggregation natively. Nested/repeated fields can reduce Data Vault join complexity in Silver.
- Microsoft Fabric — Medallion is a first-class concept. OneLake shortcuts simplify Bronze ingestion. DirectLake mode optimizes Star Schema consumption in Power BI.
Common Anti-Patterns
Anti-Pattern 1: Data Vault Everywhere
Using Hub-Link-Satellite structures in the Gold layer forces analysts to write complex multi-join queries for simple questions. Data Vault is an integration methodology — keep it in Silver and materialize simpler structures for consumption.
Anti-Pattern 2: Medallion Without Modeling
Labeling tables as "Bronze," "Silver," and "Gold" without applying any modeling discipline means your Gold layer is just "less messy Bronze." If Gold tables don't have defined grain, clear business keys, and documented relationships, they're not Gold — they're spray-painted Silver.
Anti-Pattern 3: Star Schema as Integration Layer
Building Star Schemas directly from raw sources means every source change requires restructuring your dimensional model. This is brittle and doesn't scale beyond a handful of sources. Star Schema is a presentation pattern — it needs clean, integrated data as input.
Anti-Pattern 4: Over-Engineering for Scale You Don't Have
A 5-person startup with 3 data sources doesn't need Data Vault. The overhead of maintaining Hubs, Links, and Satellites will consume more engineering time than it saves. Start simple, add complexity when the pain justifies it.
The Practical Playbook
Here's what actually works for most teams starting a new lakehouse platform in 2026:
Phase 1 — Get data flowing (Week 1–4)
- Set up Medallion layers (Bronze, Silver, Gold)
- Bronze: raw ingestion with Delta Lake or Iceberg
- Silver: cleaned, typed, deduplicated tables (no formal modeling yet)
- Gold: aggregated views for immediate analytics needs
Phase 2 — Introduce dimensional modeling (Month 2–3)
- Model Gold layer as proper Star Schemas for primary BI use cases
- Define fact table grain, dimension conformance, SCD strategy
- This alone covers most analytics requirements
Phase 3 — Add Data Vault if needed (Month 4+, only if justified)
- Introduce Data Vault in Silver only when source complexity or compliance demands it
- Start with the highest-value integration point (usually customer or product)
- Don't boil the ocean — vault incrementally
If you're exploring data from multiple APIs and want to understand source structures before committing to a modeling approach, tools like Harbinger Explorer let you query raw API data with SQL directly in the browser — useful for profiling sources during the Bronze-layer design phase.
Further Reading
- Data Lakehouse Architecture Explained — Foundation concepts for the platform these models sit on
- Medallion Architecture Explained — Deep dive into Bronze, Silver, Gold layer design
- Data Vault Modeling: Hubs, Links, and Satellites — Practical implementation guide for Data Vault 2.0
References:
- Databricks: What is Medallion Architecture?
- Microsoft Learn: Medallion Lakehouse Architecture
- Dan Linstedt: Data Vault 2.0 Standards
- Ralph Kimball Group: Dimensional Modeling Techniques