Medallion vs Data Vault vs Star Schema: A Decision Framework
Your data platform runs on three layers of abstraction: ingestion, integration, and presentation. The real question isn't which modeling approach is "best" — it's which combination fits your architecture. Medallion, Data Vault, and Star Schema aren't competitors. They solve different problems at different layers, and the teams that understand this build platforms that actually scale.
TL;DR for Busy Architects
| Approach | What It Is | Best Layer | Core Strength | Core Weakness |
|---|---|---|---|---|
| Medallion | Pipeline processing pattern (Bronze → Silver → Gold) | End-to-end data flow | Simple mental model, fast to implement | Not a modeling methodology — says nothing about how to model |
| Data Vault 2.0 | Integration modeling methodology (Hubs, Links, Satellites) | Silver / Integration | Full audit trail, schema flexibility, parallel loading | Query complexity, steep learning curve, verbose schema |
| Star Schema | Presentation modeling (Facts, Dimensions) | Gold / Analytics | Fastest BI query performance, intuitive for analysts | Rigid schema, poor at tracking history without SCD patterns |
The punchline: most production lakehouse platforms in 2026 use a hybrid — Medallion as the pipeline container, Data Vault principles in Silver for integration, and Star Schema in Gold for consumption. The question is whether your use case justifies the complexity of all three.
Medallion Architecture: The Pipeline Container
Medallion architecture, popularized by Databricks, organizes data into three progressive quality layers:
- Bronze — Raw ingestion. Append-only, schema-on-read, full fidelity copy of source data.
- Silver — Cleaned, deduplicated, conformed. Business keys resolved, data types enforced.
- Gold — Aggregated, business-ready. Optimized for specific consumption patterns (dashboards, ML features, APIs).
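The three layers can be sketched in miniature with plain Python. This is a toy illustration of the quality progression, not a real pipeline; the records and field names (user_id, amount, ts) are purely illustrative:

```python
# Minimal sketch of Medallion layer transitions: Bronze keeps everything,
# Silver deduplicates and enforces types, Gold aggregates for consumption.
from collections import defaultdict

bronze = [  # raw ingestion: full fidelity, duplicates and string types included
    {"user_id": "u1", "amount": "10.50", "ts": "2026-01-05"},
    {"user_id": "u1", "amount": "10.50", "ts": "2026-01-05"},  # duplicate from source
    {"user_id": "u2", "amount": "7.25",  "ts": "2026-01-06"},
]

# Silver: deduplicate on the full record and enforce numeric types
seen, silver = set(), []
for row in bronze:
    key = (row["user_id"], row["amount"], row["ts"])
    if key not in seen:
        seen.add(key)
        silver.append({**row, "amount": float(row["amount"])})

# Gold: aggregate per business entity for a dashboard-style consumer
gold = defaultdict(float)
for row in silver:
    gold[row["user_id"]] += row["amount"]

print(dict(gold))  # {'u1': 10.5, 'u2': 7.25}
```

In a real lakehouse each step would be a materialized table (Delta Lake, Iceberg), but the shape of the flow is the same: append-only in, progressively refined out.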
What Medallion Is — and Isn't
Medallion is a data processing pattern, not a data modeling methodology. It tells you where data lives in its lifecycle but says nothing about how to structure it. You can put flat CSVs in Bronze, Data Vault models in Silver, and Star Schemas in Gold. Or you can put wide denormalized tables everywhere. Medallion doesn't care.
This is both its greatest strength and its most common source of confusion. Teams adopt Medallion thinking they've "chosen a modeling approach" when they've actually chosen a pipeline organization strategy.
When Medallion Alone Is Enough
For small-to-medium platforms with fewer than 20 source systems and a single analytics team, pure Medallion with simple cleaned tables in Silver and aggregated tables in Gold works well. The overhead of formal modeling methodologies isn't justified when your data estate fits in one person's head.
Decision signal: If your Silver layer is just "cleaned Bronze" and your Gold layer is just "grouped Silver," you're probably fine without Data Vault or formal dimensional modeling.
Data Vault 2.0: The Integration Engine
Data Vault is a modeling methodology designed for the integration layer. It breaks data into three entity types:
- Hubs — Business keys (e.g., customer_id, order_number). One row per unique business entity.
- Links — Relationships between Hubs (e.g., customer_placed_order). Many-to-many by design.
- Satellites — Descriptive attributes attached to Hubs or Links, with full history tracking via load timestamps.
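A minimal sketch of the three entity types in SQLite. All table and column names (hub_customer, sat_customer_details, the hk hash-key values) are illustrative, not a prescribed Data Vault 2.0 naming standard:

```python
# Hub holds only the business key; the satellite versions the attributes
# via load timestamps, so history is preserved by insert-only loads.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hub_customer (
    customer_hk TEXT PRIMARY KEY, customer_id TEXT,
    load_ts TEXT, record_source TEXT);
CREATE TABLE sat_customer_details (
    customer_hk TEXT, load_ts TEXT, name TEXT, email TEXT,
    PRIMARY KEY (customer_hk, load_ts));
""")
con.execute("INSERT INTO hub_customer VALUES ('hk1', 'C-1001', '2026-01-05', 'crm')")
# Two satellite versions of the same customer: attribute change, no update
con.execute("INSERT INTO sat_customer_details VALUES ('hk1', '2026-01-05', 'Ada', 'ada@old.example.com')")
con.execute("INSERT INTO sat_customer_details VALUES ('hk1', '2026-03-01', 'Ada', 'ada@new.example.com')")

# Current state = latest satellite row per hub; older rows remain as history
row = con.execute("""
    SELECT email FROM sat_customer_details
    WHERE customer_hk = 'hk1'
    ORDER BY load_ts DESC LIMIT 1
""").fetchone()
print(row[0])  # ada@new.example.com
```

Note that nothing is ever updated or deleted: both loads are inserts, which is what makes parallel, order-independent loading possible.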
Why Data Vault Exists
Data Vault solves a specific problem: integrating data from many source systems while preserving full history and enabling parallel development. When you have 50+ source systems, each with their own definition of "customer," Data Vault gives you a structured way to:
- Load sources independently (parallel ETL, no cross-dependencies)
- Track every change with timestamps (full auditability)
- Add new sources without restructuring existing models
- Handle late-arriving data and corrections
The Real Cost of Data Vault
Data Vault is verbose. A single business concept like "customer order" that would be one or two tables in a Star Schema becomes five or more tables in Data Vault (Hub_Customer, Hub_Order, Link_Customer_Order, Sat_Customer_Details, Sat_Order_Details). Query complexity increases proportionally.
Figure: Data Vault entity types. Hubs store only business keys and never change once created; Links capture relationships and are also insert-only, enabling full relationship history; Satellites hold all descriptive data with temporal versioning.
This verbosity is the trade-off for auditability and flexibility. Teams that adopt Data Vault without needing those properties pay the complexity cost without the benefit.
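The join fan-out is easy to see in miniature: reconstructing a single customer-order view touches all five tables. A SQLite sketch with illustrative names and data:

```python
# One business question ("which orders did Ada place, and for how much?")
# requires joining two hubs, one link, and two satellites.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hub_customer (customer_hk TEXT PRIMARY KEY, customer_id TEXT);
CREATE TABLE hub_order    (order_hk TEXT PRIMARY KEY, order_number TEXT);
CREATE TABLE link_customer_order (customer_hk TEXT, order_hk TEXT);
CREATE TABLE sat_customer_details (customer_hk TEXT, load_ts TEXT, name TEXT);
CREATE TABLE sat_order_details    (order_hk TEXT, load_ts TEXT, total REAL);
INSERT INTO hub_customer VALUES ('c1', 'C-1001');
INSERT INTO hub_order    VALUES ('o1', 'SO-42');
INSERT INTO link_customer_order VALUES ('c1', 'o1');
INSERT INTO sat_customer_details VALUES ('c1', '2026-01-05', 'Ada');
INSERT INTO sat_order_details    VALUES ('o1', '2026-01-05', 99.0);
""")
row = con.execute("""
    SELECT sc.name, ho.order_number, so.total
    FROM hub_customer hc
    JOIN link_customer_order l   ON l.customer_hk  = hc.customer_hk
    JOIN hub_order ho            ON ho.order_hk    = l.order_hk
    JOIN sat_customer_details sc ON sc.customer_hk = hc.customer_hk
    JOIN sat_order_details so    ON so.order_hk    = ho.order_hk
""").fetchone()
print(row)  # ('Ada', 'SO-42', 99.0)
```

In a Star Schema the same question is a single fact-to-dimension join, which is exactly why vault structures are usually hidden behind simpler Gold-layer views.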
When Data Vault Is Worth It
- Regulated industries (finance, healthcare, pharma) where full audit trails are non-negotiable
- Many source systems (30+) with overlapping business concepts
- Frequently changing source schemas where rigid models break
- Multiple development teams that need to work on different sources in parallel without merge conflicts
- Compliance requirements for data lineage and point-in-time reconstruction
When Data Vault Is Overkill
- Single-cloud analytics platforms with fewer than 10 sources
- Startups where speed of delivery matters more than audit compliance
- Teams without dedicated data modelers (Data Vault requires modeling expertise)
- Use cases where nobody will ever ask "what did this record look like on March 15th?"
Star Schema: The Analytics Accelerator
Star Schema (and its normalized cousin, Snowflake Schema) has been the standard for analytical modeling since Ralph Kimball's original work in the 1990s. It structures data into:
- Fact Tables — Measurable events (orders, clicks, transactions). Contains foreign keys and metrics.
- Dimension Tables — Descriptive context (customers, products, dates). Contains attributes for filtering and grouping.
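A minimal star schema in SQLite, with illustrative names and data, showing the canonical BI pattern: scan the fact table, join dimensions, group by their attributes:

```python
# One fact table (orders) surrounded by two dimensions (customer, date).
# Surrogate keys link facts to context; metrics live only in the fact.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT);
CREATE TABLE fact_orders  (customer_key INTEGER, date_key INTEGER, amount REAL);
INSERT INTO dim_customer VALUES (1, 'Ada', 'EU'), (2, 'Lin', 'APAC');
INSERT INTO dim_date VALUES (20260105, '2026-01-05', '2026-01'),
                            (20260210, '2026-02-10', '2026-02');
INSERT INTO fact_orders VALUES (1, 20260105, 100.0),
                               (1, 20260210, 50.0),
                               (2, 20260105, 75.0);
""")
# "Revenue by region and month": the business question maps directly to SQL
rows = con.execute("""
    SELECT c.region, d.month, SUM(f.amount)
    FROM fact_orders f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    JOIN dim_date d     ON d.date_key     = f.date_key
    GROUP BY c.region, d.month
    ORDER BY c.region, d.month
""").fetchall()
print(rows)  # [('APAC', '2026-01', 75.0), ('EU', '2026-01', 100.0), ('EU', '2026-02', 50.0)]
```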
Why Star Schema Still Dominates the Gold Layer
Despite being 30+ years old, Star Schema remains the most efficient structure for BI query patterns. The reasons are practical:
- Query engines optimize for it. Columnar engines (Databricks SQL, BigQuery, Redshift, Snowflake) all optimize join patterns typical of Star Schema queries.
- BI tools expect it. Power BI, Tableau, Looker, and virtually every BI tool generate optimal SQL against Star Schema structures.
- Analysts understand it. The mental model of "events surrounded by context" maps directly to business questions.
- Performance is predictable. Fact table scans with dimension lookups have well-understood performance characteristics.
Star Schema Limitations
- Schema rigidity. Adding a new attribute to a dimension requires DDL changes and potentially backfilling.
- History handling is bolted on. Slowly Changing Dimensions (SCD Type 2) add complexity that Star Schema wasn't originally designed for.
- Poor fit for integration. Star Schema assumes data is already clean and conformed — it's a presentation model, not an integration model.
- Denormalization trade-offs. Wide dimension tables duplicate data by design, which can cause update anomalies if not managed.
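The SCD Type 2 pattern mentioned above can be sketched as "expire the current row, insert a new version." This is a simplified illustration; production implementations typically use MERGE statements and hash-based change detection rather than row-at-a-time updates:

```python
# SCD Type 2 sketch: each dimension row carries a validity window and a
# current-row flag, so old fact rows keep pointing at the historical version.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_id  TEXT, region TEXT,
    valid_from   TEXT, valid_to TEXT, is_current INTEGER)""")
con.execute("INSERT INTO dim_customer (customer_id, region, valid_from, valid_to, is_current) "
            "VALUES ('C-1001', 'EU', '2026-01-01', '9999-12-31', 1)")

def apply_scd2(con, customer_id, new_region, change_date):
    """Expire the current row and insert a new version if the attribute changed."""
    cur = con.execute("SELECT region FROM dim_customer WHERE customer_id=? AND is_current=1",
                      (customer_id,)).fetchone()
    if cur and cur[0] != new_region:
        con.execute("UPDATE dim_customer SET valid_to=?, is_current=0 "
                    "WHERE customer_id=? AND is_current=1", (change_date, customer_id))
        con.execute("INSERT INTO dim_customer (customer_id, region, valid_from, valid_to, is_current) "
                    "VALUES (?, ?, ?, '9999-12-31', 1)", (customer_id, new_region, change_date))

apply_scd2(con, 'C-1001', 'APAC', '2026-03-15')
rows = con.execute("SELECT region, is_current FROM dim_customer ORDER BY customer_key").fetchall()
print(rows)  # [('EU', 0), ('APAC', 1)]
```

The extra machinery (validity columns, flags, expiry logic) is exactly the "bolted on" complexity: history was never part of the star schema's core design.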
The Hybrid Pattern: Medallion + Data Vault + Star Schema
The most mature data platforms in 2026 use all three approaches at their appropriate layer. This isn't complexity for complexity's sake — it's placing each methodology where it adds the most value.
Figure: The hybrid flow. Bronze receives raw data from all sources (append-only, full fidelity); Silver uses Data Vault Hubs, Links, and Satellites to integrate, deduplicate, and track history across sources; Gold materializes Star Schema Facts and Dimensions optimized for specific analytics use cases.
How the Layers Interact
| Layer | Methodology | Purpose | Schema Flexibility | Query Performance |
|---|---|---|---|---|
| Bronze | None (raw append) | Preserve source fidelity | Maximum — raw schema | Not queried directly |
| Silver | Data Vault 2.0 | Integrate, deduplicate, historize | High — add Satellites freely | Moderate — complex joins |
| Gold | Star Schema | Serve analytics and BI | Low — structured for consumption | Maximum — BI-optimized |
When the Full Hybrid Is Justified
The full three-layer hybrid is appropriate when:
- You have 15+ heterogeneous source systems feeding a shared data platform
- Multiple consuming teams need different views of the same data (finance, operations, marketing)
- Regulatory compliance requires point-in-time data reconstruction
- Source schemas change frequently (SaaS APIs, third-party feeds)
- You have dedicated data engineering capacity to maintain the modeling layers
When to Skip the Middle
If your platform is primarily serving a single analytics team from fewer than 10 well-structured sources, skip Data Vault in Silver. Use Medallion with cleaned/conformed tables in Silver and Star Schema in Gold. This configuration serves the large majority of mid-market data platforms.
Decision Framework: Choosing Your Architecture
Step 1: Count Your Sources and Teams
| Scenario | Recommended Approach |
|---|---|
| < 10 sources, 1 analytics team | Medallion only (cleaned tables in Silver, aggregates in Gold) |
| 10–30 sources, 2–3 consuming teams | Medallion + Star Schema in Gold |
| 30+ sources, multiple teams, compliance needs | Full hybrid: Medallion + Data Vault in Silver + Star Schema in Gold |
| Single-source operational analytics | Star Schema directly — skip the pipeline layers |
Step 2: Assess Your Compliance Requirements
If you need to answer "What did the data look like on date X?" with certainty, you need either:
- Data Vault with satellite history tracking (structured approach)
- Delta Lake / Iceberg time travel (infrastructure approach)
For most teams in 2026, table format time travel (Delta Lake's VERSION AS OF, Iceberg's snapshot queries) provides sufficient point-in-time capability without the full Data Vault overhead. Data Vault becomes necessary when you need cross-source history integration — not just single-table versioning.
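The satellite-based answer to "what did the data look like on date X?" is a simple temporal query: the state as of a date is the latest satellite row loaded on or before it. A SQLite sketch with illustrative names:

```python
# Point-in-time reconstruction from an insert-only satellite: because rows
# are never updated, every historical state remains queryable by load_ts.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sat_customer_details (customer_hk TEXT, load_ts TEXT, email TEXT)")
con.executemany("INSERT INTO sat_customer_details VALUES (?, ?, ?)", [
    ("hk1", "2026-01-05", "ada@old.example.com"),
    ("hk1", "2026-04-01", "ada@new.example.com"),
])

as_of = "2026-03-15"  # "what did this record look like on March 15th?"
row = con.execute("""
    SELECT email FROM sat_customer_details
    WHERE customer_hk = 'hk1' AND load_ts <= ?
    ORDER BY load_ts DESC LIMIT 1
""", (as_of,)).fetchone()
print(row[0])  # ada@old.example.com
```

Table-format time travel gives you the same answer per table; the vault's advantage is that the query works uniformly across all integrated sources, joined through stable hub keys.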
Step 3: Evaluate Team Skills
This is the factor most architecture diagrams ignore. Data Vault requires modeling expertise that many teams don't have. An incorrectly implemented Data Vault is worse than no Data Vault — it adds complexity without delivering the auditability and flexibility benefits.
| Team Profile | Modeling Recommendation |
|---|---|
| Junior/mid data engineers, no modeler | Medallion + wide Silver tables + Gold aggregates |
| Experienced engineers, 1+ data modeler | Medallion + Star Schema (Silver conformed, Gold dimensional) |
| Dedicated modeling team, compliance needs | Full hybrid with Data Vault |
Step 4: Consider Your Cloud Platform
Different platforms have different strengths that affect the modeling decision:
- Databricks — Strong Medallion support via Delta Live Tables. Data Vault works well with Unity Catalog's lineage tracking. Star Schema benefits from Photon engine optimizations.
- Snowflake — Dynamic tables can automate Star Schema materialization. Data Vault patterns benefit from zero-copy cloning for testing.
- BigQuery — Materialized views handle Gold-layer aggregation natively. Nested/repeated fields can reduce Data Vault join complexity in Silver.
- Microsoft Fabric — Medallion is a first-class concept. OneLake shortcuts simplify Bronze ingestion. DirectLake mode optimizes Star Schema consumption in Power BI.
Common Anti-Patterns
Anti-Pattern 1: Data Vault Everywhere
Using Hub-Link-Satellite structures in the Gold layer forces analysts to write complex multi-join queries for simple questions. Data Vault is an integration methodology — keep it in Silver and materialize simpler structures for consumption.
Anti-Pattern 2: Medallion Without Modeling
Labeling tables as "Bronze," "Silver," and "Gold" without applying any modeling discipline means your Gold layer is just "less messy Bronze." If Gold tables don't have defined grain, clear business keys, and documented relationships, they're not Gold — they're spray-painted Silver.
Anti-Pattern 3: Star Schema as Integration Layer
Building Star Schemas directly from raw sources means every source change requires restructuring your dimensional model. This is brittle and doesn't scale beyond a handful of sources. Star Schema is a presentation pattern — it needs clean, integrated data as input.
Anti-Pattern 4: Over-Engineering for Scale You Don't Have
A 5-person startup with 3 data sources doesn't need Data Vault. The overhead of maintaining Hubs, Links, and Satellites will consume more engineering time than it saves. Start simple, add complexity when the pain justifies it.
The Practical Playbook
Here's what actually works for most teams starting a new lakehouse platform in 2026:
Phase 1 — Get data flowing (Week 1–4)
- Set up Medallion layers (Bronze, Silver, Gold)
- Bronze: raw ingestion with Delta Lake or Iceberg
- Silver: cleaned, typed, deduplicated tables (no formal modeling yet)
- Gold: aggregated views for immediate analytics needs
Phase 2 — Introduce dimensional modeling (Month 2–3)
- Model Gold layer as proper Star Schemas for primary BI use cases
- Define fact table grain, dimension conformance, SCD strategy
- This alone covers most analytics requirements
Phase 3 — Add Data Vault if needed (Month 4+, only if justified)
- Introduce Data Vault in Silver only when source complexity or compliance demands it
- Start with the highest-value integration point (usually customer or product)
- Don't boil the ocean — vault incrementally
If you're exploring data from multiple APIs and want to understand source structures before committing to a modeling approach, tools like Harbinger Explorer let you query raw API data with SQL directly in the browser — useful for profiling sources during the Bronze-layer design phase.
Further Reading
- Data Lakehouse Architecture Explained — Foundation concepts for the platform these models sit on
- Medallion Architecture Explained — Deep dive into Bronze, Silver, Gold layer design
- Data Vault Modeling: Hubs, Links, and Satellites — Practical implementation guide for Data Vault 2.0
References:
- Databricks: What is Medallion Architecture?
- Microsoft Learn: Medallion Lakehouse Architecture
- Dan Linstedt: Data Vault 2.0 Standards
- Ralph Kimball Group: Dimensional Modeling Techniques