Data Vault Modeling: Hubs, Links, and Satellites Explained


Your star schema looked great at launch. Eighteen months later, you've got 14 source systems feeding it, three teams arguing about who owns the customer dimension, and a change request for a new source that requires restructuring four fact tables. Kimball is fine for stable domains. It wasn't designed for this.

Data Vault is a modeling methodology built specifically for enterprise-scale, multi-source data warehouses where source systems change frequently and auditability is non-negotiable. Here's what it is, how it works, and when to use it.


TL;DR

| Dimension | Kimball (Star Schema) | Data Vault |
| --- | --- | --- |
| Query performance | High (out of the box) | Medium (requires business vault/mart layer) |
| Source system changes | Disruptive | Additive — no existing tables modified |
| Auditability | Limited | Full load date + record source on every row |
| Parallel loading | Moderate | High — tables are independent |
| Learning curve | Low | High |
| Best for | Stable domains, BI-first | Multi-source, regulated, agile environments |

The Three Core Constructs

Data Vault has three fundamental building blocks. Once you understand these, the methodology clicks.

1. Hubs — Business Keys

A Hub represents one business entity (Customer, Product, Order) and stores only the business key — the identifier that the business uses, independent of any source system.

-- Data Vault 2.0 Hub: Customer
-- (Standard SQL / Snowflake dialect)
CREATE TABLE hub_customer (
    hub_customer_hk   VARCHAR(32) NOT NULL,   -- MD5 hash of business key (use VARCHAR(64) for SHA-256)
    customer_id       VARCHAR(50) NOT NULL,    -- The raw business key
    load_date         TIMESTAMP   NOT NULL,    -- When this record was first loaded
    record_source     VARCHAR(100) NOT NULL,   -- Which system sourced this key
    CONSTRAINT pk_hub_customer PRIMARY KEY (hub_customer_hk)
);

Rules:

  • Hubs never change once a row is inserted (insert-only)
  • The hash key is deterministic: same business key -> same hash, always
  • No descriptive attributes — those belong in Satellites
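The insert-only rule translates into an idempotent load pattern: hash the incoming business keys and insert only those not already present. Here is a minimal sketch against SQLite, where `INSERT OR IGNORE` stands in for the MERGE or anti-join a real warehouse load would use (table and key names follow the DDL above; the sample keys are invented):

```python
import hashlib
import sqlite3

def hub_hk(business_key: str) -> str:
    # Same normalization as the vault standard: uppercase, trimmed
    return hashlib.md5(str(business_key).upper().strip().encode("utf-8")).hexdigest()

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE hub_customer (
        hub_customer_hk TEXT PRIMARY KEY,
        customer_id     TEXT NOT NULL,
        load_date       TEXT NOT NULL,
        record_source   TEXT NOT NULL
    )
""")

def load_hub(keys, load_date, source):
    # Insert-only: a business key seen before is silently skipped, never updated
    rows = [(hub_hk(k), k, load_date, source) for k in keys]
    con.executemany("INSERT OR IGNORE INTO hub_customer VALUES (?, ?, ?, ?)", rows)

load_hub(["CUST-001", "CUST-002"], "2024-01-01", "crm")
load_hub(["CUST-002", "CUST-003"], "2024-02-01", "erp")  # CUST-002 already known

# Three distinct keys survive; CUST-002 keeps its original crm provenance
print(con.execute("SELECT COUNT(*) FROM hub_customer").fetchone()[0])
```

Because the load is a pure set operation on hash keys, replaying the same batch twice changes nothing — which is exactly what makes Hub loads safe to parallelize and re-run.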

2. Links — Relationships

A Link represents a relationship between two or more Hubs. An order placed by a customer? That's a Link between Hub_Customer and Hub_Order.

-- Data Vault 2.0 Link: Order placed by Customer
CREATE TABLE link_customer_order (
    link_customer_order_hk   VARCHAR(32) NOT NULL,  -- Hash of all parent hub keys
    hub_customer_hk          VARCHAR(32) NOT NULL,  -- FK -> hub_customer
    hub_order_hk             VARCHAR(32) NOT NULL,  -- FK -> hub_order
    load_date                TIMESTAMP   NOT NULL,
    record_source            VARCHAR(100) NOT NULL,
    CONSTRAINT pk_link_customer_order PRIMARY KEY (link_customer_order_hk)
);

Rules:

  • Links are also insert-only
  • The link hash key is derived from all participating hub keys
  • Links model facts about relationships (an order belongs to a customer), not business attributes

3. Satellites — Descriptive Attributes

Satellites store all the descriptive, changing data around a Hub or Link. Customer name, email, address — all in a Satellite hanging off Hub_Customer.

-- Data Vault 2.0 Satellite: Customer attributes from CRM
CREATE TABLE sat_customer_crm (
    hub_customer_hk   VARCHAR(32) NOT NULL,
    load_date         TIMESTAMP   NOT NULL,
    load_end_date     TIMESTAMP,              -- NULL = current record
    hash_diff         VARCHAR(32) NOT NULL,   -- Hash of all attribute values
    first_name        VARCHAR(100),
    last_name         VARCHAR(100),
    email             VARCHAR(200),
    country           VARCHAR(50),
    record_source     VARCHAR(100) NOT NULL,
    CONSTRAINT pk_sat_customer_crm PRIMARY KEY (hub_customer_hk, load_date)
);

Rules:

  • Satellites are the only tables that track history (type-2 SCD by default)
  • Each source system gets its own Satellite — CRM customer data never mixes with ERP customer data
  • hash_diff enables efficient change detection on load
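The hash_diff comparison can be sketched in a few lines: concatenate the normalized attribute values in a fixed column order, hash, and compare against the latest satellite row. One hash comparison replaces column-by-column change detection (attribute names here mirror the satellite DDL above; the sample values are invented):

```python
import hashlib

def hash_diff(attrs: dict) -> str:
    # Fixed column order (sorted names) keeps the hash deterministic
    payload = "||".join(str(attrs[k]).upper().strip() for k in sorted(attrs))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

latest = {"first_name": "Ada", "last_name": "Lovelace", "email": "ada@example.com"}
incoming = {"first_name": "Ada", "last_name": "Lovelace", "email": "ada@newmail.com"}

# Hashes differ, so the loader inserts a new history row; equal hashes mean skip
if hash_diff(incoming) != hash_diff(latest):
    print("change detected: insert new satellite row")
```

In practice the same hash_diff function must be shared by every loader touching the vault — a different column order or normalization rule silently breaks change detection.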

The Data Vault Architecture

[Architecture diagram: Source Systems → Raw Vault (Hubs, Links, Satellites) → Business Vault → Information Marts]

① Source systems load independently into the Raw Vault — no cross-system dependencies. ② The Business Vault adds computed attributes and point-in-time structures. ③ Information Marts expose optimized dimensional models for BI tools.



Point-in-Time (PIT) Tables

Multiple Satellites hanging off one Hub create a performance problem: joining across five Satellites to reconstruct a customer record at a point in time requires complex SQL. PIT tables solve this.

-- PIT table: Customer as of a specific date
-- (Standard SQL / Snowflake dialect)
CREATE TABLE pit_customer AS
SELECT
    snap.snapshot_date,
    h.hub_customer_hk,
    -- For each satellite, get the latest load_date <= snapshot_date
    MAX(CASE WHEN s_crm.load_date <= snap.snapshot_date THEN s_crm.load_date END)
        AS sat_crm_load_date,
    MAX(CASE WHEN s_erp.load_date <= snap.snapshot_date THEN s_erp.load_date END)
        AS sat_erp_load_date
FROM hub_customer h
CROSS JOIN (
    -- Date spine (simplified; in practice, union load dates from all satellites)
    SELECT DISTINCT CAST(load_date AS DATE) AS snapshot_date FROM sat_customer_crm
) snap
LEFT JOIN sat_customer_crm s_crm
    ON h.hub_customer_hk = s_crm.hub_customer_hk
    AND s_crm.load_date <= snap.snapshot_date
LEFT JOIN sat_customer_erp s_erp
    ON h.hub_customer_hk = s_erp.hub_customer_hk
    AND s_erp.load_date <= snap.snapshot_date
GROUP BY snap.snapshot_date, h.hub_customer_hk;

PIT tables are precomputed and rebuilt on each load. They make downstream Satellite joins trivial — one join per Satellite using the PIT-resolved load_date.
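To see why the PIT table pays off, here is the downstream join it enables, sketched against SQLite with toy data. Table and column names follow the examples above; `sat_customer_erp` and all sample rows are invented for illustration. Each Satellite is joined once, on equality, using the PIT-resolved load_date — no range scans, no MAX subqueries:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE pit_customer (snapshot_date TEXT, hub_customer_hk TEXT,
        sat_crm_load_date TEXT, sat_erp_load_date TEXT);
    CREATE TABLE sat_customer_crm (hub_customer_hk TEXT, load_date TEXT, email TEXT);
    CREATE TABLE sat_customer_erp (hub_customer_hk TEXT, load_date TEXT, country TEXT);

    -- Two CRM versions of the same customer, one ERP version
    INSERT INTO sat_customer_crm VALUES ('hk1', '2024-01-01', 'old@example.com');
    INSERT INTO sat_customer_crm VALUES ('hk1', '2024-03-01', 'new@example.com');
    INSERT INTO sat_customer_erp VALUES ('hk1', '2024-01-15', 'DE');

    -- PIT row: which satellite versions were current on 2024-02-01
    INSERT INTO pit_customer VALUES ('2024-02-01', 'hk1', '2024-01-01', '2024-01-15');
""")

row = con.execute("""
    SELECT p.snapshot_date, c.email, e.country
    FROM pit_customer p
    LEFT JOIN sat_customer_crm c
        ON c.hub_customer_hk = p.hub_customer_hk
        AND c.load_date = p.sat_crm_load_date
    LEFT JOIN sat_customer_erp e
        ON e.hub_customer_hk = p.hub_customer_hk
        AND e.load_date = p.sat_erp_load_date
    WHERE p.snapshot_date = '2024-02-01'
""").fetchone()
print(row)  # the customer as of 2024-02-01: old email, DE
```

The query correctly returns the January email even though a newer CRM row exists, because the PIT table already resolved which version was current at the snapshot.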


Hash Key Generation

Hash keys must be deterministic and consistent across all loads. In practice, teams use either MD5 (128-bit) or SHA-256 (256-bit) on the upper-cased, trimmed business key.

# Python — deterministic hash key generation
import hashlib

def generate_hub_hk(business_key: str) -> str:
    # Normalize: uppercase, strip whitespace
    normalized = str(business_key).upper().strip()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def generate_link_hk(*hub_hks: str) -> str:
    # Concatenate sorted hub keys to ensure order-independence
    combined = "||".join(sorted(hub_hks))
    return hashlib.md5(combined.encode("utf-8")).hexdigest()

# Example
customer_hk = generate_hub_hk("CUST-00123")  # consistent hash every time
order_hk = generate_hub_hk("ORD-98765")
link_hk = generate_link_hk(customer_hk, order_hk)

Some teams prefer SHA-256 to reduce collision risk in very large datasets. Either works — just be consistent across your entire vault.


Data Vault 2.0 Additions

Dan Linstedt's Data Vault 2.0 (published ~2013) added several constructs beyond the original methodology:

| Construct | Purpose |
| --- | --- |
| Reference Tables | Lookup tables (country codes, status values) shared across the vault |
| Same-As Links (SAL) | Map duplicate business keys discovered after load (deduplication without modifying Hubs) |
| Computed Satellites | Business-rule-derived attributes computed from Raw Vault data |
| Effectivity Satellites | Track when a Link relationship was actually active, not merely recorded |
| Business Vault | Computed, joined, rule-applied layer on top of the Raw Vault |

For most teams starting out, focus on Hubs, Links, and Satellites first. Add PIT tables when query performance becomes an issue. The other constructs come later.


Why Data Vault for Enterprise?

1. Auditability is built in. Every row carries load_date and record_source. You can always answer "what did our CRM say about customer X on March 3rd, 2022 at 14:32?" — a question most Kimball star schemas cannot answer.

2. Source systems can be added without modifying existing tables. New ERP system? Add a new Satellite to Hub_Customer. Existing pipelines are untouched. This is the core architectural advantage.

3. Parallel loading. Hubs, Links, and Satellites have no load-order dependencies on each other (within a load cycle). This enables highly parallelized ETL.

4. Handles conflicting data gracefully. CRM says a customer lives in Germany. ERP says France. In a star schema, one source wins. In Data Vault, both live in separate Satellites — unresolved conflict is explicit, not hidden.


When Data Vault Is NOT the Right Choice

  • Small, stable domains with 1-2 source systems — the overhead isn't justified. Use Kimball.
  • Teams without dedicated data engineering resources — Data Vault requires disciplined tooling and modeling expertise.
  • Ad-hoc analytics platforms — if your primary output is exploratory dashboards, the Information Mart layer adds friction without benefit.
  • Time to first dashboard matters most — you can't query a Raw Vault directly in most BI tools without a mart layer. Expect an extra 4-8 weeks of work to get there.

Tooling

Most modern Data Vault implementations use dbt for the transformation layer. Libraries like dbt-datavault4dbt and AutomateDV (formerly dbtvault) provide macros that generate Hub/Link/Satellite loads from YAML configuration, dramatically reducing boilerplate.

See What Is dbt? for a primer on dbt if you're not already using it.


Common Mistakes

1. Putting business logic in the Raw Vault. The Raw Vault is source-faithful — load what the source gives you, transformation-free. Business rules go in the Business Vault or mart layer.

2. Modeling relationships as Hub attributes. "Customer has a preferred_region foreign key" — that relationship belongs in a Link, not a Satellite attribute.

3. One giant Satellite per Hub. Split Satellites by source system and by rate-of-change. Attributes that change frequently (session data) shouldn't be in the same Satellite as slowly-changing attributes (customer name) — it causes excessive history rows.

4. Inconsistent hash key generation. If your Python loader and your dbt macros generate different hashes for the same business key, you'll silently create duplicate Hubs. Standardize hash generation in a shared library.


The Bottom Line

Data Vault trades query simplicity for architectural resilience. For enterprise data warehouses with many source systems, frequent changes, and regulatory audit requirements, that trade is almost always worth it.

For smaller, stable domains: use Kimball. For large, evolving, multi-source enterprise DWHs: Data Vault is the methodology that was actually designed for your problem.

Next step: If you're building a Data Vault on a modern lakehouse, read Medallion Architecture Explained to see how the Raw/Business/Mart layers map onto Bronze/Silver/Gold.

