Harbinger Explorer

Back to Knowledge Hub
Engineering
Published:

Surrogate vs Natural Keys: When to Use Which

7 min read·Tags: data modeling, surrogate keys, natural keys, data vault, SCD, star schema, database design

Surrogate vs Natural Keys: When to Use Which

You inherited a data warehouse where every dimension table uses the source system's customer_code as its primary key. Then the source system gets migrated and codes are reassigned. Now you're rebuilding half your model. This is the natural key problem.

But surrogate keys have their own failure modes. The debate between surrogate and natural keys is older than most modern data stacks — and the right answer genuinely depends on your context.

TL;DR

Natural KeySurrogate Key
DefinitionA key that exists in the real world (email, SSN, product code)A system-generated identifier with no business meaning
Examplescustomer_email, ISIN, order_referenceAuto-increment INT, UUID, hash key
StabilityBusiness-controlled, can changeSystem-controlled, never changes
DebuggabilityHigh — humans can read itLow — must join back to source
Join performanceDepends on data type and cardinalityINT/BIGINT is fast; UUID is slower
Recommended forLookup tables, reference data, small dimensionsFact tables, large dimensions, SCD tracking

Natural Keys: What They Are and Where They Break

A natural key is a column (or combination of columns) that uniquely identifies a record using real-world meaning. An ISIN identifies a financial instrument. An email address identifies a user. A product barcode identifies a product.

Where natural keys work well

-- PostgreSQL: ISO currency codes as natural key
-- Currencies don't change their codes. USD is USD. EUR is EUR.
CREATE TABLE currencies (
    currency_code CHAR(3) PRIMARY KEY,  -- natural key: ISO 4217
    currency_name TEXT NOT NULL,
    decimal_places SMALLINT
);

-- Safe join — currency_code is stable and universally understood
SELECT t.amount, c.currency_name
FROM transactions t
JOIN currencies c ON t.currency_code = c.currency_code;

ISO codes, country codes, standard enums — these are genuinely stable. Natural keys here add clarity without risk.

Where natural keys break

Natural keys fail when the business controls the value and the business changes:

  1. Reuse — A user deletes their account. A new user registers with the same email. Your history is now contaminated.
  2. Correction — A customer's national ID was entered with a typo. Fixing it cascades through every foreign key that referenced the old value.
  3. Merge events — Two companies merge. Their company_id values must be consolidated. Every downstream table breaks.
  4. Multi-source integration — System A uses USER_001; System B uses U-00001. They refer to the same person. Neither is a reliable primary key across the integrated model.
-- PostgreSQL: the multi-source problem
-- You can't use natural keys when two systems have different ID schemes
-- for the same real-world entity

-- Source A
INSERT INTO customers VALUES ('USER_001', 'Alice', 'alice@example.com');

-- Source B (same person, different system)
INSERT INTO customers VALUES ('U-00001', 'Alice', 'alice@example.com');

-- Which one is the primary key? Neither.
-- You need a surrogate to represent the unified entity.

Surrogate Keys: System-Generated, Business-Agnostic

A surrogate key is a meaningless identifier created by the data system. The most common forms:

  • Auto-increment integer (SERIAL, IDENTITY) — fast, compact, sequential
  • UUID — globally unique, safe for distributed generation, but larger
  • Hash key (used in Data Vault) — derived from natural key attributes, deterministic
-- PostgreSQL: surrogate key with IDENTITY
CREATE TABLE customers (
    customer_sk BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- surrogate
    customer_id TEXT UNIQUE NOT NULL,  -- natural key, still stored!
    customer_name TEXT,
    email TEXT,
    created_at TIMESTAMPTZ DEFAULT now()
);

Note that customer_id (the natural key) is still stored as a regular column — not as the primary key. This is the standard pattern: use the surrogate as the join key, keep the natural key for lookups and debugging.

Surrogate keys in SCD Type 2

Slowly Changing Dimensions (SCD Type 2) are impossible without surrogate keys. When a customer changes their address, you insert a new version of the row with a new surrogate key — the old row and new row have the same natural key but different surrogate keys.

-- PostgreSQL: SCD Type 2 with surrogate key
CREATE TABLE dim_customer (
    customer_sk    BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id    TEXT NOT NULL,       -- natural key (same across versions)
    customer_name  TEXT,
    city           TEXT,
    valid_from     DATE NOT NULL,
    valid_to       DATE,                -- NULL means current record
    is_current     BOOLEAN DEFAULT TRUE
);

-- Customer moves from Berlin to Hamburg: new row, same customer_id
INSERT INTO dim_customer (customer_id, customer_name, city, valid_from, valid_to, is_current)
VALUES
    ('CUST-042', 'Markus Bauer', 'Berlin',  '2023-01-01', '2024-06-30', FALSE),
    ('CUST-042', 'Markus Bauer', 'Hamburg', '2024-07-01', NULL,         TRUE);

Without the surrogate customer_sk, your fact table references would break every time a customer record is versioned.


UUID vs Auto-Increment Integer

Both are surrogate keys, but they behave differently at scale.

Auto-increment (BIGINT)UUID (v4)
Size8 bytes16 bytes
Join performanceFast (integer comparison)Slower (string/byte comparison)
Index fragmentationLow (sequential inserts)High (random inserts)
Distributed generationRequires coordinationNo coordination needed
DebuggabilityLowSlightly better (globally unique)
Best forSingle-writer systemsDistributed / multi-source systems

UUID v7 (time-ordered UUIDs) is increasingly popular as it combines global uniqueness with sequential insertion order, reducing index fragmentation.


Hash Keys in Data Vault

Data Vault 2.0 uses hash keys — a specific type of surrogate key derived from the natural key using a deterministic hash function (typically SHA-256 or MD5). The hash is computed at load time and used as the Hub primary key.

# Python: compute a Data Vault hash key
import hashlib

def compute_hash_key(business_key: str) -> str:
    # Deterministic hash key for Data Vault Hub
    normalized = business_key.strip().upper()  # normalize before hashing
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Example
customer_hk = compute_hash_key("CUST-042")
print(customer_hk)
# => 3e7c... (same every time for the same input)

The advantage: hash keys are deterministic across source systems. If System A and System B both load CUST-042, they generate the same hash key — enabling integration without a central sequence generator.


Decision Guide

Use this as a starting point, not a rigid rule:

SituationRecommendation
Reference/lookup table with stable codes (ISO, enum)Natural key is fine
Fact table in a star schemaSurrogate key (integer)
SCD Type 2 dimensionSurrogate key — required
Multi-source integration (Data Vault or similar)Hash key
Distributed microservices generating IDsUUID (v4 or v7)
Small config table with < 1000 rowsNatural key is fine
Customer/user entity in a transactional DBSurrogate + natural key stored as separate column

The Worst of Both Worlds: Composite Natural Keys

A composite natural key combines multiple columns to achieve uniqueness — for example, (order_date, order_number). These are the most fragile choice:

  • Every foreign key must carry all columns
  • Changes to any component cascade everywhere
  • Joins become multi-column, which is harder to index and read

If you inherit composite natural keys, replace them with surrogates at the first integration layer. Store the originals as non-key columns.


Practical Recommendation

In most analytical data models: surrogate keys as primary keys, natural keys stored as attributes. This decouples your model from upstream systems and makes SCD tracking straightforward.

The one exception: stable reference tables (ISO currencies, country codes, calendar dates). Natural keys there are actually clearer and safer than arbitrary integers.

If you're auditing an existing model and need to inspect key distribution, uniqueness, or join fan-outs across tables, Harbinger Explorer lets you query your CSV or Parquet exports directly in the browser — useful for quick ad-hoc profiling without a dedicated BI tool.


Wrapping Up

Neither surrogate nor natural keys are universally correct. Natural keys add semantic clarity but tie your model to business logic that can change. Surrogate keys decouple structure from meaning but add indirection. The pattern that scales best: keep both — surrogate as the join key, natural key as a queryable attribute.

The real mistake is treating this as a one-time decision. As your data model evolves, revisit key strategy whenever you add a new source or hit a stability problem in an existing dimension.


Continue Reading


Continue Reading

Try Harbinger Explorer for free

Connect any API, upload files, and explore with AI — all in your browser. No credit card required.

Start Free Trial

Command Palette

Search for a command to run...