Slowly Changing Dimensions Guide
Slowly Changing Dimensions Guide
Dimension tables describe the "who, what, where" of your data — customers, products, employees, locations. The problem is that reality changes: customers move, products get renamed, employees change departments. How you handle those changes in your warehouse determines whether historical reports tell the truth or quietly lie.
Slowly Changing Dimensions (SCDs) are the set of techniques for managing those changes in dimension tables. The concept comes from Ralph Kimball's dimensional modeling methodology and remains the most practical framework for data warehouse design. This guide covers SCD Types 1, 2, 3, and 4 with concrete SQL examples and honest trade-offs for each.
Quick Reference
| Type | Strategy | Historical Data | Storage Cost | Complexity |
|---|---|---|---|---|
| Type 1 | Overwrite | Lost | Low | Low |
| Type 2 | Add a new row | Preserved (full) | High | High |
| Type 3 | Add a column | Preserved (limited) | Medium | Medium |
| Type 4 | History table | Preserved (full) | Medium-High | Medium |
The Setup: A Customer Dimension
Throughout this guide, we'll use a dim_customer table. A customer starts in Berlin, then moves to Munich. Their loyalty tier gets upgraded from Silver to Gold. How does each SCD type handle these changes?
Initial state:
| customer_key | customer_id | name | city | loyalty_tier | updated_at |
|---|---|---|---|---|---|
| 1 | C-100 | Anna Müller | Berlin | Silver | 2024-01-10 09:00:00 |
SCD Type 1: Overwrite
Type 1 simply overwrites the old value with the new one. No history is kept. The dimension always reflects current state.
-- PostgreSQL: SCD Type 1 — overwrite on change
UPDATE dim_customer
SET
city = 'Munich',
loyalty_tier = 'Gold',
updated_at = NOW()
WHERE customer_id = 'C-100';
After the update:
| customer_key | customer_id | name | city | loyalty_tier | updated_at |
|---|---|---|---|---|---|
| 1 | C-100 | Anna Müller | Munich | Gold | 2024-03-15 14:30:00 |
The Berlin record is gone. Any historical fact table rows that joined to this dimension will now incorrectly show Anna as a Munich customer for her entire history — including orders placed when she was in Berlin.
When to use Type 1:
- The old value is genuinely meaningless (e.g., correcting a data entry error)
- Analysts have explicitly agreed that historical accuracy for this attribute doesn't matter
- Storage is very constrained
When not to use Type 1:
- For any attribute where "what was true at the time of the transaction" matters
- For compliance-sensitive data that requires an audit trail
SCD Type 2: Add a New Row
Type 2 is the workhorse of dimensional modeling. Instead of overwriting, you expire the old row and insert a new row for the updated state. Each row gets effective and expiry date columns plus a boolean is_current flag.
-- PostgreSQL: SCD Type 2 — expire old row, insert new row
-- Step 1: Expire the current row
UPDATE dim_customer
SET
valid_to = '2024-03-15',
is_current = FALSE
WHERE customer_id = 'C-100'
AND is_current = TRUE;
-- Step 2: Insert the new current row
INSERT INTO dim_customer
(customer_key, customer_id, name, city, loyalty_tier,
valid_from, valid_to, is_current, updated_at)
VALUES
(DEFAULT, 'C-100', 'Anna Müller', 'Munich', 'Gold',
'2024-03-15', '9999-12-31', TRUE, NOW());
After the update:
| customer_key | customer_id | name | city | loyalty_tier | valid_from | valid_to | is_current |
|---|---|---|---|---|---|---|---|
| 1 | C-100 | Anna Müller | Berlin | Silver | 2024-01-10 | 2024-03-15 | FALSE |
| 2 | C-100 | Anna Müller | Munich | Gold | 2024-03-15 | 9999-12-31 | TRUE |
Now a fact table row from January correctly joins to customer_key = 1 (Berlin, Silver). A fact row from April correctly joins to customer_key = 2 (Munich, Gold). History is preserved exactly.
Querying SCD Type 2
For current-state queries, filter on is_current = TRUE. For point-in-time queries, join on the date range:
-- PostgreSQL: point-in-time join against SCD Type 2 dimension
SELECT
f.order_id,
f.order_date,
f.amount_usd,
c.city AS customer_city_at_time,
c.loyalty_tier AS loyalty_tier_at_time
FROM fact_orders f
JOIN dim_customer c
ON f.customer_id = c.customer_id
AND f.order_date BETWEEN c.valid_from AND c.valid_to
WHERE f.order_date >= '2024-01-01';
When to use Type 2:
- Attributes where historical accuracy matters (geography, tier, status)
- Regulatory or compliance requirements for audit trails
- Customer journey and cohort analyses
Trade-offs:
- Dimension table grows with every change — at scale this can be substantial
- Queries must account for multiple rows per entity
- ETL logic is more complex to implement correctly
- Fact table foreign keys point to surrogate keys, not natural keys
Type 2 in dbt: Snapshots
dbt has first-class support for SCD Type 2 via its snapshot feature. Define the strategy in a snapshot file and dbt handles the expire/insert logic automatically:
# dbt snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}
{{
config(
target_schema='snapshots',
unique_key='customer_id',
strategy='timestamp',
updated_at='updated_at'
)
}}
SELECT
customer_id,
name,
city,
loyalty_tier,
updated_at
FROM {{ source('raw', 'customers') }}
{% endsnapshot %}
Run dbt snapshot and dbt generates the valid_from, valid_to, dbt_scd_id, and dbt_is_current columns automatically. This eliminates the most error-prone parts of SCD Type 2 implementation.
SCD Type 3: Add a Column
Type 3 preserves a limited amount of history by adding a "previous value" column alongside the current value. It's a middle ground — you get one generation of history without the complexity of multiple rows.
-- PostgreSQL: SCD Type 3 — add previous_city column
ALTER TABLE dim_customer
ADD COLUMN previous_city VARCHAR(100),
ADD COLUMN city_changed_at TIMESTAMPTZ;
-- Update: shift current to previous, set new current
UPDATE dim_customer
SET
previous_city = city,
city = 'Munich',
city_changed_at = NOW(),
updated_at = NOW()
WHERE customer_id = 'C-100'
AND is_current = TRUE;
After the update:
| customer_key | customer_id | name | city | previous_city | loyalty_tier |
|---|---|---|---|---|---|
| 1 | C-100 | Anna Müller | Munich | Berlin | Gold |
When to use Type 3:
- You need to support "compare current vs. previous" reporting without full history
- The attribute changes rarely and analysts care only about one historical value
- You want to avoid the complexity of Type 2
When not to use Type 3:
- When attributes change more than twice (the pattern breaks down — you'd need
previous_previous_city) - When point-in-time accuracy across arbitrary dates is required
Type 3 is underused in practice. It solves a specific reporting pattern cleanly, but it doesn't generalize.
SCD Type 4: History Table
Type 4 separates current state from history entirely. The main dimension table holds only the current row (like Type 1), while a separate history table captures every previous version.
-- PostgreSQL: SCD Type 4 — separate history table
-- Current state table (always one row per entity)
CREATE TABLE dim_customer_current (
customer_key BIGSERIAL PRIMARY KEY,
customer_id VARCHAR(20) NOT NULL UNIQUE,
name VARCHAR(200),
city VARCHAR(100),
loyalty_tier VARCHAR(20),
updated_at TIMESTAMPTZ
);
-- History table (one row per historical version)
CREATE TABLE dim_customer_history (
history_key BIGSERIAL PRIMARY KEY,
customer_id VARCHAR(20) NOT NULL,
name VARCHAR(200),
city VARCHAR(100),
loyalty_tier VARCHAR(20),
valid_from TIMESTAMPTZ NOT NULL,
valid_to TIMESTAMPTZ NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- On change: archive current to history, then update current
BEGIN;
INSERT INTO dim_customer_history
(customer_id, name, city, loyalty_tier, valid_from, valid_to)
SELECT customer_id, name, city, loyalty_tier, updated_at, NOW()
FROM dim_customer_current
WHERE customer_id = 'C-100';
UPDATE dim_customer_current
SET city = 'Munich', loyalty_tier = 'Gold', updated_at = NOW()
WHERE customer_id = 'C-100';
COMMIT;
Advantages of Type 4:
- The current-state table stays small and fast for the common case
- History is isolated and can be queried or archived independently
- Query performance for non-historical use cases is much better than Type 2
Disadvantages:
- Queries needing both current and historical data require joins across two tables
- More complex ETL — changes must be coordinated across two tables atomically
When to use Type 4:
- Very large dimensions where Type 2 would cause significant performance issues
- When most queries only need current state and historical queries are rare
- When you want clear physical separation between operational and historical data
Choosing the Right Type
This decision framework covers most real-world cases:
- "Was the old value simply wrong?" → Type 1 (error correction)
- "Does historical accuracy matter for this attribute?" → Type 2
- "Do we only need to compare current vs. one prior state?" → Type 3
- "Is the dimension huge and most queries need only current data?" → Type 4
- "Do we need both performance and full history?" → Type 4 (hybrid: current table for fast queries, history table for point-in-time)
In practice, most dimension tables use Type 2 for slowly changing attributes like geography, status, and category — and Type 1 for attributes where corrections are the norm and history is irrelevant.
What About Type 0, 5, 6, and 7?
Kimball defined additional types, though they see limited real-world adoption:
- Type 0: Never changes (fixed attributes like date of birth)
- Type 5: Type 1 + a mini-dimension for high-cardinality attributes
- Type 6: Combined Type 1 + 2 + 3 (current value in both the current row and a "current_X" column on historical rows)
- Type 7: Dual foreign keys — one to Type 1 (current) and one to Type 2 (historical) — allowing both query styles against the same fact table
Types 6 and 7 are worth knowing if you're working with complex dimensional models, but for most teams Type 2 (with dbt snapshots) covers 90% of use cases.
Exploring SCDs Without a Warehouse
If you're working with SCD data that lands in flat files — exported snapshots, dbt snapshot CSV outputs, or archive dumps — Harbinger Explorer lets you query them directly in the browser with DuckDB WASM. You can upload multiple CSV files representing different snapshot dates and run SQL joins across them to reconstruct point-in-time views, without loading a full warehouse environment.
Conclusion
SCD Type 2 is the default for most dimension attributes where history matters. It preserves point-in-time accuracy, pairs cleanly with dbt snapshots, and gives analysts the full picture for cohort and journey analysis. Type 1 is right for corrections. Types 3 and 4 solve specific edge cases. The key is making a deliberate choice per attribute — not defaulting to overwrite because it's simpler.
For the broader context of where dimension tables fit in your data architecture, read Medallion Architecture Explained and Data Lakehouse Architecture Explained.
Continue Reading
Continue Reading
Data Deduplication Strategies: Hash, Fuzzy, and Record Linkage
Airflow vs Dagster vs Prefect: An Honest Comparison
An unbiased comparison of Airflow, Dagster, and Prefect — covering architecture, DX, observability, and real trade-offs to help you pick the right orchestrator.
Change Data Capture Explained
A practical guide to CDC patterns — log-based, trigger-based, and polling — with Debezium configuration examples and Kafka Connect integration.
Try Harbinger Explorer for free
Connect any API, upload files, and explore with AI — all in your browser. No credit card required.
Start Free Trial