Change Data Capture Explained
Your operational database doesn't care about your data warehouse. Every time a customer updates their profile, cancels an order, or changes an email address, that state change lives in PostgreSQL or MySQL — and your analytics layer is immediately out of date. Change Data Capture (CDC) solves this by detecting and streaming row-level changes from a source database to downstream systems in near-real time.
This guide walks through the three main CDC patterns, how Debezium makes log-based CDC production-ready, and how to wire it all into Kafka Connect.
TL;DR
| Pattern | Latency | Source Impact | Complexity |
|---|---|---|---|
| Log-based (WAL / binlog) | Seconds | Very low | High setup |
| Trigger-based | Seconds | Medium | Medium |
| Query-based (polling) | Minutes | High | Low |
Log-based CDC is the gold standard for production workloads. The others are stepping stones or fallbacks.
The Three CDC Patterns
1. Log-Based CDC
Every major database maintains a transaction log for crash recovery. PostgreSQL calls it the Write-Ahead Log (WAL). MySQL calls it the binary log (binlog). These logs contain a complete, ordered record of every row-level change — including deletes.
Log-based CDC taps into this stream as a secondary consumer, without modifying the source application at all. It reads the transaction log, extracts change events, and forwards them downstream.
Advantages:
- No additional load on source queries
- Captures deletes and updates, not just inserts
- Sub-second latency is achievable
- No application-level changes required
Limitations:
- Requires database-level privileges (replication role)
- Log retention must be configured — logs get truncated
- Schema changes require careful handling downstream
PostgreSQL — enabling logical replication:
```sql
-- postgresql.conf: set wal_level = logical (requires a DB restart)

-- Create a replication slot for Debezium
SELECT pg_create_logical_replication_slot('debezium_slot', 'pgoutput');

-- Grant replication privilege to the CDC user
ALTER ROLE cdc_user REPLICATION LOGIN;
```
2. Trigger-Based CDC
Database triggers fire on INSERT, UPDATE, or DELETE and write a copy of the changed row to a dedicated changelog table. A downstream job polls that table and processes new rows.
```sql
-- PostgreSQL: trigger-based CDC example
CREATE TABLE customers_changes (
    change_id   BIGSERIAL PRIMARY KEY,
    change_type VARCHAR(10),
    changed_at  TIMESTAMPTZ DEFAULT NOW(),
    old_data    JSONB,
    new_data    JSONB
);

CREATE OR REPLACE FUNCTION capture_customer_changes()
RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO customers_changes (change_type, old_data)
        VALUES ('DELETE', row_to_json(OLD)::jsonb);
    ELSIF TG_OP = 'UPDATE' THEN
        INSERT INTO customers_changes (change_type, old_data, new_data)
        VALUES ('UPDATE', row_to_json(OLD)::jsonb, row_to_json(NEW)::jsonb);
    ELSE
        INSERT INTO customers_changes (change_type, new_data)
        VALUES ('INSERT', row_to_json(NEW)::jsonb);
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_customers_cdc
AFTER INSERT OR UPDATE OR DELETE ON customers
FOR EACH ROW EXECUTE FUNCTION capture_customer_changes();
```
Trigger-based CDC works, but every write to the source table now incurs trigger overhead. The changelog table also needs regular pruning, and the approach doesn't scale gracefully under high write loads.
3. Query-Based CDC (Polling)
The simplest approach: run a query every N minutes that selects rows modified since the last checkpoint, using an updated_at timestamp column.
```sql
-- Query-based CDC: fetch rows changed since the last run
-- Standard SQL: works on PostgreSQL, MySQL, and most other RDBMS
SELECT *
FROM customers
WHERE updated_at > '2024-01-15 10:00:00'
ORDER BY updated_at ASC;
```
The fundamental problem: hard deletes are invisible. If a row is removed, there is no updated_at to track. Soft deletes (a deleted_at column) partially solve this, but require discipline across the entire source application.
Query-based CDC is acceptable for small tables, infrequent updates, or environments where you control the source schema. For high-volume production workloads, it's a compromise.
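The checkpoint loop behind this pattern can be sketched in a few lines of Python. This is a minimal illustration using an in-memory SQLite table as a stand-in for the source database; the table, columns, and checkpoint value are hypothetical:

```python
import sqlite3

def poll_changes(conn, last_checkpoint):
    """Fetch rows modified since the last checkpoint and return them
    along with the new checkpoint (the max updated_at seen)."""
    rows = conn.execute(
        "SELECT id, email, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at ASC",
        (last_checkpoint,),
    ).fetchall()
    new_checkpoint = rows[-1][2] if rows else last_checkpoint
    return rows, new_checkpoint

# Demo against an in-memory SQLite table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "a@example.com", "2024-01-15 09:00:00"),
     (2, "b@example.com", "2024-01-15 11:00:00")],
)

rows, checkpoint = poll_changes(conn, "2024-01-15 10:00:00")
print(rows)        # only the row modified after the checkpoint
print(checkpoint)  # becomes the watermark for the next run
```

In a real deployment the checkpoint would be persisted (a state table, a file, or the orchestrator's metadata) so the job can resume after a restart. Note that the hard-delete blind spot remains: a deleted row never appears in this result set.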
Debezium: Log-Based CDC That Actually Ships
Debezium is an open-source CDC platform built on Apache Kafka. It reads transaction logs from PostgreSQL, MySQL, MongoDB, SQL Server, Oracle, and more — then emits structured change events to Kafka topics.
Each event contains the full before/after state of the row, the operation type (c, u, d, r), and metadata including source table, transaction ID, LSN, and wall-clock timestamp. This is richer and more reliable than anything polling-based can produce.
Debezium PostgreSQL Connector Configuration
```json
{
  "name": "customers-postgres-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres-host",
    "database.port": "5432",
    "database.user": "cdc_user",
    "database.password": "your_password",
    "database.dbname": "production_db",
    "table.include.list": "public.customers,public.orders",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_slot",
    "publication.name": "debezium_publication",
    "topic.prefix": "prod",
    "snapshot.mode": "initial"
  }
}
```
This produces Kafka topics named prod.public.customers and prod.public.orders. Every insert, update, and delete on those tables becomes a Kafka message.
The Debezium Event Structure
```json
{
  "op": "u",
  "before": {
    "id": 42,
    "email": "old@example.com",
    "updated_at": 1705312800000
  },
  "after": {
    "id": 42,
    "email": "new@example.com",
    "updated_at": 1705316400000
  },
  "source": {
    "version": "2.5.0.Final",
    "connector": "postgresql",
    "name": "prod",
    "ts_ms": 1705316400123,
    "db": "production_db",
    "schema": "public",
    "table": "customers",
    "txId": 1234567,
    "lsn": 987654321
  }
}
```
op values: c = create (insert), u = update, d = delete, r = read (from initial snapshot).
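A consumer that reconstructs current table state from these events can be sketched in pure Python (no Kafka client). The event shapes below follow the structure above, trimmed to the fields that matter; tombstones arrive as a None value for a previously deleted key:

```python
def apply_event(state, key, event):
    """Apply one Debezium-style change event to an in-memory
    state dict keyed by primary key."""
    if event is None:          # tombstone: follows a delete, nothing to do here
        return
    op = event["op"]
    if op in ("c", "r", "u"):  # create, snapshot read, update: upsert 'after'
        state[key] = event["after"]
    elif op == "d":            # delete: 'after' is null, drop the row
        state.pop(key, None)

state = {}
events = [
    (42, {"op": "c", "before": None, "after": {"id": 42, "email": "old@example.com"}}),
    (42, {"op": "u", "before": {"id": 42, "email": "old@example.com"},
          "after": {"id": 42, "email": "new@example.com"}}),
    (7,  {"op": "c", "before": None, "after": {"id": 7, "email": "x@example.com"}}),
    (7,  {"op": "d", "before": {"id": 7, "email": "x@example.com"}, "after": None}),
    (7,  None),  # tombstone after the delete
]
for key, event in events:
    apply_event(state, key, event)

print(state)  # only customer 42 survives, with its latest email
```

Note how the sketch never touches before except implicitly: for inserts (op: c) that field is always null, and deletes are detected from op, not from a missing after.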
Deploying with Kafka Connect
Debezium runs as a Kafka Connect source connector. If you already operate Kafka Connect, deploying Debezium is straightforward: add the connector JAR to the Connect plugin path and POST the configuration to the Connect REST API.
```bash
# Register the connector via the Kafka Connect REST API
curl -X POST http://kafka-connect:8083/connectors \
  -H "Content-Type: application/json" \
  -d @customers-postgres-connector.json

# Check connector status
curl http://kafka-connect:8083/connectors/customers-postgres-connector/status
```
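The status check is worth automating, since a connector can report RUNNING while one of its tasks has failed. A small sketch that evaluates the JSON returned by the status endpoint (the payload shape matches Kafka Connect's documented response; the failure shown is illustrative):

```python
def connector_healthy(status):
    """Return True only if the connector and every task report RUNNING."""
    if status["connector"]["state"] != "RUNNING":
        return False
    return all(task["state"] == "RUNNING" for task in status["tasks"])

# Example payload as returned by GET /connectors/<name>/status
status = {
    "name": "customers-postgres-connector",
    "connector": {"state": "RUNNING", "worker_id": "kafka-connect:8083"},
    "tasks": [
        {"id": 0, "state": "FAILED", "worker_id": "kafka-connect:8083"},
    ],
    "type": "source",
}

print(connector_healthy(status))  # False: a task failed even though the connector runs
```

Wired to a scheduler and an alerting hook, a check like this catches the common failure mode where the connector process is alive but no events are flowing.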
From the Kafka topic you can route change events to any sink: object storage (S3, GCS), a data warehouse (Snowflake, BigQuery), or Elasticsearch. The Confluent Hub provides ready-made sink connectors for most common targets.
Handling Schema Changes
The trickiest long-term challenge is schema evolution. When the source team adds a column, your CDC consumers need to handle both old and new event formats without failing.
Debezium integrates with Confluent Schema Registry (or compatible alternatives like Karapace) to manage Avro, Protobuf, or JSON Schema evolution. With BACKWARD or FULL compatibility mode, adding nullable columns is safe. Dropping or renaming columns is always a breaking change.
Best practice: Treat source schema changes as a deployment event. Use schema registry compatibility checks in your CI pipeline, and coordinate changes with downstream consumers before rolling out. Pairing Debezium with a data contract process closes this loop properly.
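One way for a consumer to tolerate an added nullable column, independent of the serialization format, is to normalize every event payload against the expected field list, defaulting fields that older events do not carry. A minimal sketch (the field names, including the later-added loyalty_tier, are illustrative):

```python
EXPECTED_FIELDS = ["id", "email", "loyalty_tier"]  # "loyalty_tier" was added later

def normalize(row):
    """Project a change-event payload onto the expected schema,
    defaulting fields that older events don't carry to None."""
    return {field: row.get(field) for field in EXPECTED_FIELDS}

old_event = {"id": 42, "email": "a@example.com"}                          # pre-migration shape
new_event = {"id": 43, "email": "b@example.com", "loyalty_tier": "gold"}  # post-migration shape

print(normalize(old_event))  # the missing column comes through as None
print(normalize(new_event))
```

This mirrors what BACKWARD compatibility guarantees at the registry level: readers built for the new schema can still decode old events, because the new column is nullable with a default.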
CDC vs. Full Loads: When to Use What
| Scenario | CDC | Full Load |
|---|---|---|
| High-volume OLTP tables (>1M changes/day) | ✅ | ❌ Too slow |
| Small reference tables (<10k rows) | Overkill | ✅ Simple |
| Capturing hard deletes | ✅ | ❌ Invisible |
| Source DB has no replication support | ❌ | ✅ |
| Near-real-time latency required (<1 min) | ✅ | ❌ |
| Team has no Kafka infrastructure | ❌ Complex | ✅ |
Common Pitfalls
Unmonitored WAL retention. A logical replication slot forces PostgreSQL to retain WAL until the connector has consumed it, so a Debezium connector that stays offline can fill the database disk. Capping retention with max_slot_wal_keep_size (PostgreSQL 13+) protects the disk, but once the cap is exceeded the slot is invalidated: the connector loses its position and must re-snapshot. Monitor replication slot lag continuously and alert well before either limit bites.
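On PostgreSQL, slot lag can be watched by comparing the current WAL position against each slot's restart point. A query along these lines (PostgreSQL 10+) works well as a recurring health check:

```sql
-- How much WAL is each replication slot holding back?
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;
```

An inactive slot with steadily growing retained_wal is the classic signature of a stopped or crashed connector.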
Not planning the initial snapshot. Before streaming new changes, Debezium takes a consistent snapshot of existing data. On large tables this can run for hours. Plan accordingly — and validate the snapshot before enabling real-time streaming.
Ignoring tombstone events. After emitting a delete event, Debezium publishes a null-value tombstone message to the same key. Some consumers fail silently on this. Handle it explicitly.
Null before-state on inserts. For op: c, the before field is always null. Always check the op field before accessing before.
Compacted topics removing deletes. If you use log-compacted Kafka topics, tombstone messages will eventually disappear. For CDC streams that must replay the full history of changes, use retention-based (not compaction-based) topics.
Exploring CDC Data Once It Lands
Once your CDC stream lands in object storage as Parquet or CSV, you still need to reconstruct current state — applying inserts, updates, and deletes in sequence. Tools like dbt handle this with snapshot models. For ad-hoc validation — "did this specific record change today? What did it look like before?" — you want to query the raw change log quickly without spinning up infrastructure.
Harbinger Explorer lets you upload Parquet or CSV files and query them directly in the browser using DuckDB WASM. Ask in plain English: "show me all rows where op equals u and customer_id equals 42" — the AI generates the SQL against your actual schema. It's a fast way to validate CDC correctness during development.
Conclusion
CDC — specifically log-based CDC via Debezium — is the most reliable way to keep downstream systems synchronized with operational databases in real time. It's not trivial to set up, but for high-volume, latency-sensitive pipelines, it's the right tool. Start with a single table, validate the event format end to end, plan for schema changes, and expand from there.
For structuring the CDC data that lands in your lakehouse, read the Medallion Architecture Explained guide — it shows how raw change events fit into a Bronze → Silver → Gold pipeline.