Event-Driven Data Architecture with Kafka and CQRS
Batch jobs run at midnight and produce data that's 12 hours stale by the time anyone reads it. Meanwhile, your customers are making decisions — buying, churning, clicking — in real time. Event-driven data architecture is the pattern that bridges this gap by making data flow as events happen, not on a schedule.
What Is Event-Driven Data Architecture?
In an event-driven architecture, every meaningful state change in your system is recorded as an immutable event: OrderPlaced, UserSignedUp, PaymentFailed, InventoryUpdated. These events flow through a message broker (most commonly Apache Kafka) and are consumed by multiple downstream systems independently.
For data platforms specifically, this means:
- Your warehouse can consume events and update tables in near-real-time
- ML models can react to events without polling a database
- Operational systems and analytics systems share the same event stream
- Historical state can be reconstructed by replaying the event log
```
             Application Services
                      │  emits events
                      ▼
             Apache Kafka (Event Log)
                      │
      ┌───────────────┼───────────────┬───────────────┐
      ▼               ▼               ▼               ▼
  Warehouse      ML Feature        Search        Notification
  (Flink/          Store           Index           Service
   Spark)      (Redis/Feast)       (ES)          (SendGrid)
```
Core Concepts
Event Sourcing
Event Sourcing means your application's state is derived from an append-only log of events — not from a mutable database table. Instead of updating a row (UPDATE orders SET status = 'shipped'), you append an event (OrderShipped { order_id, shipped_at, carrier }).
The current state of any entity is computed by replaying all events for that entity.
Benefits for data teams:
- Perfect audit trail — every state change is recorded with timestamp and actor
- Time travel is natural — replay events up to any point in time
- Schema evolution is explicit — new event types don't break old consumers
- Rebuilding derived tables from scratch is always possible
The cost: Replaying millions of events to get current state is slow. You need snapshots (periodic state captures) to make reads practical.
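The replay-plus-snapshot idea can be sketched in plain Python. This is a minimal illustration with hypothetical event names and an `OrderState` class invented for the example, not a library API:

```python
from dataclasses import dataclass, field

@dataclass
class OrderState:
    status: str = "new"
    items: list = field(default_factory=list)

def apply_event(state: OrderState, event: tuple) -> OrderState:
    """Fold one (event_type, payload) pair into the current state."""
    event_type, payload = event
    if event_type == "ItemAdded":
        state.items.append(payload["sku"])
    elif event_type == "OrderShipped":
        state.status = "shipped"
    return state

def rebuild(events, snapshot=None, snapshot_offset=0):
    """Replay from a snapshot (if given) instead of from offset 0."""
    state = snapshot or OrderState()
    for event in events[snapshot_offset:]:
        state = apply_event(state, event)
    return state

log = [
    ("ItemAdded", {"sku": "WIDGET-42"}),
    ("ItemAdded", {"sku": "WIDGET-7"}),
    ("OrderShipped", {"carrier": "DHL"}),
]

full = rebuild(log)                                     # replay everything
snap = rebuild(log[:2])                                 # snapshot after 2 events
fast = rebuild(log, snapshot=snap, snapshot_offset=2)   # snapshot + tail only
```

The snapshot path reads only the events after the snapshot offset, which is what makes reads practical once the log grows to millions of events.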
CQRS — Command Query Responsibility Segregation
CQRS separates the write path (commands that change state, emitting events) from the read path (queries that read from pre-built read models).
```
    Command Side                       Query Side
      (Write)                            (Read)

    User Action                        User Query
         │                                  │
         ▼                                  ▼
  Command Handler                  Read Model / View
         │                        (denormalized table,
         │ emits event             optimized for query)
         ▼                                  ▲
 Event Store (Kafka)                        │
         │                                  │
         └──────────── Projector ───────────┘
                 (builds and updates
               read models from events)
```
For data platforms, CQRS maps naturally to the Medallion Architecture:
- Bronze layer = raw event log (event store)
- Silver layer = cleansed, joined projections (read models)
- Gold layer = business-ready aggregations (denormalized views)
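A projector is just a fold from events into a denormalized view. A minimal in-memory sketch, using a dict as the read model and the `order.placed` event type from later in this article (the aggregation itself is invented for illustration):

```python
def project(events):
    """Fold order events into a per-customer read model — the kind of
    denormalized view the query side serves directly."""
    view = {}  # customer_id -> {"order_count": ..., "lifetime_cents": ...}
    for event in events:
        if event["event_type"] == "order.placed":
            row = view.setdefault(
                event["customer_id"],
                {"order_count": 0, "lifetime_cents": 0},
            )
            row["order_count"] += 1
            row["lifetime_cents"] += event["total_cents"]
    return view

events = [
    {"event_type": "order.placed", "customer_id": "cust_1", "total_cents": 3998},
    {"event_type": "order.placed", "customer_id": "cust_1", "total_cents": 1999},
    {"event_type": "order.placed", "customer_id": "cust_2", "total_cents": 500},
]
read_model = project(events)
```

Because the projector is a pure function of the event log, dropping and rebuilding the read model is always an option — the same property that lets you rebuild Silver tables from Bronze.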
Apache Kafka as the Backbone
Kafka is the most common event broker for data platforms at scale. Key properties that make it useful:
| Property | What It Means for Data Teams |
|---|---|
| Persistent log | Consumers can replay from any offset, enabling backfill |
| Consumer groups | Multiple teams can consume the same event independently |
| Partitioning | Parallelism by key (e.g., user_id) for ordered processing |
| Retention | Configure retention by time or size — days to forever |
| Schema Registry | Enforce Avro/Protobuf schemas on producers |
| Exactly-once semantics | Configurable guarantees for critical financial pipelines |
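Per-key ordering comes from the partitioner: every message with a given key lands on the same partition. A simplified sketch of the idea — Kafka's default partitioner actually uses murmur2, so real assignments differ; CRC32 stands in here for determinism:

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in for Kafka's default partitioner: a stable hash of the
    # key modulo the partition count. Same key -> same partition, so
    # all events for one customer stay mutually ordered.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

p1 = partition_for("cust_xyz789", 6)
p2 = partition_for("cust_xyz789", 6)  # always equals p1
```

The flip side is the "per-partition only" ordering caveat in the trade-offs table below: events with different keys can be consumed in any relative order.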
Building an Event-Driven Data Pipeline
1. Event Schema Design
Design events with data consumers in mind. Bad event schemas are the #1 source of pain in event-driven systems.
```
// Good: self-describing, versioned, no domain-specific abbreviations
{
  "event_type": "order.placed",
  "event_id": "evt_01HXK3M7N8P2Q4R5T6V7W8X9Y0",
  "schema_version": "2.1",
  "occurred_at": "2026-04-03T09:15:00Z",
  "producer": "order-service",
  "payload": {
    "order_id": "ord_abc123",
    "customer_id": "cust_xyz789",
    "items": [
      {"sku": "WIDGET-42", "quantity": 2, "unit_price_cents": 1999}
    ],
    "total_cents": 3998,
    "currency": "EUR"
  }
}

// Bad: abbreviated, no version, no schema info
{
  "t": "ord_plcd",
  "oid": "abc123",
  "cid": "xyz",
  "amt": 39.98
}
```
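A cheap consumer-side guard is to reject events missing the envelope fields from the "good" example above. This is a sketch, not a substitute for Schema Registry enforcement; the function name is invented:

```python
# Envelope fields every event must carry (from the example schema above).
REQUIRED_ENVELOPE = {
    "event_type", "event_id", "schema_version",
    "occurred_at", "producer", "payload",
}

def valid_envelope(event: dict) -> bool:
    """True if the event carries every required envelope field."""
    return REQUIRED_ENVELOPE.issubset(event)

good = {
    "event_type": "order.placed",
    "event_id": "evt_01HXK3M7N8P2Q4R5T6V7W8X9Y0",
    "schema_version": "2.1",
    "occurred_at": "2026-04-03T09:15:00Z",
    "producer": "order-service",
    "payload": {},
}
bad = {"t": "ord_plcd", "oid": "abc123"}
```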
2. Kafka Producer (Python)
```python
# Python — Kafka producer with Avro schema
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField
from datetime import datetime, timezone

KAFKA_BOOTSTRAP = "kafka:9092"
SCHEMA_REGISTRY_URL = "http://schema-registry:8081"

ORDER_PLACED_SCHEMA = '''
{
  "type": "record",
  "name": "OrderPlaced",
  "namespace": "com.harbinger.orders",
  "fields": [
    {"name": "event_id", "type": "string"},
    {"name": "order_id", "type": "string"},
    {"name": "customer_id", "type": "string"},
    {"name": "total_cents", "type": "long"},
    {"name": "occurred_at", "type": "string"}
  ]
}
'''

schema_registry = SchemaRegistryClient({"url": SCHEMA_REGISTRY_URL})
avro_serializer = AvroSerializer(schema_registry, ORDER_PLACED_SCHEMA)
producer = Producer({"bootstrap.servers": KAFKA_BOOTSTRAP})

def publish_order_placed(order_id: str, customer_id: str, total_cents: int):
    now = datetime.now(timezone.utc)
    event = {
        "event_id": f"evt_{order_id}_{int(now.timestamp())}",
        "order_id": order_id,
        "customer_id": customer_id,
        "total_cents": total_cents,
        "occurred_at": now.isoformat(),
    }
    producer.produce(
        topic="orders.placed",
        key=customer_id,  # partition by customer for ordering
        value=avro_serializer(
            event, SerializationContext("orders.placed", MessageField.VALUE)
        ),
    )
    producer.flush()  # per-call flush keeps the example simple; batch in production
```
3. Stream Processing with Flink / Spark Structured Streaming
```python
# PySpark Structured Streaming — consume Kafka, write to Delta Lake
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, to_timestamp
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = (
    SparkSession.builder
    .appName("OrderEventsConsumer")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .getOrCreate()
)

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("order_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("total_cents", LongType()),
    StructField("occurred_at", StringType()),
])

raw_stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "orders.placed")
    .option("startingOffsets", "latest")
    .load()
)

parsed = (
    raw_stream
    .select(from_json(col("value").cast("string"), event_schema).alias("data"))
    .select("data.*")
    .withColumn("occurred_at", to_timestamp("occurred_at"))
)

query = (
    parsed.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "s3://checkpoints/orders-placed/")
    .start("s3://lakehouse/silver/orders_placed/")
)
query.awaitTermination()
```
Event-Driven Patterns for Data Platforms
Outbox Pattern
The Outbox pattern solves a common reliability problem: you write to your database and want to publish an event to Kafka atomically. If you write to both independently, one can succeed while the other fails.
Solution: Write the event to an outbox table in the same database transaction as your business write. A separate process (typically Debezium, via Change Data Capture) tails the outbox table and publishes to Kafka.
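The key is the single transaction. A runnable sketch with sqlite3 standing in for the service database and a polling relay standing in for Debezium (table and function names are invented for the example):

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         topic TEXT, payload TEXT, published INTEGER DEFAULT 0);
""")

def place_order(order_id: str):
    # Business write and event write commit or roll back together.
    with db:
        db.execute("INSERT INTO orders VALUES (?, 'placed')", (order_id,))
        db.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders.placed", json.dumps({"order_id": order_id})),
        )

def relay(publish):
    # Stand-in for Debezium/CDC: drain unpublished rows to the broker.
    rows = db.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()

sent = []
place_order("ord_abc123")
relay(lambda topic, payload: sent.append((topic, payload)))
```

If the business insert fails, no outbox row is written and nothing is published; if the relay crashes, the unpublished rows are still there on restart.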
Event-Driven Feature Store
ML features derived from streaming events:
- UserSignedUp event → Kafka → Flink aggregates 7-day session counts → Redis feature store
- Model inference reads from Redis (low-latency, fresh features)
- Same events land in Delta Lake for batch training
This pattern avoids training/serving skew — both paths read from the same event source.
Saga Pattern for Distributed Transactions
When an operation spans multiple services (create order → reserve inventory → charge payment), a Saga coordinates via events rather than a distributed transaction:
- Each service publishes a success or failure event
- Compensating events roll back previous steps on failure
- The data platform observes all saga events for a complete audit trail
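A toy saga runner makes the event trail concrete. Step names mirror the order example above; the runner itself and its event naming are invented for illustration:

```python
def run_saga(steps):
    """Run steps in order. On failure, run compensations in reverse and
    return the emitted event trail — what the data platform observes."""
    events, done = [], []
    for name, action, compensate in steps:
        try:
            action()
            events.append(f"{name}.succeeded")
            done.append((name, compensate))
        except Exception:
            events.append(f"{name}.failed")
            for prev_name, prev_compensate in reversed(done):
                prev_compensate()
                events.append(f"{prev_name}.compensated")
            break
    return events

def charge_payment():
    raise RuntimeError("card declined")

trail = run_saga([
    ("create_order",      lambda: None,   lambda: None),
    ("reserve_inventory", lambda: None,   lambda: None),
    ("charge_payment",    charge_payment, lambda: None),
])
```

The trail contains the two successes, the failure, and the compensations in reverse order — exactly the audit record the data platform can consume from the event stream.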
Trade-Offs to Be Honest About
Event-driven is not universally better. The operational complexity is real.
| Consideration | Event-Driven | Batch |
|---|---|---|
| Latency | Seconds | Minutes-hours |
| Complexity | High | Low |
| Debugging | Hard (distributed) | Easier (logs) |
| Ordering guarantees | Per-partition only | Natural |
| Exactly-once | Configurable, hard | Easier |
| Schema changes | Requires coordination | Column add is easy |
| Team skill required | Kafka expertise | SQL/Python |
| Cost | Kafka cluster overhead | Compute on schedule |
Start with batch if your team is small and your business doesn't need real-time. Adding event-driven complexity for a batch use case is a form of over-engineering. See Streaming vs Batch Processing for the decision framework.
When to Use Event-Driven Architecture for Data
Strong signals:
- You need sub-minute data freshness in dashboards or operational systems
- Multiple systems need the same events independently (no tight coupling)
- You're building ML features that need fresh behavioral signals
- Audit trail and replay capability are requirements
Weak signals (consider batch instead):
- Daily reporting is sufficient
- Single downstream consumer
- Team has no Kafka experience
- Data volumes are small enough that scheduled queries work fine
Wrapping Up
Event-driven data architecture is powerful when freshness matters and multiple consumers need the same data stream. Event Sourcing gives you a complete, replayable history. CQRS separates write optimization from read optimization. Kafka provides the durable, scalable backbone.
The pattern has real costs: Kafka clusters, schema management, exactly-once semantics, and distributed debugging. Adopt it where the business genuinely needs real-time, not because it sounds modern.
Next step: Identify one batch pipeline where the 24-hour delay causes a business problem. That's your first candidate for an event-driven migration.
Continue Reading
Data Deduplication Strategies: Hash, Fuzzy, and Record Linkage
Airflow vs Dagster vs Prefect: An Honest Comparison
An unbiased comparison of Airflow, Dagster, and Prefect — covering architecture, DX, observability, and real trade-offs to help you pick the right orchestrator.
Change Data Capture Explained
A practical guide to CDC patterns — log-based, trigger-based, and polling — with Debezium configuration examples and Kafka Connect integration.