Event-Driven Data Architecture with Kafka and CQRS
Batch jobs run at midnight and produce data that's 12 hours stale by the time anyone reads it. Meanwhile, your customers are making decisions — buying, churning, clicking — in real time. Event-driven data architecture is the pattern that bridges this gap by making data flow as events happen, not on a schedule.
What Is Event-Driven Data Architecture?
In an event-driven architecture, every meaningful state change in your system is recorded as an immutable event: OrderPlaced, UserSignedUp, PaymentFailed, InventoryUpdated. These events flow through a message broker (most commonly Apache Kafka) and are consumed by multiple downstream systems independently.
For data platforms specifically, this means:
- Your warehouse can consume events and update tables in near-real-time
- ML models can react to events without polling a database
- Operational systems and analytics systems share the same event stream
- Historical state can be reconstructed by replaying the event log
```
             Application Services
                      │  emits events
                      ▼
             Apache Kafka (Event Log)
                      │
      ┌───────────────┼───────────────┬───────────────┐
      ▼               ▼               ▼               ▼
  Warehouse      ML Feature        Search        Notification
  (Flink/          Store           Index           Service
   Spark)      (Redis/Feast)       (ES)          (SendGrid)
```
Core Concepts
Event Sourcing
Event Sourcing means your application's state is derived from an append-only log of events — not from a mutable database table. Instead of updating a row (UPDATE orders SET status = 'shipped'), you append an event (OrderShipped { order_id, shipped_at, carrier }).
The current state of any entity is computed by replaying all events for that entity.
Benefits for data teams:
- Perfect audit trail — every state change is recorded with timestamp and actor
- Time travel is natural — replay events up to any point in time
- Schema evolution is explicit — new event types don't break old consumers
- Rebuilding derived tables from scratch is always possible
The cost: Replaying millions of events to get current state is slow. You need snapshots (periodic state captures) to make reads practical.
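The replay-plus-snapshot idea can be sketched in plain Python. This is a minimal illustration with hypothetical event names and an `OrderState` class invented for the example, not a library API:

```python
from dataclasses import dataclass, field

@dataclass
class OrderState:
    status: str = "new"
    items: list = field(default_factory=list)

def apply_event(state: OrderState, event: tuple) -> OrderState:
    """Fold one (event_type, payload) pair into the current state."""
    event_type, payload = event
    if event_type == "ItemAdded":
        state.items.append(payload["sku"])
    elif event_type == "OrderShipped":
        state.status = "shipped"
    return state

def rebuild(events, snapshot=None, snapshot_offset=0):
    """Replay from a snapshot (if given) instead of from offset 0."""
    state = snapshot or OrderState()
    for event in events[snapshot_offset:]:
        state = apply_event(state, event)
    return state

log = [
    ("ItemAdded", {"sku": "WIDGET-42"}),
    ("ItemAdded", {"sku": "WIDGET-7"}),
    ("OrderShipped", {"carrier": "DHL"}),
]

full = rebuild(log)                                     # replay everything
snap = rebuild(log[:2])                                 # snapshot after 2 events
fast = rebuild(log, snapshot=snap, snapshot_offset=2)   # snapshot + tail only
```

The snapshot path reads only the events after the snapshot offset, which is what makes reads practical once the log grows to millions of events.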
CQRS — Command Query Responsibility Segregation
CQRS separates the write path (commands that change state, emitting events) from the read path (queries that read from pre-built read models).
```
    Command Side                       Query Side
      (Write)                            (Read)

    User Action                        User Query
         │                                  │
         ▼                                  ▼
  Command Handler                  Read Model / View
         │                        (denormalized table,
         │ emits event             optimized for query)
         ▼                                  ▲
 Event Store (Kafka)                        │
         │                                  │
         └──────────── Projector ───────────┘
                 (builds and updates
               read models from events)
```
For data platforms, CQRS maps naturally to the Medallion Architecture:
- Bronze layer = raw event log (event store)
- Silver layer = cleansed, joined projections (read models)
- Gold layer = business-ready aggregations (denormalized views)
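A projector is just a fold from events into a denormalized view. A minimal in-memory sketch, using a dict as the read model and the `order.placed` event type from later in this article (the aggregation itself is invented for illustration):

```python
def project(events):
    """Fold order events into a per-customer read model — the kind of
    denormalized view the query side serves directly."""
    view = {}  # customer_id -> {"order_count": ..., "lifetime_cents": ...}
    for event in events:
        if event["event_type"] == "order.placed":
            row = view.setdefault(
                event["customer_id"],
                {"order_count": 0, "lifetime_cents": 0},
            )
            row["order_count"] += 1
            row["lifetime_cents"] += event["total_cents"]
    return view

events = [
    {"event_type": "order.placed", "customer_id": "cust_1", "total_cents": 3998},
    {"event_type": "order.placed", "customer_id": "cust_1", "total_cents": 1999},
    {"event_type": "order.placed", "customer_id": "cust_2", "total_cents": 500},
]
read_model = project(events)
```

Because the projector is a pure function of the event log, dropping and rebuilding the read model is always an option — the same property that lets you rebuild Silver tables from Bronze.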
Apache Kafka as the Backbone
Kafka is the most common event broker for data platforms at scale. Key properties that make it useful:
| Property | What It Means for Data Teams |
|---|---|
| Persistent log | Consumers can replay from any offset, enabling backfill |
| Consumer groups | Multiple teams can consume the same event independently |
| Partitioning | Parallelism by key (e.g., user_id) for ordered processing |
| Retention | Configure retention by time or size — days to forever |
| Schema Registry | Enforce Avro/Protobuf schemas on producers |
| Exactly-once semantics | Configurable guarantees for critical financial pipelines |
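Per-key ordering comes from the partitioner: every message with a given key lands on the same partition. A simplified sketch of the idea — Kafka's default partitioner actually uses murmur2, so real assignments differ; CRC32 stands in here for determinism:

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in for Kafka's default partitioner: a stable hash of the
    # key modulo the partition count. Same key -> same partition, so
    # all events for one customer stay mutually ordered.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

p1 = partition_for("cust_xyz789", 6)
p2 = partition_for("cust_xyz789", 6)  # always equals p1
```

The flip side is the "per-partition only" ordering caveat in the trade-offs table below: events with different keys can be consumed in any relative order.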
Building an Event-Driven Data Pipeline
1. Event Schema Design
Design events with data consumers in mind. Bad event schemas are the #1 source of pain in event-driven systems.
```
// Good: self-describing, versioned, no domain-specific abbreviations
{
  "event_type": "order.placed",
  "event_id": "evt_01HXK3M7N8P2Q4R5T6V7W8X9Y0",
  "schema_version": "2.1",
  "occurred_at": "2026-04-03T09:15:00Z",
  "producer": "order-service",
  "payload": {
    "order_id": "ord_abc123",
    "customer_id": "cust_xyz789",
    "items": [
      {"sku": "WIDGET-42", "quantity": 2, "unit_price_cents": 1999}
    ],
    "total_cents": 3998,
    "currency": "EUR"
  }
}

// Bad: abbreviated, no version, no schema info
{
  "t": "ord_plcd",
  "oid": "abc123",
  "cid": "xyz",
  "amt": 39.98
}
```
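A cheap consumer-side guard is to reject events missing the envelope fields from the "good" example above. This is a sketch, not a substitute for Schema Registry enforcement; the function name is invented:

```python
# Envelope fields every event must carry (from the example schema above).
REQUIRED_ENVELOPE = {
    "event_type", "event_id", "schema_version",
    "occurred_at", "producer", "payload",
}

def valid_envelope(event: dict) -> bool:
    """True if the event carries every required envelope field."""
    return REQUIRED_ENVELOPE.issubset(event)

good = {
    "event_type": "order.placed",
    "event_id": "evt_01HXK3M7N8P2Q4R5T6V7W8X9Y0",
    "schema_version": "2.1",
    "occurred_at": "2026-04-03T09:15:00Z",
    "producer": "order-service",
    "payload": {},
}
bad = {"t": "ord_plcd", "oid": "abc123"}
```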
2. Kafka Producer (Python)
```python
# Python — Kafka producer with Avro schema
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField
from datetime import datetime, timezone

KAFKA_BOOTSTRAP = "kafka:9092"
SCHEMA_REGISTRY_URL = "http://schema-registry:8081"

ORDER_PLACED_SCHEMA = '''
{
  "type": "record",
  "name": "OrderPlaced",
  "namespace": "com.harbinger.orders",
  "fields": [
    {"name": "event_id", "type": "string"},
    {"name": "order_id", "type": "string"},
    {"name": "customer_id", "type": "string"},
    {"name": "total_cents", "type": "long"},
    {"name": "occurred_at", "type": "string"}
  ]
}
'''

schema_registry = SchemaRegistryClient({"url": SCHEMA_REGISTRY_URL})
avro_serializer = AvroSerializer(schema_registry, ORDER_PLACED_SCHEMA)
producer = Producer({"bootstrap.servers": KAFKA_BOOTSTRAP})

def publish_order_placed(order_id: str, customer_id: str, total_cents: int):
    now = datetime.now(timezone.utc)
    event = {
        "event_id": f"evt_{order_id}_{int(now.timestamp())}",
        "order_id": order_id,
        "customer_id": customer_id,
        "total_cents": total_cents,
        "occurred_at": now.isoformat(),
    }
    producer.produce(
        topic="orders.placed",
        key=customer_id,  # partition by customer for ordering
        value=avro_serializer(
            event, SerializationContext("orders.placed", MessageField.VALUE)
        ),
    )
    producer.flush()  # per-call flush keeps the example simple; batch in production
```
3. Stream Processing with Flink / Spark Structured Streaming
```python
# PySpark Structured Streaming — consume Kafka, write to Delta Lake
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, to_timestamp
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = (
    SparkSession.builder
    .appName("OrderEventsConsumer")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .getOrCreate()
)

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("order_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("total_cents", LongType()),
    StructField("occurred_at", StringType()),
])

raw_stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "orders.placed")
    .option("startingOffsets", "latest")
    .load()
)

parsed = (
    raw_stream
    .select(from_json(col("value").cast("string"), event_schema).alias("data"))
    .select("data.*")
    .withColumn("occurred_at", to_timestamp("occurred_at"))
)

query = (
    parsed.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "s3://checkpoints/orders-placed/")
    .start("s3://lakehouse/silver/orders_placed/")
)
query.awaitTermination()
```
Event-Driven Patterns for Data Platforms
Outbox Pattern
The Outbox pattern solves a common reliability problem: you write to your database and want to publish an event to Kafka atomically. If you write to both independently, one can succeed while the other fails.
Solution: Write the event to an outbox table in the same database transaction as your business write. A separate process (typically Debezium, via Change Data Capture) tails the outbox table and publishes to Kafka.
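The key is the single transaction. A runnable sketch with sqlite3 standing in for the service database and a polling relay standing in for Debezium (table and function names are invented for the example):

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         topic TEXT, payload TEXT, published INTEGER DEFAULT 0);
""")

def place_order(order_id: str):
    # Business write and event write commit or roll back together.
    with db:
        db.execute("INSERT INTO orders VALUES (?, 'placed')", (order_id,))
        db.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders.placed", json.dumps({"order_id": order_id})),
        )

def relay(publish):
    # Stand-in for Debezium/CDC: drain unpublished rows to the broker.
    rows = db.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()

sent = []
place_order("ord_abc123")
relay(lambda topic, payload: sent.append((topic, payload)))
```

If the business insert fails, no outbox row is written and nothing is published; if the relay crashes, the unpublished rows are still there on restart.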
Event-Driven Feature Store
ML features derived from streaming events:
- UserSignedUp event → Kafka → Flink aggregates 7-day session counts → Redis feature store
- Model inference reads from Redis (low-latency, fresh features)
- Same events land in Delta Lake for batch training
This pattern avoids training/serving skew — both paths read from the same event source.
Saga Pattern for Distributed Transactions
When an operation spans multiple services (create order → reserve inventory → charge payment), a Saga coordinates via events rather than a distributed transaction:
- Each service publishes a success or failure event
- Compensating events roll back previous steps on failure
- The data platform observes all saga events for a complete audit trail
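A toy saga runner makes the event trail concrete. Step names mirror the order example above; the runner itself and its event naming are invented for illustration:

```python
def run_saga(steps):
    """Run steps in order. On failure, run compensations in reverse and
    return the emitted event trail — what the data platform observes."""
    events, done = [], []
    for name, action, compensate in steps:
        try:
            action()
            events.append(f"{name}.succeeded")
            done.append((name, compensate))
        except Exception:
            events.append(f"{name}.failed")
            for prev_name, prev_compensate in reversed(done):
                prev_compensate()
                events.append(f"{prev_name}.compensated")
            break
    return events

def charge_payment():
    raise RuntimeError("card declined")

trail = run_saga([
    ("create_order",      lambda: None,   lambda: None),
    ("reserve_inventory", lambda: None,   lambda: None),
    ("charge_payment",    charge_payment, lambda: None),
])
```

The trail contains the two successes, the failure, and the compensations in reverse order — exactly the audit record the data platform can consume from the event stream.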
Trade-Offs to Be Honest About
Event-driven is not universally better. The operational complexity is real.
| Consideration | Event-Driven | Batch |
|---|---|---|
| Latency | Seconds | Minutes-hours |
| Complexity | High | Low |
| Debugging | Hard (distributed) | Easier (logs) |
| Ordering guarantees | Per-partition only | Natural |
| Exactly-once | Configurable, hard | Easier |
| Schema changes | Requires coordination | Column add is easy |
| Team skill required | Kafka expertise | SQL/Python |
| Cost | Kafka cluster overhead | Compute on schedule |
Start with batch if your team is small and your business doesn't need real-time. Adding event-driven complexity for a batch use case is a form of over-engineering. See Streaming vs Batch Processing for the decision framework.
When to Use Event-Driven Architecture for Data
Strong signals:
- You need sub-minute data freshness in dashboards or operational systems
- Multiple systems need the same events independently (no tight coupling)
- You're building ML features that need fresh behavioral signals
- Audit trail and replay capability are requirements
Weak signals (consider batch instead):
- Daily reporting is sufficient
- Single downstream consumer
- Team has no Kafka experience
- Data volumes are small enough that scheduled queries work fine
Wrapping Up
Event-driven data architecture is powerful when freshness matters and multiple consumers need the same data stream. Event Sourcing gives you a complete, replayable history. CQRS separates write optimization from read optimization. Kafka provides the durable, scalable backbone.
The pattern has real costs: Kafka clusters, schema management, exactly-once semantics, and distributed debugging. Adopt it where the business genuinely needs real-time, not because it sounds modern.
Next step: Identify one batch pipeline where the 24-hour delay causes a business problem. That's your first candidate for an event-driven migration.
Continue Reading
Data Deduplication Strategies: Hash, Fuzzy, and Record Linkage
Airflow vs Dagster vs Prefect: An Honest Comparison
An unbiased comparison of Airflow, Dagster, and Prefect — covering architecture, DX, observability, and real trade-offs to help you pick the right orchestrator.
Change Data Capture Explained
A practical guide to CDC patterns — log-based, trigger-based, and polling — with Debezium configuration examples and Kafka Connect integration.