Harbinger Explorer

Databricks Streaming Tables: DLT vs Structured Streaming

9 min read·Tags: databricks, streaming tables, delta live tables, DLT, structured streaming, kafka, real-time

Most Databricks teams start with raw Structured Streaming — spark.readStream, writeStream, checkpoints. It works. Then Delta Live Tables arrives and suddenly you have two ways to do the same thing. Understanding when each is the right tool saves you from painful mid-project rewrites.

Two Ways to Build Streaming Pipelines

Databricks gives you two first-class options for streaming:

  1. Structured Streaming — the underlying Spark API, full control, maximum flexibility
  2. DLT Streaming Tables — a declarative abstraction on top of Structured Streaming, opinionated and managed

They're not mutually exclusive — DLT Streaming Tables are Structured Streaming under the hood. The question is how much of the plumbing you want to own.

Structured Streaming: Full Control

Structured Streaming processes data as a continuous sequence of micro-batches. You define a source, a transformation, and a sink. Databricks handles the execution.

# PySpark — Basic Structured Streaming pipeline
from pyspark.sql.functions import col, from_json

# Read from Event Hub (Kafka protocol)
# Note: Event Hubs' Kafka endpoint also requires SASL_SSL auth options
# (kafka.security.protocol, kafka.sasl.mechanism, kafka.sasl.jaas.config) — omitted here.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "eh-namespace.servicebus.windows.net:9093")
    .option("subscribe", "sensor-events")
    .option("startingOffsets", "latest")
    .load()
)

# Parse the JSON payload using a DDL schema string
parsed = (
    raw
    .selectExpr("CAST(value AS STRING) as json_payload", "timestamp as kafka_ts")
    .withColumn("data", from_json(col("json_payload"), "sensor_id STRING, temp DOUBLE, ts LONG"))
    .select("kafka_ts", "data.*")
)

# Write to Delta — toTable() starts the streaming query
query = (
    parsed.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/checkpoints/sensors_bronze")
    .trigger(processingTime="30 seconds")
    .toTable("bronze.sensor_events")
)

You own the checkpoint, the trigger interval, the retry logic, and the monitoring. Maximum control, maximum responsibility.
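That ownership is concrete: if a transient source failure kills the query, nothing restarts it unless you wrote the restart loop yourself. A minimal sketch of such a loop in plain Python — `run_with_restarts` and `start_query` are illustrative names, not a Databricks API; in practice `start_query` would wrap something like `writeStream...toTable(...).awaitTermination()`:

```python
import time

def run_with_restarts(start_query, max_restarts=3, backoff_s=1.0):
    """Restart a streaming query on failure, with exponential backoff.

    start_query: callable that starts the stream and blocks until it
    terminates; it should raise on failure. Hypothetical stand-in for a
    real query-launching closure.
    """
    attempts = 0
    while True:
        try:
            start_query()
            return attempts  # clean shutdown
        except Exception:
            attempts += 1
            if attempts > max_restarts:
                raise  # give up; surface the failure to your scheduler
            time.sleep(backoff_s * (2 ** (attempts - 1)))

# Example: a query that fails twice with transient errors, then succeeds
state = {"calls": 0}
def flaky_query():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("transient Kafka disconnect")

restarts = run_with_restarts(flaky_query, backoff_s=0.01)
print(restarts)  # → 2
```

DLT's managed retries replace exactly this kind of hand-rolled loop.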

DLT Streaming Tables: Declarative Pipelines

Delta Live Tables introduces a declarative SQL or Python API where you define what you want, not how to run it. DLT manages execution, retries, schema enforcement, and data quality.

# Python DLT — Streaming Table definition
import dlt
from pyspark.sql.functions import col, from_json

@dlt.table(
    name="sensor_events_bronze",
    comment="Raw sensor events from Event Hub",
    table_properties={"quality": "bronze"}
)
def sensor_events_bronze():
    return (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "eh-namespace.servicebus.windows.net:9093")
        .option("subscribe", "sensor-events")
        .option("startingOffsets", "latest")
        .load()
        .selectExpr("CAST(value AS STRING) as raw_payload", "timestamp as kafka_ts")
    )

@dlt.table(name="sensor_events_silver")
@dlt.expect_or_drop("valid_temp", "temp BETWEEN -50 AND 150")
def sensor_events_silver():
    return (
        dlt.read_stream("sensor_events_bronze")
        .withColumn("data", from_json(col("raw_payload"), "sensor_id STRING, temp DOUBLE"))
        .select("kafka_ts", "data.sensor_id", "data.temp")
    )

DLT handles checkpoints automatically. The @dlt.expect_or_drop decorator drops rows that fail the expectation; the related @dlt.expect only records violations in pipeline metrics, and @dlt.expect_or_fail halts the update entirely. No checkpoint management, no manual recovery.

Streaming Tables in SQL

DLT also supports SQL-first Streaming Tables, which became GA in late 2024:

-- DLT SQL — Streaming Table
CREATE OR REFRESH STREAMING TABLE sensor_events_silver
COMMENT 'Validated sensor readings'
AS SELECT
    kafka_ts,
    raw_payload:sensor_id::STRING AS sensor_id,
    raw_payload:temp::DOUBLE AS temp
FROM STREAM(LIVE.sensor_events_bronze)
WHERE raw_payload:temp::DOUBLE BETWEEN -50 AND 150;

The STREAM() wrapper tells DLT this is a streaming read. The LIVE. prefix references another DLT table in the same pipeline.

Side-by-Side Comparison

| Dimension | Structured Streaming | DLT Streaming Tables |
| --- | --- | --- |
| API style | Imperative (Python/Scala) | Declarative (Python or SQL) |
| Checkpoint management | Manual | Automatic |
| Data quality | Custom validation logic | Built-in expectations framework |
| Pipeline orchestration | External (Workflows, Airflow) | Built-in DLT pipeline |
| Debugging | Spark UI, logs | DLT pipeline UI with lineage |
| Schema enforcement | Manual mergeSchema | Automatic with evolution |
| Cost | Job cluster | DLT cluster (DBU surcharge) |
| Custom sources | Any Spark source | Any Spark source |
| Stateful operations | Full support | Supported, but complex |
| Multi-hop (bronze→silver→gold) | Manual chaining | Native table references |

DLT DBU Surcharge: The Real Cost

DLT pipelines run in one of three product editions — Core, Pro, or Advanced — each with a DBU multiplier on top of the base Databricks cost:

| Tier | DBU Multiplier | When to use |
| --- | --- | --- |
| Core | 0.2x extra | Non-critical pipelines, dev |
| Pro | 0.25x extra | Expectations, autoscaling |
| Advanced | 0.3x extra | Row-level security, enhanced monitoring |

(Last verified: April 2026 — [PRICING-CHECK])

For cost-sensitive pipelines with simple transformations, raw Structured Streaming on a standard job cluster can be significantly cheaper. Run the numbers before committing to DLT for high-volume, always-on pipelines.
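To make "run the numbers" concrete, here is a back-of-envelope comparison in plain Python. Every rate and cluster size below is an illustrative placeholder, not real Databricks pricing — substitute your own figures:

```python
# Back-of-envelope cost comparison: job cluster vs DLT pipeline.
# All rates below are illustrative placeholders, not real Databricks pricing.

def monthly_cost(dbu_per_hour, rate_per_dbu, hours_per_month, surcharge=0.0):
    """Monthly compute cost; surcharge is the extra DLT multiplier
    (e.g. 0.25 for a 0.25x uplift on top of the base DBU burn)."""
    return dbu_per_hour * (1 + surcharge) * rate_per_dbu * hours_per_month

DBU_PER_HOUR = 8        # placeholder cluster size
RATE = 0.30             # placeholder $/DBU
HOURS = 24 * 30         # always-on pipeline

jobs_cost = monthly_cost(DBU_PER_HOUR, RATE, HOURS)
dlt_cost = monthly_cost(DBU_PER_HOUR, RATE, HOURS, surcharge=0.25)

print(f"job cluster: ${jobs_cost:,.0f}/mo, DLT: ${dlt_cost:,.0f}/mo")
# With these placeholder inputs: $1,728 vs $2,160 — the surcharge compounds
# linearly with hours, so always-on pipelines feel it most
```

The multiplier looks small per hour; multiplied across an always-on month it is the difference that decides the architecture.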

Stateful Streaming: Watermarks and Windows

Both approaches support stateful aggregations — but you need to be explicit about watermarks or your state store will grow unbounded.

# PySpark — Windowed aggregation with watermark
from pyspark.sql.functions import avg, col, window

aggregated = (
    parsed
    # derive an event-time column (assumes `ts` holds epoch seconds)
    .withColumn("event_time", col("ts").cast("timestamp"))
    .withWatermark("event_time", "10 minutes")   # tolerate 10-min late data
    .groupBy(
        window(col("event_time"), "5 minutes"),  # 5-min tumbling window
        col("sensor_id")
    )
    .agg(avg("temp").alias("avg_temp"))
)

In DLT, stateful aggregations are supported but require more care — use dlt.read_stream() with explicit watermarks and avoid unbounded state by always specifying windows.
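The underlying mechanics are the same in both APIs. As a plain-Python illustration (not Spark code) of how a tumbling window and a watermark interact: events are bucketed by floor-division on their timestamp, and events older than the maximum seen time minus the watermark are dropped because their window's state has already been finalized:

```python
WINDOW_S = 300       # 5-minute tumbling windows
WATERMARK_S = 600    # tolerate 10 minutes of lateness

def window_start(event_ts):
    """Assign an event timestamp to its tumbling-window start."""
    return (event_ts // WINDOW_S) * WINDOW_S

def process(events):
    """Bucket (ts, value) events into windows, dropping late arrivals."""
    max_seen = 0
    windows, dropped = {}, []
    for ts, value in events:
        max_seen = max(max_seen, ts)
        if ts < max_seen - WATERMARK_S:
            dropped.append((ts, value))   # too late: state already evicted
            continue
        windows.setdefault(window_start(ts), []).append(value)
    return windows, dropped

events = [(1000, 20.0), (1100, 21.0), (1700, 22.0), (300, 19.0)]
windows, dropped = process(events)
# 1000 and 1100 land in window 900; 1700 in window 1500;
# 300 arrives after max_seen=1700 and 300 < 1700 - 600, so it is dropped
```

Without the watermark check, `windows` would accumulate a key for every window ever seen — exactly the unbounded-state failure mode described above.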

Common Pitfalls

1. Missing watermarks on stateful aggregations. Without a watermark, Spark holds all state forever; memory pressure and eventually OOM errors follow.

2. Wrong trigger interval. trigger(once=True) processes all available data in a single micro-batch and stops — after a long gap, that one batch can overwhelm the cluster. Use trigger(availableNow=True), the modern equivalent in Databricks Runtime 11.3+, which drains the backlog in multiple rate-limited batches.

3. DLT for one-off batch loads. DLT has startup overhead; for simple one-time ingestion, a regular notebook job is faster and cheaper.

4. Sharing checkpoints between runs. Never reuse a checkpoint directory for a different stream definition. The checkpoint encodes the query plan — mismatches cause failures or silent data loss.

5. Ignoring _rescued_data from Auto Loader sources. If you feed Databricks Auto Loader into a DLT pipeline, make sure your silver table handles the _rescued_data column explicitly — it captures fields that did not match the expected schema.
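A minimal sketch of that handling pattern in plain Python — the routing logic only, with hypothetical record dicts; in a real pipeline this would be a filter on the _rescued_data column inside your silver table definition:

```python
import json

def split_rescued(records):
    """Route records with a non-null _rescued_data field to quarantine.

    Mirrors the pattern of filtering on Auto Loader's _rescued_data column:
    clean rows flow to silver, rescued rows land somewhere inspectable.
    """
    clean, quarantined = [], []
    for rec in records:
        if rec.get("_rescued_data"):
            quarantined.append(rec)
        else:
            clean.append({k: v for k, v in rec.items() if k != "_rescued_data"})
    return clean, quarantined

records = [
    {"sensor_id": "a1", "temp": 21.5, "_rescued_data": None},
    {"sensor_id": "b2", "temp": 19.0,
     "_rescued_data": json.dumps({"humidty": 40})},  # typoed field was rescued
]
clean, quarantined = split_rescued(records)
# clean keeps the a1 row; the b2 row is quarantined for inspection
```

Silently dropping the column instead means schema drift disappears without a trace.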

When to Choose Which

Use DLT Streaming Tables when:

  • You're building a medallion architecture with multiple hops
  • Data quality expectations are a first-class requirement
  • Your team includes SQL-first engineers
  • You want built-in lineage and pipeline observability
  • You're already using DLT for batch tables

Use raw Structured Streaming when:

  • You need complex stateful logic (custom flatMapGroupsWithState)
  • Cost is a primary constraint and DLT DBU surcharge matters
  • You need non-Delta sinks (Cassandra, Elasticsearch, custom)
  • You're integrating with an existing Airflow or Databricks Workflows orchestration layer

Key Takeaways

DLT Streaming Tables are the right default for new medallion pipelines — they remove operational overhead and add data quality enforcement. Raw Structured Streaming remains the right choice when you need full control over stateful logic, custom sinks, or when DLT's cost overhead doesn't justify the convenience.

The two approaches aren't rivals — many production architectures use raw Structured Streaming for high-volume event ingestion at the edge and DLT for the bronze-to-gold transformation layer.

