
Databricks Autoloader: The Complete Guide

8 min read · Tags: databricks, autoloader, cloud storage, schema inference, structured streaming, bronze layer, delta lake

Ingesting files from S3, ADLS, or GCS sounds simple until you're dealing with schema drift, duplicate processing, and terabytes landing every hour. Databricks Autoloader is the answer most teams reach for — but its defaults hide some sharp edges worth knowing before you go to production.

What Is Databricks Autoloader?

Databricks Autoloader (cloudFiles) is a structured streaming source that incrementally ingests new files from cloud object storage. Instead of scanning the entire storage prefix on every run, it tracks which files have already been processed, discovering new ones either by file notification (event-driven, via cloud messaging) or by directory listing (periodic scan). For most production use cases with high file volumes, file notification mode has lower latency and lower cloud cost.

# PySpark — Autoloader with file notification (ADLS Gen2 example)
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/landing_schema")
    .option("cloudFiles.useNotifications", "true")   # file notification mode
    .load("abfss://raw@storageaccount.dfs.core.windows.net/events/")
    .select("*", "_metadata")   # persist source file metadata into bronze
)

(
    df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/landing_checkpoint")
    .option("mergeSchema", "true")
    .outputMode("append")
    .table("bronze.raw_events")
)

The schemaLocation is not optional in practice: it's where Autoloader persists the inferred schema and its evolution history between runs. Without it, schema evolution can't be tracked, and every restart re-infers the schema from scratch.
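If some column types are known up front, the cloudFiles.schemaHints option lets you pin those while inference handles the rest. A small sketch (the column names and types here are illustrative, not from any particular dataset):

```python
# Pin types for known columns; inference fills in everything else.
# Column names below are illustrative.
.option("cloudFiles.schemaHints", "event_time TIMESTamp, amount DECIMAL(18, 2)".upper() and "event_time TIMESTAMP, amount DECIMAL(18, 2)")
```

Hints use SQL DDL syntax, so anything you could write in a CREATE TABLE column list works here.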

Schema Inference and Evolution

Autoloader can infer schema from your files automatically. For JSON and CSV, it samples an initial batch of input (by default, up to the first 50 GB or 1,000 files discovered) and writes the inferred schema to the schemaLocation; subsequent runs read from there instead of re-inferring. Note that by default all inferred columns are typed as strings — set cloudFiles.inferColumnTypes to true to get typed columns.

# Schema inference options
.option("cloudFiles.inferColumnTypes", "true")   # typed columns instead of all strings
.option("cloudFiles.schemaEvolutionMode", "addNewColumns")  # addNewColumns | rescue | failOnNewColumns | none

Schema evolution modes — when to use which:

| Mode | Behavior | Best for |
| --- | --- | --- |
| addNewColumns | New columns are added to the schema; existing rows get nulls | Append-heavy pipelines |
| rescue | Unknown columns are written to the _rescued_data JSON column | When schema changes are expected but data must not be lost |
| failOnNewColumns | Stream fails on any new column | Strict bronze → silver SLAs |
| none | No evolution; unknown columns are dropped silently | Known, stable schemas |

For most bronze ingestion layers, rescue is the right default — you never lose data, and you can inspect _rescued_data in silver to handle edge cases.

Rescue Data: Your Safety Net

The _rescued_data column is Autoloader's most underappreciated feature. Any field that doesn't match the current schema lands here as a JSON string — including malformed values, unexpected types, and genuinely new fields.

-- Spark SQL — inspect rescued data in bronze table
-- (assumes _metadata was selected during the Autoloader read and persisted)
SELECT
    event_id,
    event_time,
    _rescued_data,
    _metadata.file_path AS source_file,
    _metadata.file_modification_time AS file_ts
FROM bronze.raw_events
WHERE _rescued_data IS NOT NULL
LIMIT 100;

The _metadata column gives you the source file path, modification time, and size — invaluable for debugging and auditing. One caveat: Spark exposes _metadata on the file-based read, not on the resulting table, so select it during the Autoloader read and persist it to bronze. Querying a Delta table's hidden _metadata column instead describes Delta's own Parquet files, not the original source files.

# Promote rescued fields to proper columns in silver
from pyspark.sql.functions import col, get_json_object

silver_df = (
    spark.table("bronze.raw_events")
    .withColumn("new_field", get_json_object(col("_rescued_data"), "$.new_field").cast("string"))
    .filter(col("event_id").isNotNull())
)

File Notification vs Directory Listing

The choice matters for latency and cost:

| Dimension | File Notification | Directory Listing |
| --- | --- | --- |
| Latency | Seconds (event-driven) | Minutes (scan interval) |
| Cloud cost | Messaging service charges | Storage LIST API calls |
| Setup | Requires cloud messaging (SQS / Event Grid / Pub/Sub) | Zero setup |
| Scale | Handles millions of files efficiently | Can be slow at large scale |
| Reliability | Depends on cloud messaging SLA | Simple, no external dependency |

For dev/test: directory listing. For production with high file volumes: file notification. You can switch modes between restarts; the checkpoint tracks which files were processed independently of how they were discovered.
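Configuring each mode is a matter of one or two options. A sketch of the relevant lines (the interval value is a plausible choice, not a recommended default):

```python
# Directory listing (dev/test): no cloud messaging setup required
.option("cloudFiles.useNotifications", "false")

# In notification mode (production), a periodic backfill listing guards
# against events lost by the messaging service
.option("cloudFiles.useNotifications", "true")
.option("cloudFiles.backfillInterval", "1 day")
```

The backfill interval trades extra LIST calls for a hard upper bound on how long a dropped notification can delay a file.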

Common Mistakes and Pitfalls

1. Forgetting schemaLocation on the first run. Without a schema location, Autoloader cannot persist the inferred schema or track evolution; the stream may run, but it re-infers on every restart.

2. Using the same schemaLocation for multiple streams. Each Autoloader stream needs its own schemaLocation; sharing one corrupts the tracked schema.

3. Not handling _rescued_data in silver. If you drop this column in silver without processing it, you silently lose data on every schema mismatch.

4. Directory listing on deep prefix hierarchies. Listing s3://bucket/ recursively with millions of files causes expensive, slow LIST API calls. Always scope the path to the most specific prefix you need.

5. mergeSchema without monitoring. Setting mergeSchema: true without watching _rescued_data means your silver tables silently accumulate schema drift.
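The monitoring in points 3 and 5 doesn't need much machinery. Here is a minimal sketch of the alert logic in plain Python, operating on (event_id, _rescued_data) pairs sampled out of bronze — the function name and threshold are ours, not an Autoloader API:

```python
import json

def rescued_data_report(rows, alert_threshold=0.01):
    """Summarize schema drift from (event_id, _rescued_data) pairs.

    rows: iterable of tuples whose second element is either None
    (the row matched the schema) or a JSON string of rescued fields.
    """
    rows = list(rows)
    rescued = [raw for _, raw in rows if raw is not None]
    fields = set()
    for raw in rescued:
        fields.update(json.loads(raw).keys())
    fraction = len(rescued) / len(rows) if rows else 0.0
    return {
        "rescued_fraction": fraction,
        "drifting_fields": sorted(fields),
        "alert": fraction > alert_threshold,
    }
```

Wire the result into whatever alerting your jobs already use. The useful part is tracking which fields are drifting, not just how many rows were rescued.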

Production Setup Checklist

✅ schemaLocation defined and backed up
✅ checkpointLocation on durable storage (not ephemeral)
✅ cloudFiles.schemaEvolutionMode explicitly set
✅ _rescued_data column monitored with alerts
✅ _metadata columns available for lineage
✅ File notification configured (production)
✅ mergeSchema: true for Delta sink
✅ Separate stream per data source (no shared schema locations)
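Pulled together, the checklist translates into a read/write pair along these lines. This is a sketch with placeholder paths, storage accounts, and table names, not a drop-in pipeline:

```python
# Production Autoloader sketch covering the checklist above
bronze = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")              # file notification
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/events_schema")  # one per stream
    .option("cloudFiles.schemaEvolutionMode", "rescue")         # explicit, never implicit
    .option("cloudFiles.inferColumnTypes", "true")
    .load("abfss://raw@storageaccount.dfs.core.windows.net/events/")
    .select("*", "_metadata")                                   # lineage columns
)

(
    bronze.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/events_checkpoint")  # durable storage
    .option("mergeSchema", "true")
    .outputMode("append")
    .table("bronze.raw_events")
)
```

Everything an auditor would ask about — schema history, processed-file state, source lineage — lives in the two checkpoint paths and the _metadata columns.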

When Autoloader Is Not the Right Tool

Autoloader shines for file-based ingestion from object storage. It's not a good fit for:

  • API polling — use structured streaming with a custom source or Kafka
  • Database CDC — use Debezium + Kafka + Delta Live Tables
  • Real-time sub-second latency — an event hub or Kafka consumer gives lower end-to-end latency
  • Fixed-schema, high-volume, no evolution — a simple COPY INTO might be simpler and cheaper
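For that last case, the single-statement alternative looks like this (paths and table names are placeholders):

```sql
-- COPY INTO is idempotent: files already loaded are skipped on re-run
COPY INTO bronze.raw_events
FROM 'abfss://raw@storageaccount.dfs.core.windows.net/events/'
FILEFORMAT = JSON
FORMAT_OPTIONS ('inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
```

Scheduled from a job, this gets you incremental file ingestion with no streaming checkpoint to manage, at the cost of Autoloader's evolution modes and rescue column.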

If you're pulling data from REST APIs rather than files, Databricks Asset Bundles combined with scheduled jobs is often a cleaner pattern. For the silver-to-gold transformation layer, see our guide on Delta table optimization.

Harbinger Explorer for Autoloader Debugging

Once your bronze data lands in Delta, you still need to inspect it — especially _rescued_data and _metadata columns. If you don't have a Databricks SQL endpoint readily available, Harbinger Explorer lets you upload a Delta export or a sample CSV/JSON from your landing zone and run SQL directly in your browser via DuckDB WASM. It's useful for quick schema validation and rescued-data inspection without spinning up a warehouse.

Key Takeaways

Autoloader is one of the most production-ready incremental ingestion tools in the Databricks ecosystem. The three things to get right from the start: always set a schemaLocation, choose rescue as your evolution mode for bronze, and monitor _rescued_data religiously in silver. The rest is tuning.

Next: pair Autoloader with Databricks Streaming Tables for a fully declarative end-to-end pipeline.


Continue Reading


Continue Reading

Try Harbinger Explorer for free

Connect any API, upload files, and explore with AI — all in your browser. No credit card required.

Start Free Trial

Command Palette

Search for a command to run...