Databricks Autoloader: The Complete Guide
Ingesting files from S3, ADLS, or GCS sounds simple until you're dealing with schema drift, duplicate processing, and terabytes landing every hour. Databricks Autoloader is the answer most teams reach for — but its defaults hide some sharp edges worth knowing before you go to production.
What Is Databricks Autoloader?
Databricks Autoloader (cloudFiles) is a structured streaming source that incrementally ingests new files from cloud object storage. Instead of scanning the entire storage prefix on every run, it tracks which files have been processed using either file notification (event-driven, via cloud messaging) or directory listing (periodic scan). For most production use cases, file notification mode is faster and cheaper.
# PySpark — Autoloader with file notification (ADLS Gen2 example)
df = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/checkpoints/landing_schema")
        .option("cloudFiles.useNotifications", "true")  # file notification mode
        .load("abfss://raw@storageaccount.dfs.core.windows.net/events/")
)

(
    df.writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/checkpoints/landing_checkpoint")
        .option("mergeSchema", "true")
        .outputMode("append")
        .table("bronze.raw_events")
)
The schemaLocation is not optional in practice — it's where Autoloader persists the inferred schema between runs. Without it, every restart re-infers from scratch.
Schema Inference and Evolution
Autoloader can infer schema from your files automatically. For JSON and CSV it samples the first files it discovers (by default up to 50 GB or 1,000 files, whichever limit is hit first) and writes the inferred schema to the schemaLocation. On subsequent runs it reads from there instead of re-inferring. Note that for JSON and CSV every column is inferred as a string unless cloudFiles.inferColumnTypes is set to true.
# Schema inference options
.option("cloudFiles.inferColumnTypes", "true") # infer int/long vs string
.option("cloudFiles.schemaEvolutionMode", "addNewColumns") # | rescue | failOnNewColumns | none
Schema evolution modes — when to use which:
| Mode | Behavior | Best for |
|---|---|---|
| addNewColumns | Stream stops on a new column, records it in the schema, and picks it up on restart; older rows read as null | Append-heavy pipelines |
| rescue | Schema is frozen; unknown columns land in the _rescued_data JSON column | When schema changes are expected but data must not be lost |
| failOnNewColumns | Stream fails on any new column and stays failed until the schema is updated | Strict bronze → silver SLAs |
| none | No evolution; unknown columns are dropped silently | Known, stable schemas |
For most bronze ingestion layers, rescue is the right choice (note that addNewColumns, not rescue, is Autoloader's actual default): you never lose data, and you can inspect _rescued_data in silver to handle edge cases.
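Wiring rescue mode in is a one-option change on the reader. A minimal sketch, reusing the same hypothetical paths as the earlier example:

```python
# PySpark sketch — rescue-mode reader (paths are illustrative)
df = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/checkpoints/landing_schema")
        .option("cloudFiles.schemaEvolutionMode", "rescue")  # unknown fields go to _rescued_data
        .load("abfss://raw@storageaccount.dfs.core.windows.net/events/")
)
```

This is a pipeline configuration sketch, not a full job; the schema stays frozen while mismatched fields accumulate in _rescued_data.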
Rescue Data: Your Safety Net
The _rescued_data column is Autoloader's most underappreciated feature. Any field that doesn't match the current schema lands here as a JSON string — including malformed values, unexpected types, and genuinely new fields.
-- Spark SQL — inspect rescued data in bronze table
SELECT
    event_id,
    event_time,
    _rescued_data,
    _metadata.file_path AS source_file,
    _metadata.file_modification_time AS file_ts
FROM bronze.raw_events
WHERE _rescued_data IS NOT NULL
LIMIT 100;
The _metadata column is not Autoloader-specific: it's a hidden column Spark exposes on file-based reads, and it describes the files behind whatever you query. Queried on the bronze Delta table, it reflects Delta's own data files, so to keep the original landing file's path, modification time, and size you need to select _metadata during ingestion and persist it as ordinary columns; captured that way, it is invaluable for debugging and auditing.
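Capturing the source-file details at ingestion time can be sketched like this (paths and column aliases are illustrative):

```python
# PySpark sketch — persist source-file metadata into the bronze table
bronze_df = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/checkpoints/landing_schema")
        .load("abfss://raw@storageaccount.dfs.core.windows.net/events/")
        .selectExpr(
            "*",
            "_metadata.file_path AS source_file",           # original object path
            "_metadata.file_modification_time AS file_ts",  # landing timestamp
        )
)
```

Once written out, source_file and file_ts are plain columns, so they survive compaction and are queryable without any _metadata knowledge downstream.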
# Promote rescued fields to proper columns in silver
from pyspark.sql.functions import col, get_json_object

silver_df = (
    spark.table("bronze.raw_events")
        .withColumn("new_field", get_json_object(col("_rescued_data"), "$.new_field").cast("string"))
        .filter(col("event_id").isNotNull())
)
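If you want to sanity-check that JSON-path extraction logic outside Spark, the same top-level lookup can be sketched in plain Python (the function and field names here are illustrative, not a Spark API):

```python
import json

def json_path_str(rescued_json, field):
    """Pull a top-level field out of a _rescued_data-style JSON string.

    Mirrors what get_json_object(col, "$.field") does for simple
    top-level lookups; returns None when absent or unparseable.
    """
    if rescued_json is None:
        return None
    try:
        value = json.loads(rescued_json).get(field)
    except (ValueError, TypeError):
        return None
    return None if value is None else str(value)

# Hypothetical rescued payloads
print(json_path_str('{"new_field": "beta"}', "new_field"))  # beta
print(json_path_str(None, "new_field"))                     # None
```

The None-on-failure behavior matters: get_json_object is also null-safe, so the silver promotion above never throws on malformed rescue payloads.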
File Notification vs Directory Listing
The choice matters for latency and cost:
| Dimension | File Notification | Directory Listing |
|---|---|---|
| Latency | Seconds (event-driven) | Minutes (scan interval) |
| Cloud cost | Messaging service charges | Storage LIST API calls |
| Setup | Requires cloud messaging (SQS/Event Grid/Pub/Sub) | Zero setup |
| Scale | Handles millions of files efficiently | Can be slow at large scale |
| Reliability | Depends on cloud messaging SLA | Simple, no external dependency |
For dev/test: directory listing. For production with high file volumes: file notification. You can switch discovery modes across restarts; the checkpoint's file-tracking state carries over, so already-ingested files aren't reprocessed.
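The two modes are selected with a single option. A side-by-side sketch, with hypothetical paths; cloudFiles.backfillInterval is an option worth verifying against your runtime version:

```python
# Directory listing (the default) — no cloud messaging setup needed
listing = (
    spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/checkpoints/landing_schema")
        # useNotifications defaults to false, i.e. directory listing
        .load("s3://bucket/raw/events/2024/")  # scope the prefix tightly
)

# File notification — Autoloader subscribes to SQS / Event Grid / Pub/Sub
notify = (
    spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/checkpoints/landing_schema")
        .option("cloudFiles.useNotifications", "true")
        .option("cloudFiles.backfillInterval", "1 day")  # periodic listing to catch missed events
        .load("s3://bucket/raw/events/2024/")
)
```

The backfill interval is the usual belt-and-braces answer to the messaging-SLA concern in the table: even if an event is dropped, the periodic listing picks the file up later.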
Common Mistakes and Pitfalls
1. Forgetting schemaLocation on the first run
Autoloader won't enforce schema evolution without a schema location. The stream will work but re-infer every restart.
2. Using the same schemaLocation for multiple streams
Each Autoloader stream needs its own schemaLocation. Sharing causes schema corruption.
3. Not handling _rescued_data in silver
If you drop this column in silver without processing it, you silently lose data on every schema mismatch.
4. Directory listing on deep prefix hierarchies
Listing s3://bucket/ recursively with millions of files causes expensive, slow LIST API calls. Always scope the path to the most specific prefix you need.
5. mergeSchema without monitoring
Setting mergeSchema: true without monitoring _rescued_data means your silver tables silently accumulate schema drift.
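Pitfalls 3 and 5 both come down to watching the rescue column, and the check itself is trivial. Sketched here in plain Python over a collected sample (the rows, column name handling, and threshold are illustrative):

```python
def rescued_rate(rows, column="_rescued_data"):
    """Fraction of rows whose rescue column is non-null."""
    if not rows:
        return 0.0
    hits = sum(1 for r in rows if r.get(column) is not None)
    return hits / len(rows)

# Hypothetical sample collected from the bronze table
batch = [
    {"event_id": 1, "_rescued_data": None},
    {"event_id": 2, "_rescued_data": '{"new_field": "x"}'},
    {"event_id": 3, "_rescued_data": None},
    {"event_id": 4, "_rescued_data": None},
]

rate = rescued_rate(batch)
assert rate <= 0.30, f"rescued-data rate {rate:.0%} above alert threshold"
```

In production the same ratio would come from a scheduled COUNT query against bronze, feeding whatever alerting system you already run.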
Production Setup Checklist
✅ schemaLocation defined and backed up
✅ checkpointLocation on durable storage (not ephemeral)
✅ cloudFiles.schemaEvolutionMode explicitly set
✅ _rescued_data column monitored with alerts
✅ _metadata columns available for lineage
✅ File notification configured (production)
✅ mergeSchema: true for Delta sink
✅ Separate stream per data source (no shared schema locations)
When Autoloader Is Not the Right Tool
Autoloader shines for file-based ingestion from object storage. It's not a good fit for:
- API polling — use structured streaming with a custom source or Kafka
- Database CDC — use Debezium + Kafka + Delta Live Tables
- Real-time sub-second latency — event hub / Kafka is faster and more reliable
- Fixed-schema, high-volume, no evolution — a simple COPY INTO might be simpler and cheaper
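For the last case, the COPY INTO alternative is a one-statement batch load that can be issued from PySpark (table and path are the same hypothetical ones used earlier):

```python
# PySpark sketch — batch ingestion without streaming machinery
spark.sql("""
    COPY INTO bronze.raw_events
    FROM 'abfss://raw@storageaccount.dfs.core.windows.net/events/'
    FILEFORMAT = JSON
""")
```

COPY INTO tracks which files it has already loaded, so rerunning the statement is idempotent; what you give up is streaming semantics and Autoloader's schema evolution machinery.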
If you're pulling data from REST APIs rather than files, pairing Databricks Asset Bundles with scheduled jobs is often a cleaner pattern. For the silver-to-gold transformation layer, see our guide on Delta table optimization.
Harbinger Explorer for Autoloader Debugging
Once your bronze data lands in Delta, you still need to inspect it — especially _rescued_data and _metadata columns. If you don't have a Databricks SQL endpoint readily available, Harbinger Explorer lets you upload a Delta export or a sample CSV/JSON from your landing zone and run SQL directly in your browser via DuckDB WASM. It's useful for quick schema validation and rescued-data inspection without spinning up a warehouse.
Key Takeaways
Autoloader is one of the most production-ready incremental ingestion tools in the Databricks ecosystem. The three things to get right from the start: always set a schemaLocation, choose rescue as your evolution mode for bronze, and monitor _rescued_data religiously in silver. The rest is tuning.
Next: pair Autoloader with Databricks Streaming Tables for a fully declarative end-to-end pipeline.
Continue Reading
CI/CD Pipelines for Databricks Projects: A Production-Ready Guide
Build a robust CI/CD pipeline for your Databricks projects using GitHub Actions, Databricks Asset Bundles, and automated testing. Covers branching strategy, testing, and deployment.
Databricks Cluster Policies for Cost Control: A Practical Guide
Learn how to use Databricks cluster policies to enforce cost guardrails, standardize cluster configurations, and prevent cloud bill surprises without blocking your team's productivity.
Databricks Asset Bundles (DABs): The Complete Deployment Guide
A comprehensive guide to Databricks Asset Bundles (DABs) — define, test, and deploy Databricks resources as code with CI/CD pipelines, multi-environment support, and GitOps best practices.