
🚀 Databricks Auto Loader Tutorial

(with Schema Evolution Modes & File Detection Modes)
Auto Loader in Databricks is the recommended way to ingest files incrementally and reliably into the Lakehouse. This tutorial covers:
- What Auto Loader is and when to use it
- File detection modes (Directory Listing vs File Notification)
- Schema handling (Schema Location, Schema Hints)
- Schema evolution modes (addNewColumns, rescue, none, failOnNewColumns)
- Exactly-once processing & checkpoints
- Full working examples

1️⃣ Introduction to Auto Loader
Auto Loader is a Databricks feature that:
- Incrementally ingests new files from cloud storage (ADLS, S3, GCS, or DBFS).
- Works with Structured Streaming via the `cloudFiles` source.
- Supports both batch and streaming ingestion.
- Guarantees exactly-once delivery (no duplicate loads).
- Can scale to millions of files per hour.
📌 COPY INTO is retriable and idempotent and works well for batch-style loads; Auto Loader is designed for large-scale, continuous ingestion.
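For contrast, a minimal COPY INTO sketch, runnable from a Databricks notebook (the target table and landing path are hypothetical examples):

```python
# Hypothetical table and path. COPY INTO is retriable and idempotent:
# rerunning this statement does not reload files it has already copied.
spark.sql("""
    COPY INTO dev.bronze.sales_data
    FROM 'dbfs:/mnt/landing/'
    FILEFORMAT = CSV
    FORMAT_OPTIONS ('header' = 'true')
    COPY_OPTIONS ('mergeSchema' = 'true')
""")
```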
2️⃣ File Detection Modes in Auto Loader
Auto Loader uses two file detection modes to track new files:
🔹 a) Directory Listing (Default)
- Uses cloud storage list API calls.
- Tracks processed files in the checkpoint (RocksDB).
- Works out of the box.
- Best for low-to-medium ingestion volumes.
```python
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .load("dbfs:/mnt/landing/year=*/month=*/day=*"))
```
🔹 b) File Notification
- Uses event services (Azure Event Grid + Queue, AWS S3 + SQS, GCP Pub/Sub).
- Requires elevated cloud permissions to create these services.
- Efficient for very large ingestion pipelines.
```python
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.useNotifications", "true")
    .load("s3://mybucket/landing"))
```
📌 Tip: Start with Directory Listing → move to File Notification at scale.
3️⃣ Using Auto Loader in Databricks
Example: Reading from Nested Folder Structure
Suppose files are stored as:
`/landing/year=2024/month=08/day=30/file.csv`
We can read them with wildcards:
```python
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("pathGlobFilter", "*.csv")
    .option("header", "true")
    .option("cloudFiles.schemaLocation", "dbfs:/mnt/checkpoints/schema1")
    .load("dbfs:/mnt/landing/year=*/month=*/day=*"))
```
Key options:
- `cloudFiles.format` → input format (`csv`, `json`, `parquet`).
- `cloudFiles.schemaLocation` → path to store schema metadata.
- `pathGlobFilter` → filter file extensions.
- `header` → handle CSV headers.
4️⃣ Schema Location in Auto Loader
Auto Loader requires a schema location:
- Stores schema evolution metadata.
- Ensures consistency across multiple runs.
- Lives in the checkpoint directory.
```python
.option("cloudFiles.schemaLocation", "dbfs:/mnt/checkpoints/autoloader/schema")
```
Inside this folder:
- `_schemas/` → schema history
- `rocksdb/` → file tracking for exactly-once
5️⃣ Schema Hints in Auto Loader
Instead of defining the entire schema, you can hint only specific columns.
```python
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaHints", "Quantity INT, UnitPrice DOUBLE")
    .load("dbfs:/mnt/landing"))
```
This:
- Infers other columns automatically.
- Forces `Quantity` to integer and `UnitPrice` to double.
✅ Useful when the schema is evolving but certain columns should remain strongly typed.
6️⃣ Writing with Auto Loader
Write the DataFrame to a Delta table, with a checkpoint for exactly-once delivery.
```python
from pyspark.sql import functions as F

(df.withColumn("file_name", F.input_file_name())  # track source file
    .writeStream
    .option("checkpointLocation", "dbfs:/mnt/checkpoints/autoloader/run1")
    .outputMode("append")
    .trigger(availableNow=True)  # batch-like run
    .toTable("dev.bronze.sales_data"))
```
- `checkpointLocation` → prevents reprocessing of old files.
- `trigger(availableNow=True)` → processes once in batch style.
- `toTable()` → saves to a Delta table.
7️⃣ Schema Evolution in Auto Loader
When new columns appear in source files, Auto Loader handles them via one of four schema evolution modes:
🔹 a) addNewColumns (Default)
- If new columns are detected:
- The stream fails once and updates the schema in the schema location.
- On rerun, the stream succeeds and the new columns appear in the table.
```python
.option("cloudFiles.schemaEvolutionMode", "addNewColumns")
```
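Since addNewColumns intentionally fails the stream once, orchestration usually just restarts it. A minimal, generic retry helper (a hypothetical illustration, not a Databricks API) could look like:

```python
def run_with_retry(start_stream, max_retries=1):
    """Invoke start_stream(); if it raises (e.g. Auto Loader failing the
    stream after detecting new columns), retry up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return start_stream()
        except Exception:
            if attempt == max_retries:
                raise  # still failing after retries: surface the error
```

In practice, a Databricks Job configured with automatic retries achieves the same effect without custom code.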
🔹 b) rescue
- New columns are pushed into a special `_rescued_data` column.
- The stream does not fail.
- Useful when schema changes are frequent.
```python
.option("cloudFiles.schemaEvolutionMode", "rescue")
```
Example Output:
| InvoiceNo | Quantity | _rescued_data |
|---|---|---|
| 12345 | 10 | {"State": "CA"} |
🔹 c) none
- Ignores new columns completely.
- No schema updates and no `_rescued_data` column.
```python
.option("cloudFiles.schemaEvolutionMode", "none")
```
🔹 d) failOnNewColumns
- If new columns appear → the stream fails.
- Requires a manual schema update before restarting.
```python
.option("cloudFiles.schemaEvolutionMode", "failOnNewColumns")
```
8️⃣ Incremental Ingestion & Exactly-Once
- Processed files are tracked in RocksDB inside the checkpoint.
- Already-processed files are not re-ingested.
- Only new files are picked up → incremental load.
✅ This ensures idempotent, exactly-once ingestion.
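Conceptually, the checkpoint acts as a ledger of processed files. A toy pure-Python sketch of the idea (the real implementation uses RocksDB, not an in-memory set):

```python
class FileLedger:
    """Toy model of Auto Loader's file tracking: each file path is
    recorded once, so re-listing a directory yields no duplicates."""

    def __init__(self):
        self._seen = set()

    def new_files(self, listed_files):
        # Only files never seen before are returned for ingestion.
        fresh = [f for f in listed_files if f not in self._seen]
        self._seen.update(fresh)
        return fresh
```

Listing the same directory twice ingests each file only once: the second call returns only files added since the first.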
9️⃣ File Notification Mode (Advanced)
Enable with:
```python
.option("cloudFiles.useNotifications", "true")
```
This:
- Creates event-based triggers in your cloud account.
- Requires permissions to provision Event Grid (Azure), SQS (AWS), Pub/Sub (GCP).
- Best for large-scale ingestion with low latency.
🔁 Putting It All Together: Example Pipeline
```python
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "dbfs:/mnt/checkpoints/schema")
    .option("cloudFiles.schemaHints", "Quantity INT, UnitPrice DOUBLE")
    .option("cloudFiles.schemaEvolutionMode", "rescue")
    .load("dbfs:/mnt/landing/year=*/month=*/day=*"))

(df.writeStream
    .option("checkpointLocation", "dbfs:/mnt/checkpoints/autoloader/full_pipeline")
    .outputMode("append")
    .trigger(availableNow=True)
    .toTable("dev.bronze.sales_data"))
```
✅ Summary
- Auto Loader is the preferred way to ingest files into Databricks Lakehouse.
- File Detection Modes:
- Directory Listing (default).
- File Notification (event-driven, needs cloud permissions).
- Schema Handling:
- Schema Location → tracks schema history.
- Schema Hints → enforce types for specific columns.
- Schema Evolution Modes → handle new columns gracefully:
- addNewColumns (default)
- rescue
- none
- failOnNewColumns
- Checkpoints ensure exactly-once ingestion.
- Use the `availableNow` trigger for batch-like runs, or streaming triggers for continuous pipelines.
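For a continuous pipeline, swap `availableNow` for a processing-time trigger. A hedged sketch (the checkpoint path and one-minute interval are example values):

```python
# Continuous variant of the write above: a micro-batch runs every minute
# instead of a single batch-style pass.
(df.writeStream
    .option("checkpointLocation", "dbfs:/mnt/checkpoints/autoloader/continuous")
    .outputMode("append")
    .trigger(processingTime="1 minute")
    .toTable("dev.bronze.sales_data"))
```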