πŸš€ Databricks Auto Loader Tutorial

(with Schema Evolution Modes & File Detection Modes)

Auto Loader in Databricks is the recommended way to ingest files incrementally and reliably into the Lakehouse. This tutorial covers:

  • What Auto Loader is and when to use it
  • File detection modes (Directory Listing vs File Notification)
  • Schema handling (Schema Location, Schema Hints)
  • Schema evolution modes (addNewColumns, rescue, none, failOnNewColumns)
  • Exactly-once processing & checkpoints
  • Full working examples

1️⃣ Introduction to Auto Loader

Auto Loader is a Databricks feature that:

  • Incrementally ingests new files from cloud storage (ADLS, S3, GCS, or DBFS).
  • Works with structured streaming via the cloudFiles source.
  • Supports both batch and streaming ingestion.
  • Guarantees exactly-once delivery (no duplicate loads).
  • Can scale to millions of files per hour.

πŸ‘‰ COPY INTO is also retriable and idempotent, but Auto Loader is designed for large-scale, continuous ingestion where millions of files arrive over time.
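For reference, the equivalent one-shot load with COPY INTO looks roughly like this (a minimal sketch; the target table and landing path reuse the illustrative names from later sections):

spark.sql("""
  COPY INTO dev.bronze.sales_data                -- target Delta table must exist before running COPY INTO
  FROM 'dbfs:/mnt/landing/'
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('header' = 'true')
  COPY_OPTIONS ('mergeSchema' = 'true')
""")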


2️⃣ File Detection Modes in Auto Loader

Auto Loader uses two file detection modes to track new files:

πŸ”Ή a) Directory Listing (Default)

  • Uses cloud storage list API calls.
  • Tracks processed files in the checkpoint (RocksDB).
  • Works out-of-the-box.
  • Best for low-to-medium ingestion volumes.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", "dbfs:/mnt/checkpoints/schema1")  # required for schema inference (see 4️⃣ below)
      .load("dbfs:/mnt/landing/year=*/month=*/day=*"))

πŸ”Ή b) File Notification

  • Uses event services (Azure Event Grid + Queue, AWS S3 + SQS, GCP Pub/Sub).
  • Requires elevated cloud permissions to create these services.
  • Efficient for very large ingestion pipelines.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.useNotifications", "true")
      .option("cloudFiles.schemaLocation", "s3://mybucket/_checkpoints/schema")  # required for schema inference
      .load("s3://mybucket/landing"))

πŸ“Œ Tip: Start with Directory Listing β†’ move to File Notification at scale.


3️⃣ Using Auto Loader in Databricks

Example: Reading from Nested Folder Structure

Suppose files are stored as:

/landing/year=2024/month=08/day=30/file.csv

We can read them with wildcards:

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("pathGlobFilter", "*.csv")
      .option("header", "true")
      .option("cloudFiles.schemaLocation", "dbfs:/mnt/checkpoints/schema1")
      .load("dbfs:/mnt/landing/year=*/month=*/day=*"))

Key Options:

  • cloudFiles.format β†’ input format (csv, json, parquet).
  • cloudFiles.schemaLocation β†’ path to store schema metadata.
  • pathGlobFilter β†’ filter file extensions.
  • header β†’ handle CSV headers.

4️⃣ Schema Location in Auto Loader

Auto Loader requires a schema location:

  • Stores schema evolution metadata.
  • Ensures consistency across multiple runs.
  • Lives in the checkpoint directory.
.option("cloudFiles.schemaLocation", "dbfs:/mnt/checkpoints/autoloader/schema")

Inside this folder:

  β€’ _schemas/ β†’ schema version history
  β€’ rocksdb/ (under the stream checkpoint) β†’ file-tracking state for exactly-once processing

5️⃣ Schema Hints in Auto Loader

Instead of defining the entire schema, you can hint only specific columns.

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", "dbfs:/mnt/checkpoints/schema_hints")  # still required; hints only override inference
      .option("cloudFiles.schemaHints", "Quantity INT, UnitPrice DOUBLE")
      .load("dbfs:/mnt/landing"))

This:

  • Infers other columns automatically.
  • Forces Quantity as integer, UnitPrice as double.

βœ… Useful when schema is evolving but certain columns should remain strongly typed.


6️⃣ Writing with Auto Loader

Write DataFrame into a Delta Table with checkpoints for exactly-once.

(df.withColumn("file_name", F.input_file_name())  # track source file
   .writeStream
   .option("checkpointLocation", "dbfs:/mnt/checkpoints/autoloader/run1")
   .outputMode("append")
   .trigger(availableNow=True)   # batch-like run
   .toTable("dev.bronze.sales_data"))
  • checkpointLocation β†’ prevents reprocessing of old files.
  • trigger(availableNow=True) β†’ processes once in batch style.
  • toTable() β†’ saves to a Delta table.

7️⃣ Schema Evolution in Auto Loader

When new columns appear in source files, Auto Loader handles it via schema evolution modes:


πŸ”Ή a) addNewColumns (Default)

  • If new columns are detected:
    • Stream fails once, updates schema in schemaLocation.
    • Rerun β†’ succeeds, new columns appear in the table.
.option("cloudFiles.schemaEvolutionMode", "addNewColumns")

πŸ”Ή b) rescue

  • New columns β†’ pushed into a special column _rescued_data.
  • Stream does not fail.
  • Useful when schema changes are frequent.
.option("cloudFiles.schemaEvolutionMode", "rescue")

Example Output:

InvoiceNo | Quantity | _rescued_data
12345     | 10       | {"State": "CA"}
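To inspect what landed in the rescued column, a minimal sketch (assumes the bronze table name used elsewhere in this tutorial):

from pyspark.sql import functions as F

rescued_rows = (spark.read.table("dev.bronze.sales_data")
                .where(F.col("_rescued_data").isNotNull()))
display(rescued_rows.select("InvoiceNo", "_rescued_data"))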

πŸ”Ή c) none

  • Ignores new columns completely.
  • No schema updates, no _rescued_data.
.option("cloudFiles.schemaEvolutionMode", "none")

πŸ”Ή d) failOnNewColumns

  • If new columns appear β†’ stream fails.
  • Requires manual schema update.
.option("cloudFiles.schemaEvolutionMode", "failOnNewColumns")

8️⃣ Incremental Ingestion & Exactly Once

  • Processed files are tracked in RocksDB inside checkpoint.
  • Already-processed files are not re-ingested.
  • New files only β†’ incremental load.

βœ… This ensures idempotent and exactly-once ingestion.
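You can see which files a given checkpoint has already ingested with the cloud_files_state SQL function (a minimal sketch; availability depends on your Databricks Runtime, and the checkpoint path follows the earlier examples):

state = spark.sql(
    "SELECT * FROM cloud_files_state('dbfs:/mnt/checkpoints/autoloader/run1')"
)
display(state)  # one row per file tracked in the checkpoint's RocksDB state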


9️⃣ File Notification Mode (Advanced)

Enable with:

.option("cloudFiles.useNotifications", "true")

This:

  • Creates event-based triggers in your cloud account.
  • Requires permissions to provision Event Grid (Azure), SQS (AWS), Pub/Sub (GCP).
  • Best for large-scale ingestion with low-latency.

πŸ”Ÿ Putting It All Together β€” Example Pipeline

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", "dbfs:/mnt/checkpoints/schema")
      .option("cloudFiles.schemaHints", "Quantity INT, UnitPrice DOUBLE")
      .option("cloudFiles.schemaEvolutionMode", "rescue")
      .load("dbfs:/mnt/landing/year=*/month=*/day=*"))

(df.writeStream
   .option("checkpointLocation", "dbfs:/mnt/checkpoints/autoloader/full_pipeline")
   .outputMode("append")
   .trigger(availableNow=True)
   .toTable("dev.bronze.sales_data"))

βœ… Summary

  • Auto Loader is the preferred way to ingest files into Databricks Lakehouse.
  • File Detection Modes:
    • Directory Listing (default).
    • File Notification (event-driven, needs cloud perms).
  • Schema Handling:
    • Schema Location β†’ track schema history.
    • Schema Hints β†’ enforce types for specific columns.
    • Schema Evolution Modes β†’ handle new columns gracefully:
      • addNewColumns (default)
      • rescue
      • none
      • failOnNewColumns
  • Checkpoints ensure exactly-once ingestion.
  • Use availableNow trigger for batch-like runs, or streaming triggers for continuous pipelines.
