{"id":803,"date":"2025-08-23T15:29:18","date_gmt":"2025-08-23T15:29:18","guid":{"rendered":"https:\/\/dataopsschool.com\/blog\/?p=803"},"modified":"2025-09-01T15:43:15","modified_gmt":"2025-09-01T15:43:15","slug":"databricks-databricks-auto-loader-tutorial","status":"publish","type":"post","link":"https:\/\/dataopsschool.com\/blog\/databricks-databricks-auto-loader-tutorial\/","title":{"rendered":"Databricks: Databricks Auto Loader Tutorial"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"565\" src=\"https:\/\/dataopsschool.com\/blog\/wp-content\/uploads\/2025\/08\/image-15-1024x565.png\" alt=\"\" class=\"wp-image-808\" srcset=\"https:\/\/dataopsschool.com\/blog\/wp-content\/uploads\/2025\/08\/image-15-1024x565.png 1024w, https:\/\/dataopsschool.com\/blog\/wp-content\/uploads\/2025\/08\/image-15-300x166.png 300w, https:\/\/dataopsschool.com\/blog\/wp-content\/uploads\/2025\/08\/image-15-768x424.png 768w, https:\/\/dataopsschool.com\/blog\/wp-content\/uploads\/2025\/08\/image-15-1536x848.png 1536w, https:\/\/dataopsschool.com\/blog\/wp-content\/uploads\/2025\/08\/image-15-2048x1130.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">\ud83d\ude80 Databricks Auto Loader Tutorial<\/h1>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"683\" src=\"https:\/\/dataopsschool.com\/blog\/wp-content\/uploads\/2025\/08\/autoloaders-1024x683.png\" alt=\"\" class=\"wp-image-813\" srcset=\"https:\/\/dataopsschool.com\/blog\/wp-content\/uploads\/2025\/08\/autoloaders-1024x683.png 1024w, https:\/\/dataopsschool.com\/blog\/wp-content\/uploads\/2025\/08\/autoloaders-300x200.png 300w, https:\/\/dataopsschool.com\/blog\/wp-content\/uploads\/2025\/08\/autoloaders-768x512.png 768w, 
https:\/\/dataopsschool.com\/blog\/wp-content\/uploads\/2025\/08\/autoloaders.png 1536w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><strong>(with Schema Evolution Modes &amp; File Detection Modes)<\/strong><\/p>\n\n\n\n<p>Auto Loader in Databricks is the recommended way to <strong>ingest files incrementally and reliably<\/strong> into the Lakehouse. This tutorial covers:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What Auto Loader is and when to use it<\/li>\n\n\n\n<li>File detection modes (Directory Listing vs File Notification)<\/li>\n\n\n\n<li>Schema handling (Schema Location, Schema Hints)<\/li>\n\n\n\n<li>Schema evolution modes (addNewColumns, rescue, none, failOnNewColumns)<\/li>\n\n\n\n<li>Exactly-once processing &amp; checkpoints<\/li>\n\n\n\n<li>Full working examples<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"600\" height=\"315\" src=\"https:\/\/dataopsschool.com\/blog\/wp-content\/uploads\/2025\/08\/image-16.png\" alt=\"\" class=\"wp-image-810\" srcset=\"https:\/\/dataopsschool.com\/blog\/wp-content\/uploads\/2025\/08\/image-16.png 600w, https:\/\/dataopsschool.com\/blog\/wp-content\/uploads\/2025\/08\/image-16-300x158.png 300w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">1\ufe0f\u20e3 Introduction to Auto Loader<\/h2>\n\n\n\n<p>Auto Loader is a Databricks feature that:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Incrementally ingests <strong>new files<\/strong> from cloud storage (ADLS, S3, GCS, or DBFS).<\/li>\n\n\n\n<li>Works with <strong>structured streaming<\/strong> via the <code>cloudFiles<\/code> source.<\/li>\n\n\n\n<li>Supports <strong>both batch and streaming<\/strong> ingestion.<\/li>\n\n\n\n<li>Guarantees <strong>exactly-once delivery<\/strong> (no duplicate loads).<\/li>\n\n\n\n<li>Can scale to <strong>millions of 
files per hour<\/strong>.<\/li>\n<\/ul>\n\n\n\n<p>\ud83d\udc49 <code>COPY INTO<\/code> is also retriable and idempotent, but <strong>Auto Loader is designed for large-scale continuous ingestion<\/strong>.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">2\ufe0f\u20e3 File Detection Modes in Auto Loader<\/h2>\n\n\n\n<p>Auto Loader supports <strong>two file detection modes<\/strong> for discovering new files:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">\ud83d\udd39 a) Directory Listing (Default)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Discovers files with <strong>cloud storage list API<\/strong> calls.<\/li>\n\n\n\n<li>Tracks processed files in a <code>RocksDB<\/code> store inside the checkpoint.<\/li>\n\n\n\n<li>Works <strong>out of the box<\/strong>, with no extra cloud setup.<\/li>\n\n\n\n<li>Best for <strong>low-to-medium ingestion<\/strong> volumes.<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code\"><code># Directory listing is the default mode, so no detection option is set.\n# Schema inference also needs cloudFiles.schemaLocation (shown in section 3).\ndf = (spark.readStream\n      .format(\"cloudFiles\")\n      .option(\"cloudFiles.format\", \"csv\")\n      .load(\"dbfs:\/mnt\/landing\/year=*\/month=*\/day=*\"))\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">\ud83d\udd39 b) File Notification<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Uses <strong>event services<\/strong> (Azure Event Grid + Queue Storage, AWS SNS + SQS, GCP Pub\/Sub).<\/li>\n\n\n\n<li>Requires <strong>elevated cloud permissions<\/strong> to create these services.<\/li>\n\n\n\n<li>Efficient for <strong>very large ingestion pipelines<\/strong>, because it avoids repeatedly listing the input directory.<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code\"><code># Switch from directory listing to event-based file discovery.\ndf = (spark.readStream\n      .format(\"cloudFiles\")\n      .option(\"cloudFiles.format\", \"csv\")\n      .option(\"cloudFiles.useNotifications\", \"true\")\n      .load(\"s3:\/\/mybucket\/landing\"))\n<\/code><\/pre>\n\n\n\n<p>\ud83d\udccc <strong>Tip<\/strong>: Start with Directory Listing and move to File Notification once listing becomes slow or costly at scale.<\/p>\n\n\n\n<hr 
class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">3\ufe0f\u20e3 Using Auto Loader in Databricks<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Example: Reading from a Nested Folder Structure<\/h3>\n\n\n\n<p>Suppose files are stored as:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/landing\/year=2024\/month=08\/day=30\/file.csv\n<\/code><\/pre>\n\n\n\n<p>We can read them with wildcards:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>df = (spark.readStream\n      .format(\"cloudFiles\")\n      .option(\"cloudFiles.format\", \"csv\")\n      .option(\"pathGlobFilter\", \"*.csv\")\n      .option(\"header\", \"true\")\n      .option(\"cloudFiles.schemaLocation\", \"dbfs:\/mnt\/checkpoints\/schema1\")\n      .load(\"dbfs:\/mnt\/landing\/year=*\/month=*\/day=*\"))\n<\/code><\/pre>\n\n\n\n<p>Key Options:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>cloudFiles.format<\/code> \u2192 input file format (for example <code>csv<\/code>, <code>json<\/code>, <code>parquet<\/code>).<\/li>\n\n\n\n<li><code>cloudFiles.schemaLocation<\/code> \u2192 path to <strong>store schema metadata<\/strong>.<\/li>\n\n\n\n<li><code>pathGlobFilter<\/code> \u2192 only ingest files whose names match the pattern (here, <code>*.csv<\/code>).<\/li>\n\n\n\n<li><code>header<\/code> \u2192 treat the first line of each CSV file as column names.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">4\ufe0f\u20e3 Schema Location in Auto Loader<\/h2>\n\n\n\n<p>Auto Loader needs a <strong>schema location<\/strong> whenever it infers the schema:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Stores the inferred schema and each later version of it.<\/li>\n\n\n\n<li>Keeps the schema consistent across runs and restarts.<\/li>\n\n\n\n<li>Can be any path you choose; it is often set to the <strong>checkpoint directory<\/strong>.<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code\"><code>.option(\"cloudFiles.schemaLocation\", \"dbfs:\/mnt\/checkpoints\/autoloader\/schema\")\n<\/code><\/pre>\n\n\n\n<p>Inside this folder:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>_schemas\/<\/code> \u2192 schema 
history<\/li>\n\n\n\n<li><code>rocksdb\/<\/code> \u2192 file tracking for exactly-once processing (part of the checkpoint, so expect it under this folder only when the checkpoint uses the same path)<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">5\ufe0f\u20e3 Schema Hints in Auto Loader<\/h2>\n\n\n\n<p>Instead of defining the <strong>entire schema<\/strong>, you can hint only specific columns.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>df = (spark.readStream\n      .format(\"cloudFiles\")\n      .option(\"cloudFiles.format\", \"csv\")\n      .option(\"cloudFiles.schemaHints\", \"Quantity INT, UnitPrice DOUBLE\")\n      .load(\"dbfs:\/mnt\/landing\"))\n<\/code><\/pre>\n\n\n\n<p>This:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Infers the other columns automatically.<\/li>\n\n\n\n<li>Forces <code>Quantity<\/code> to integer and <code>UnitPrice<\/code> to double.<\/li>\n<\/ul>\n\n\n\n<p>\u2705 Useful when the schema is evolving but certain columns should remain strongly typed.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">6\ufe0f\u20e3 Writing with Auto Loader<\/h2>\n\n\n\n<p>Write the DataFrame to a <strong>Delta Table<\/strong>, with a checkpoint for exactly-once processing.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from pyspark.sql import functions as F\n\n(df.withColumn(\"file_name\", F.input_file_name())  # track source file\n   .writeStream\n   .option(\"checkpointLocation\", \"dbfs:\/mnt\/checkpoints\/autoloader\/run1\")\n   .outputMode(\"append\")\n   .trigger(availableNow=True)   # batch-like run\n   .toTable(\"dev.bronze.sales_data\"))\n<\/code><\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>checkpointLocation<\/code> \u2192 records progress so already-processed files are not reprocessed.<\/li>\n\n\n\n<li><code>trigger(availableNow=True)<\/code> \u2192 processes all files available when the run starts, then stops.<\/li>\n\n\n\n<li><code>toTable()<\/code> \u2192 starts the stream and writes to a Delta table.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">7\ufe0f\u20e3 Schema Evolution in 
Auto Loader<\/h2>\n\n\n\n<p>When new columns appear in source files, Auto Loader handles them according to the configured <strong>schema evolution mode<\/strong>:<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">\ud83d\udd39 a) <code>addNewColumns<\/code> (Default)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>This is the default when you do not supply a schema; with a supplied schema the default is <code>none<\/code>.<\/li>\n\n\n\n<li>If new columns are detected:\n<ul class=\"wp-block-list\">\n<li>The stream <strong>fails once<\/strong>, after recording the new schema in the schema location.<\/li>\n\n\n\n<li>Rerun \u2192 succeeds, and the new columns appear in the table, provided the Delta sink allows schema changes (for example with <code>mergeSchema<\/code>).<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code\"><code>.option(\"cloudFiles.schemaEvolutionMode\", \"addNewColumns\")\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">\ud83d\udd39 b) <code>rescue<\/code><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The schema is never evolved; values from new columns are captured in a special column, <code>_rescued_data<\/code>.<\/li>\n\n\n\n<li>The stream does not fail.<\/li>\n\n\n\n<li>Useful when schema changes are frequent.<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code\"><code>.option(\"cloudFiles.schemaEvolutionMode\", \"rescue\")\n<\/code><\/pre>\n\n\n\n<p><strong>Example Output:<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>InvoiceNo<\/th><th>Quantity<\/th><th>_rescued_data<\/th><\/tr><\/thead><tbody><tr><td>12345<\/td><td>10<\/td><td>{&quot;State&quot;: &quot;CA&quot;}<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">\ud83d\udd39 c) <code>none<\/code><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Ignores new columns completely; the stream does not fail.<\/li>\n\n\n\n<li>No schema updates, and no <code>_rescued_data<\/code> unless a rescued data column is configured explicitly.<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code\"><code>.option(\"cloudFiles.schemaEvolutionMode\", \"none\")\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator 
has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">\ud83d\udd39 d) <code>failOnNewColumns<\/code><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>If new columns appear \u2192 the stream fails.<\/li>\n\n\n\n<li>The stream will not restart until you update the provided schema or remove the offending file.<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code\"><code>.option(\"cloudFiles.schemaEvolutionMode\", \"failOnNewColumns\")\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">8\ufe0f\u20e3 Incremental Ingestion &amp; Exactly Once<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Processed files are tracked in <strong>RocksDB<\/strong> inside the checkpoint.<\/li>\n\n\n\n<li>Already-processed files are <strong>not re-ingested<\/strong>.<\/li>\n\n\n\n<li>Each run picks up new files only \u2192 <strong>incremental load<\/strong>.<\/li>\n<\/ul>\n\n\n\n<p>\u2705 This ensures <strong>idempotent and exactly-once ingestion<\/strong>.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">9\ufe0f\u20e3 File Notification Mode (Advanced)<\/h2>\n\n\n\n<p>Enable with:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>.option(\"cloudFiles.useNotifications\", \"true\")\n<\/code><\/pre>\n\n\n\n<p>This:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Sets up a notification service and a queue in your cloud account, from which Auto Loader reads file events.<\/li>\n\n\n\n<li>Requires permissions to provision Event Grid and Queue Storage (Azure), SNS and SQS (AWS), or Pub\/Sub (GCP).<\/li>\n\n\n\n<li>Best for <strong>large-scale ingestion<\/strong> with low latency.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83d\udd1f Putting It All Together: Example Pipeline<\/h2>\n\n\n\n<pre class=\"wp-block-code\"><code>df = (spark.readStream\n      .format(\"cloudFiles\")\n      .option(\"cloudFiles.format\", \"csv\")\n      .option(\"cloudFiles.schemaLocation\", \"dbfs:\/mnt\/checkpoints\/schema\")\n      .option(\"cloudFiles.schemaHints\", 
\"Quantity INT, UnitPrice DOUBLE\")\n      .option(\"cloudFiles.schemaEvolutionMode\", \"rescue\")\n      .load(\"dbfs:\/mnt\/landing\/year=*\/month=*\/day=*\"))\n\n(df.writeStream\n   .option(\"checkpointLocation\", \"dbfs:\/mnt\/checkpoints\/autoloader\/full_pipeline\")\n   .outputMode(\"append\")\n   .trigger(availableNow=True)\n   .toTable(\"dev.bronze.sales_data\"))\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">\u2705 Summary<\/h1>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Auto Loader<\/strong> is the preferred way to ingest files into Databricks Lakehouse.<\/li>\n\n\n\n<li><strong>File Detection Modes<\/strong>:\n<ul class=\"wp-block-list\">\n<li>Directory Listing (default).<\/li>\n\n\n\n<li>File Notification (event-driven, needs cloud perms).<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Schema Handling<\/strong>:\n<ul class=\"wp-block-list\">\n<li>Schema Location \u2192 track schema history.<\/li>\n\n\n\n<li>Schema Hints \u2192 enforce types for specific columns.<\/li>\n\n\n\n<li>Schema Evolution Modes \u2192 handle new columns gracefully:\n<ul class=\"wp-block-list\">\n<li>addNewColumns (default)<\/li>\n\n\n\n<li>rescue<\/li>\n\n\n\n<li>none<\/li>\n\n\n\n<li>failOnNewColumns<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Checkpoints<\/strong> ensure exactly-once ingestion.<\/li>\n\n\n\n<li>Use <code>availableNow<\/code> trigger for batch-like runs, or streaming triggers for continuous pipelines.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>\ud83d\ude80 Databricks Auto Loader Tutorial (with Schema Evolution Modes &amp; File Detection Modes) Auto Loader in Databricks is the recommended way to ingest files incrementally and reliably&#8230; 
<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-803","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/803","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/comments?post=803"}],"version-history":[{"count":5,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/803\/revisions"}],"predecessor-version":[{"id":828,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/803\/revisions\/828"}],"wp:attachment":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/media?parent=803"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/categories?post=803"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/tags?post=803"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}