{"id":801,"date":"2025-08-23T15:07:14","date_gmt":"2025-08-23T15:07:14","guid":{"rendered":"https:\/\/dataopsschool.com\/blog\/?p=801"},"modified":"2025-08-23T15:07:15","modified_gmt":"2025-08-23T15:07:15","slug":"databricks-databricks-copy-into-command-idempotent-exactly-once-data-loading","status":"publish","type":"post","link":"https:\/\/dataopsschool.com\/blog\/databricks-databricks-copy-into-command-idempotent-exactly-once-data-loading\/","title":{"rendered":"Databricks: Databricks COPY INTO Command \u2013 Idempotent &amp; Exactly-Once Data Loading"},"content":{"rendered":"\n<p><\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">1. \ud83d\udd39 What is <code>COPY INTO<\/code>?<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>COPY INTO<\/code> is a <strong>Databricks SQL command<\/strong> to load data files into Delta tables.<\/li>\n\n\n\n<li><strong>Supported formats<\/strong>: CSV, JSON, Avro, Parquet, ORC, text, binary.<\/li>\n\n\n\n<li><strong>Key benefits<\/strong>:\n<ul class=\"wp-block-list\">\n<li>\u2705 <strong>Idempotent<\/strong> \u2192 Once a file is loaded, re-running <code>COPY INTO<\/code> will <strong>not reload it<\/strong>.<\/li>\n\n\n\n<li>\u2705 <strong>Exactly once semantics<\/strong> \u2192 Files are ingested only once, even across retries.<\/li>\n\n\n\n<li>\u2705 Handles <strong>schema inference &amp; evolution<\/strong>.<\/li>\n\n\n\n<li>\u2705 Scalable for thousands of files.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<p>\ud83d\udc49 For <strong>millions of files or complex directories<\/strong>, use <strong>Autoloader<\/strong> instead.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">2. \ud83d\udd39 Setup: Managed Volume &amp; Input Files<\/h2>\n\n\n\n<pre class=\"wp-block-code\"><code>-- Create managed volume\nCREATE VOLUME dev.bronze.landing;\n\n-- Create input folder\n%python\ndbutils.fs.mkdirs(\"dbfs:\/Volumes\/dev\/bronze\/landing\/input\")\n\n-- Copy sample files\ndbutils.fs.cp(\"dbfs:\/databricks-datasets\/retail-org\/invoices\/2021-01.csv\",\n              \"dbfs:\/Volumes\/dev\/bronze\/landing\/input\/\", recurse=True)\n\ndbutils.fs.cp(\"dbfs:\/databricks-datasets\/retail-org\/invoices\/2021-02.csv\",\n              \"dbfs:\/Volumes\/dev\/bronze\/landing\/input\/\", recurse=True)\n<\/code><\/pre>\n\n\n\n<p>Now we have two invoice CSV files ready in <code>\/landing\/input<\/code>.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">3. \ud83d\udd39 Placeholder Delta Table<\/h2>\n\n\n\n<p>We can create a <strong>table without schema<\/strong> \u2192 COPY INTO will infer columns automatically.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>CREATE TABLE dev.bronze.invoice_cp;\n<\/code><\/pre>\n\n\n\n<p>This is an <strong>empty Delta table<\/strong> with no defined schema.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">4. \ud83d\udd39 COPY INTO Command<\/h2>\n\n\n\n<pre class=\"wp-block-code\"><code>COPY INTO dev.bronze.invoice_cp\nFROM 'dbfs:\/Volumes\/dev\/bronze\/landing\/input\/'\nFILEFORMAT = CSV\nPATTERN = '*.csv'\nFORMAT_OPTIONS ('mergeSchema' = 'true', 'header' = 'true')\nCOPY_OPTIONS ('mergeSchema' = 'true');\n<\/code><\/pre>\n\n\n\n<p>\u2705 Loads all CSV files once.<br>\u2705 If rerun, <strong>skipped<\/strong> (because of COPY INTO metadata tracking).<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">5. \ud83d\udd39 How Metadata is Tracked<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>COPY INTO<\/code> maintains logs in the <strong>Delta table\u2019s <code>_delta_log\/copy-into-log<\/code><\/strong> directory.<\/li>\n\n\n\n<li>Each ingested file path + checksum is recorded.<\/li>\n\n\n\n<li>On rerun \u2192 Skips already-processed files.<\/li>\n<\/ul>\n\n\n\n<p>Check metadata:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>DESCRIBE EXTENDED dev.bronze.invoice_cp;\n<\/code><\/pre>\n\n\n\n<p>Look in storage \u2192 <code>_delta_log\/copy-into-log<\/code>.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">6. \ud83d\udd39 Transforming Data While Loading<\/h2>\n\n\n\n<p>You can transform\/select columns during ingestion.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Example: Create table with selected columns<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>CREATE TABLE dev.bronze.invoice_alt (\n  invoice_number STRING,\n  stock_code STRING,\n  quantity DOUBLE,\n  insert_date TIMESTAMP\n);\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">COPY INTO with transformations<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>COPY INTO dev.bronze.invoice_alt\nFROM (\n  SELECT \n    InvoiceNo as invoice_number,\n    StockCode as stock_code,\n    CAST(Quantity AS DOUBLE) as quantity,\n    current_timestamp() as insert_date\n  FROM 'dbfs:\/Volumes\/dev\/bronze\/landing\/input\/'\n)\nFILEFORMAT = CSV\nPATTERN = '*.csv'\nFORMAT_OPTIONS ('header' = 'true');\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">7. \ud83d\udd39 Incremental Loads<\/h2>\n\n\n\n<p>If you add new files (<code>2021-03.csv<\/code>), COPY INTO only ingests the <strong>new file<\/strong>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>dbutils.fs.cp(\"dbfs:\/databricks-datasets\/retail-org\/invoices\/2021-03.csv\",\n              \"dbfs:\/Volumes\/dev\/bronze\/landing\/input\/\", recurse=True)\n<\/code><\/pre>\n\n\n\n<p>Then rerun COPY INTO:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>COPY INTO dev.bronze.invoice_alt\nFROM 'dbfs:\/Volumes\/dev\/bronze\/landing\/input\/'\nFILEFORMAT = CSV\nPATTERN = '*.csv'\nFORMAT_OPTIONS ('header' = 'true');\n<\/code><\/pre>\n\n\n\n<p>\ud83d\udc49 Only the new file\u2019s rows will be inserted.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">8. \ud83d\udd39 Best Practices<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use <strong>placeholder table + mergeSchema<\/strong> if schema unknown.<\/li>\n\n\n\n<li>For production, prefer <strong>explicit schema<\/strong> (better governance).<\/li>\n\n\n\n<li>Keep <strong>landing zones clean<\/strong> \u2192 Avoid re-copying old files.<\/li>\n\n\n\n<li>For <strong>large scale pipelines<\/strong>: switch to <strong>Autoloader<\/strong>.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">\u2705 Summary<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>COPY INTO<\/code> = Reliable, idempotent, exactly-once ingestion.<\/li>\n\n\n\n<li>Maintains metadata in <strong>Delta logs<\/strong>.<\/li>\n\n\n\n<li>Supports <strong>schema inference, schema evolution, and transformations<\/strong>.<\/li>\n\n\n\n<li>Ideal for <strong>batch pipelines with manageable file counts<\/strong>.<\/li>\n\n\n\n<li>For huge scale, prefer <strong>Autoloader<\/strong>.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>1. \ud83d\udd39 What is COPY INTO? \ud83d\udc49 For millions of files or complex directories, use Autoloader instead. 2. \ud83d\udd39 Setup: Managed Volume &amp; Input Files Now we&#8230; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-801","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/801","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/comments?post=801"}],"version-history":[{"count":1,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/801\/revisions"}],"predecessor-version":[{"id":802,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/801\/revisions\/802"}],"wp:attachment":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/media?parent=801"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/categories?post=801"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/tags?post=801"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}