Databricks COPY INTO Command – Idempotent & Exactly-Once Data Loading


1. 🔹 What is COPY INTO?

  • COPY INTO is a Databricks SQL command to load data files into Delta tables.
  • Supported formats: CSV, JSON, Avro, Parquet, ORC, text, binary.
  • Key benefits:
    • Idempotent → once a file is loaded, re-running COPY INTO will not reload it.
    • Exactly-once semantics → files are ingested only once, even across retries.
    • Handles schema inference & evolution.
    • Scales to thousands of files.

👉 For millions of files or complex directories, use Auto Loader instead.


2. 🔹 Setup: Managed Volume & Input Files

-- Create managed volume (SQL cell)
CREATE VOLUME dev.bronze.landing;

%python
# Create input folder (Python cell)
dbutils.fs.mkdirs("dbfs:/Volumes/dev/bronze/landing/input")

# Copy sample files
dbutils.fs.cp("dbfs:/databricks-datasets/retail-org/invoices/2021-01.csv",
              "dbfs:/Volumes/dev/bronze/landing/input/", recurse=True)

dbutils.fs.cp("dbfs:/databricks-datasets/retail-org/invoices/2021-02.csv",
              "dbfs:/Volumes/dev/bronze/landing/input/", recurse=True)

Now we have two invoice CSV files staged in /Volumes/dev/bronze/landing/input.
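
To sanity-check the staging step, you can list the volume directly from SQL. The LIST command below assumes a Unity Catalog volume path on a recent runtime; dbutils.fs.ls from Python works equally well.

-- List the staged files (should show 2021-01.csv and 2021-02.csv)
LIST '/Volumes/dev/bronze/landing/input/';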


3. 🔹 Placeholder Delta Table

We can create the table without a schema → COPY INTO will infer the columns automatically.

CREATE TABLE dev.bronze.invoice_cp;

This is an empty Delta table with no defined schema.
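
To confirm the placeholder really is a Delta table (and to find its storage location, which section 5 relies on), DESCRIBE DETAIL gives a one-row summary; the exact output columns can vary by runtime.

DESCRIBE DETAIL dev.bronze.invoice_cp;
-- Returns format ('delta'), location, numFiles (0 at this point), etc.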


4. 🔹 COPY INTO Command

COPY INTO dev.bronze.invoice_cp
FROM 'dbfs:/Volumes/dev/bronze/landing/input/'
FILEFORMAT = CSV
PATTERN = '*.csv'
FORMAT_OPTIONS ('mergeSchema' = 'true', 'header' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');

✅ Loads all CSV files once.
✅ If rerun, files are skipped (COPY INTO tracks which files have already been loaded).
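
To see the idempotency for yourself, count the rows, rerun the identical statement, and check the command output; the result columns noted in the comment are from the documented COPY INTO result set.

SELECT COUNT(*) FROM dev.bronze.invoice_cp;

-- Rerun the exact same command:
COPY INTO dev.bronze.invoice_cp
FROM 'dbfs:/Volumes/dev/bronze/landing/input/'
FILEFORMAT = CSV
PATTERN = '*.csv'
FORMAT_OPTIONS ('mergeSchema' = 'true', 'header' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
-- Reports num_affected_rows = 0 and num_inserted_rows = 0:
-- both files were recognized as already loaded and skipped.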


5. 🔹 How Metadata is Tracked

  • COPY INTO maintains logs in the Delta table’s _delta_log/copy-into-log directory.
  • Each ingested file’s path and checksum are recorded.
  • On rerun → already-processed files are skipped.

Check the metadata:

DESCRIBE EXTENDED dev.bronze.invoice_cp;

The Location row in the output points to the table’s storage; browse there for the _delta_log/copy-into-log directory.
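
If you ever need to bypass this tracking and reload files deliberately (for example, after truncating the table), COPY INTO supports a documented 'force' copy option:

-- Disable idempotency for this one run; every matching file loads again
COPY INTO dev.bronze.invoice_cp
FROM 'dbfs:/Volumes/dev/bronze/landing/input/'
FILEFORMAT = CSV
PATTERN = '*.csv'
FORMAT_OPTIONS ('mergeSchema' = 'true', 'header' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true', 'force' = 'true');

Use force sparingly → it gives up the exactly-once guarantee, so deduplication becomes your responsibility.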


6. 🔹 Transforming Data While Loading

You can select and transform columns during ingestion by wrapping the source in a SELECT.

Example: create a target table with just the columns we need

CREATE TABLE dev.bronze.invoice_alt (
  invoice_number STRING,
  stock_code STRING,
  quantity DOUBLE,
  insert_date TIMESTAMP
);

Then run COPY INTO with the transformation query:

COPY INTO dev.bronze.invoice_alt
FROM (
  SELECT 
    InvoiceNo AS invoice_number,
    StockCode AS stock_code,
    CAST(Quantity AS DOUBLE) AS quantity,
    current_timestamp() AS insert_date
  FROM 'dbfs:/Volumes/dev/bronze/landing/input/'
)
FILEFORMAT = CSV
PATTERN = '*.csv'
FORMAT_OPTIONS ('header' = 'true');
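
After the load, a quick preview confirms the renamed columns, the DOUBLE cast, and the audit timestamp:

SELECT * FROM dev.bronze.invoice_alt LIMIT 5;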

7. 🔹 Incremental Loads

If you add a new file (2021-03.csv), COPY INTO only ingests that new file:

dbutils.fs.cp("dbfs:/databricks-datasets/retail-org/invoices/2021-03.csv",
              "dbfs:/Volumes/dev/bronze/landing/input/", recurse=True)

Then rerun COPY INTO:

COPY INTO dev.bronze.invoice_alt
FROM 'dbfs:/Volumes/dev/bronze/landing/input/'
FILEFORMAT = CSV
PATTERN = '*.csv'
FORMAT_OPTIONS ('header' = 'true');

👉 Only the new file’s rows will be inserted.
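
Because insert_date is populated with current_timestamp(), which is evaluated once per statement, every COPY INTO run tags its rows with a single load time. That makes incremental batches easy to audit:

-- Rows per load batch; the latest timestamp corresponds to 2021-03.csv
SELECT insert_date, COUNT(*) AS rows_loaded
FROM dev.bronze.invoice_alt
GROUP BY insert_date
ORDER BY insert_date;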


8. 🔹 Best Practices

  • Use a placeholder table + mergeSchema when the schema is unknown.
  • For production, prefer an explicit schema (better governance).
  • Keep landing zones clean → avoid re-copying old files.
  • For large-scale pipelines, switch to Auto Loader (see the sketch below).
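
For comparison, here is a minimal sketch of the Auto Loader route as a Databricks SQL streaming table. The table name invoice_st is hypothetical, and this assumes a runtime that supports streaming tables and the read_files function; options may differ in your environment.

-- Hypothetical streaming-table equivalent of the COPY INTO pipeline
CREATE OR REFRESH STREAMING TABLE dev.bronze.invoice_st
AS SELECT *
FROM STREAM read_files(
  '/Volumes/dev/bronze/landing/input/',
  format => 'csv',
  header => true
);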

✅ Summary

  • COPY INTO = Reliable, idempotent, exactly-once ingestion.
  • Maintains metadata in Delta logs.
  • Supports schema inference, schema evolution, and transformations.
  • Ideal for batch pipelines with manageable file counts.
  • For huge scale, prefer Auto Loader.
