1. 🔹 What is COPY INTO?
COPY INTO is a Databricks SQL command that loads data files into Delta tables.

- Supported formats: CSV, JSON, Avro, Parquet, ORC, text, binary.
- Key benefits:
  - ✅ Idempotent → once a file is loaded, re-running COPY INTO will not reload it.
  - ✅ Exactly-once semantics → files are ingested only once, even across retries.
  - ✅ Handles schema inference & evolution.
  - ✅ Scales to thousands of files.
👉 For millions of files or complex directories, use Autoloader instead.
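For comparison, a minimal Autoloader sketch over the same landing folder might look like the following (the schema/checkpoint paths and the invoice_autoloader table name are illustrative, not part of this setup):

%python
# Sketch only: Autoloader (cloudFiles) streaming ingestion of the same folder.
# Schema/checkpoint locations and the target table name are hypothetical examples.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .option("cloudFiles.schemaLocation", "dbfs:/Volumes/dev/bronze/landing/_schema")
    .load("dbfs:/Volumes/dev/bronze/landing/input/")
    .writeStream
    .option("checkpointLocation", "dbfs:/Volumes/dev/bronze/landing/_checkpoint")
    .trigger(availableNow=True)
    .toTable("dev.bronze.invoice_autoloader"))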
2. 🔹 Setup: Managed Volume & Input Files
-- Create managed volume
CREATE VOLUME dev.bronze.landing;

%python
# Create input folder
dbutils.fs.mkdirs("dbfs:/Volumes/dev/bronze/landing/input")

# Copy sample files
dbutils.fs.cp("dbfs:/databricks-datasets/retail-org/invoices/2021-01.csv",
              "dbfs:/Volumes/dev/bronze/landing/input/", recurse=True)
dbutils.fs.cp("dbfs:/databricks-datasets/retail-org/invoices/2021-02.csv",
              "dbfs:/Volumes/dev/bronze/landing/input/", recurse=True)
Now we have two invoice CSV files ready in /landing/input.
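Optional sanity check to confirm both files landed:

%python
# List the landing folder — expect 2021-01.csv and 2021-02.csv
display(dbutils.fs.ls("dbfs:/Volumes/dev/bronze/landing/input/"))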
3. 🔹 Placeholder Delta Table
We can create a table without a schema → COPY INTO will infer the columns automatically.
CREATE TABLE dev.bronze.invoice_cp;
This is an empty Delta table with no defined schema.
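To confirm it really is an empty Delta table, a quick check (sketch) with DESCRIBE DETAIL:

%python
# format = delta, numFiles = 0 — an empty Delta table waiting for COPY INTO
display(spark.sql("DESCRIBE DETAIL dev.bronze.invoice_cp")
        .select("format", "numFiles", "location"))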
4. 🔹 COPY INTO Command
COPY INTO dev.bronze.invoice_cp
FROM 'dbfs:/Volumes/dev/bronze/landing/input/'
FILEFORMAT = CSV
PATTERN = '*.csv'
FORMAT_OPTIONS ('mergeSchema' = 'true', 'header' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
✅ Loads all CSV files once.
✅ If rerun, already-loaded files are skipped (thanks to COPY INTO's metadata tracking).
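To see this in action, COPY INTO returns ingestion metrics, so rerunning the same statement should report zero newly inserted rows (a sketch — exact metric column names may vary by runtime):

%python
# Rerun the same COPY INTO and inspect the returned metrics —
# on the second run, num_inserted_rows is expected to be 0 because the files are already tracked.
result = spark.sql("""
  COPY INTO dev.bronze.invoice_cp
  FROM 'dbfs:/Volumes/dev/bronze/landing/input/'
  FILEFORMAT = CSV
  PATTERN = '*.csv'
  FORMAT_OPTIONS ('mergeSchema' = 'true', 'header' = 'true')
  COPY_OPTIONS ('mergeSchema' = 'true')
""")
display(result)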
5. 🔹 How Metadata is Tracked
COPY INTO maintains logs in the Delta table's _delta_log/copy-into-log directory.

- Each ingested file path + checksum is recorded.
- On rerun → skips already-processed files.
Check metadata:
DESCRIBE EXTENDED dev.bronze.invoice_cp;
Look in storage → _delta_log/copy-into-log.
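You can also browse the tracking files directly; a sketch assuming the table location reported above and the copy-into-log path mentioned earlier:

%python
# Look up the table's storage location, then list the COPY INTO tracking directory
table_location = spark.sql("DESCRIBE DETAIL dev.bronze.invoice_cp").collect()[0]["location"]
display(dbutils.fs.ls(table_location + "/_delta_log/copy-into-log"))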
6. 🔹 Transforming Data While Loading
You can transform/select columns during ingestion.
Example: Create table with selected columns
CREATE TABLE dev.bronze.invoice_alt (
  invoice_number STRING,
  stock_code STRING,
  quantity DOUBLE,
  insert_date TIMESTAMP
);
COPY INTO with transformations
COPY INTO dev.bronze.invoice_alt
FROM (
  SELECT 
    InvoiceNo as invoice_number,
    StockCode as stock_code,
    CAST(Quantity AS DOUBLE) as quantity,
    current_timestamp() as insert_date
  FROM 'dbfs:/Volumes/dev/bronze/landing/input/'
)
FILEFORMAT = CSV
PATTERN = '*.csv'
FORMAT_OPTIONS ('header' = 'true');
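A quick peek confirms the renamed/cast columns and the load timestamp:

%python
# Sample the transformed table — columns should match the invoice_alt schema
display(spark.sql("""
  SELECT invoice_number, stock_code, quantity, insert_date
  FROM dev.bronze.invoice_alt
  LIMIT 5
"""))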
7. 🔹 Incremental Loads
If you add a new file (2021-03.csv) and rerun, COPY INTO ingests only that file:
dbutils.fs.cp("dbfs:/databricks-datasets/retail-org/invoices/2021-03.csv",
              "dbfs:/Volumes/dev/bronze/landing/input/", recurse=True)
Then rerun the same COPY INTO with transformations (column names must match the invoice_alt schema):
COPY INTO dev.bronze.invoice_alt
FROM (
  SELECT InvoiceNo as invoice_number, StockCode as stock_code,
         CAST(Quantity AS DOUBLE) as quantity, current_timestamp() as insert_date
  FROM 'dbfs:/Volumes/dev/bronze/landing/input/'
)
FILEFORMAT = CSV
PATTERN = '*.csv'
FORMAT_OPTIONS ('header' = 'true');
👉 Only the new file's rows will be inserted.
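Because each run stamps current_timestamp() into insert_date, the incremental batch shows up separately from the first load:

%python
# Two distinct insert_date values are expected: one per COPY INTO run
display(spark.sql("""
  SELECT insert_date, COUNT(*) AS row_count
  FROM dev.bronze.invoice_alt
  GROUP BY insert_date
  ORDER BY insert_date
"""))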
8. 🔹 Best Practices
- Use a placeholder table + mergeSchema if the schema is unknown.
- For production, prefer an explicit schema (better governance).
- Keep landing zones clean → avoid re-copying old files.
- For large-scale pipelines: switch to Autoloader.
 
✅ Summary
- COPY INTO = reliable, idempotent, exactly-once ingestion.
- Maintains metadata in Delta logs.
- Supports schema inference, schema evolution, and transformations.
- Ideal for batch pipelines with manageable file counts.
- For huge scale, prefer Autoloader.