Databricks: Delta Tables MERGE & UPSERT (SCD1 + Soft Deletes)


This tutorial covers how to perform upserts (MERGE) in Delta tables on Databricks, with both hard deletes and soft deletes (using SCD1 style).


1. πŸ”Ή Introduction

In Delta Lake, the MERGE operation (also called UPSERT) lets you:

  • Update rows in the target table when keys match.
  • Insert rows when keys don’t exist.
  • Delete or mark inactive rows when they exist in target but not in source.

πŸ‘‰ This is the same concept as Slowly Changing Dimension Type 1 (SCD1) in data warehousing.


2. πŸ”Ή Setup: Create EMP (target) table

We’ll use an EMP table as the target.

CREATE TABLE IF NOT EXISTS dev.bronze.emp (
  emp_id INT,
  emp_name STRING,
  dept_code STRING,
  salary DOUBLE
);

-- Insert initial data
INSERT INTO dev.bronze.emp VALUES
(101, 'Alice', 'D101', 10000),
(102, 'Bob',   'D102', 12000),
(103, 'Charlie', 'D101', 11000),
(104, 'David', 'D103', 9000),
(105, 'Eve',   'D104', 9500);

-- Check data
SELECT * FROM dev.bronze.emp;

3. πŸ”Ή Create EMP_UPDATES (source) table

This table will hold updates and new rows.

CREATE TABLE dev.bronze.emp_updates (
  emp_id INT,
  emp_name STRING,
  dept_code STRING,
  salary DOUBLE
);

-- Insert updates (existing employees with new salary)
INSERT INTO dev.bronze.emp_updates VALUES
(101, 'Alice', 'D101', 15000),
(102, 'Bob',   'D102', 16000),
(104, 'David', 'D103', 17000);

-- Insert a new employee
INSERT INTO dev.bronze.emp_updates VALUES
(106, 'Rohan', 'D103', 10000);

-- Check source data
SELECT * FROM dev.bronze.emp_updates;

πŸ‘‰ Now we have 3 updates + 1 new record.


4. πŸ”Ή MERGE for Upserts

Update existing employees, insert new ones.

MERGE INTO dev.bronze.emp AS e
USING dev.bronze.emp_updates AS u
ON e.emp_id = u.emp_id

WHEN MATCHED THEN
  UPDATE SET e.salary = u.salary

WHEN NOT MATCHED THEN
  INSERT *;

πŸ“Œ Results:

  • Alice (101), Bob (102), David (104) β†’ salary updated.
  • Rohan (106) β†’ inserted as new row.

Check:

SELECT * FROM dev.bronze.emp;

5. πŸ”Ή Hard Deletes (Remove missing records)

Suppose source doesn’t have emp_id 103 and 105.
We want to delete them from target.

MERGE INTO dev.bronze.emp AS e
USING dev.bronze.emp_updates AS u
ON e.emp_id = u.emp_id

WHEN NOT MATCHED BY SOURCE THEN
  DELETE;

πŸ“Œ emp_id 103, 105 β†’ deleted.

Check:

SELECT * FROM dev.bronze.emp;

6. πŸ”Ή Soft Deletes (SCD1 style)

Instead of deleting, we mark records inactive.

Step 1: Add column

ALTER TABLE dev.bronze.emp ADD COLUMNS (is_active STRING);

πŸ‘‰ All existing rows get NULL by default.


Step 2: Modify MERGE for soft delete

MERGE INTO dev.bronze.emp AS e
USING dev.bronze.emp_updates AS u
ON e.emp_id = u.emp_id

WHEN MATCHED THEN
  UPDATE SET e.salary = u.salary, e.is_active = 'Y'

WHEN NOT MATCHED THEN
  INSERT (emp_id, emp_name, dept_code, salary, is_active)
  VALUES (u.emp_id, u.emp_name, u.dept_code, u.salary, 'Y')

WHEN NOT MATCHED BY SOURCE THEN
  UPDATE SET e.is_active = 'N';

πŸ“Œ Behavior:

  • Updated rows β†’ marked Y.
  • New rows β†’ inserted with Y.
  • Missing rows (e.g., 103, 105) β†’ marked as N instead of being deleted.

Check:

SELECT * FROM dev.bronze.emp;

7. πŸ”Ή Benefits of MERGE

  • Handles incremental loads seamlessly.
  • Supports real-time updates in Delta.
  • Allows hard deletes (completely remove) or soft deletes (mark inactive).
  • Enables auditable, consistent data pipelines.

βœ… Final Recap

  • MERGE (Upsert/SCD1) = Update existing + Insert new.
  • Can also delete missing records (WHEN NOT MATCHED BY SOURCE THEN DELETE).
  • For soft deletes, add a column (e.g., is_active) and mark inactive instead of removing.