Databricks: Delta Tables MERGE & UPSERT (SCD1 + Soft Deletes)


This tutorial covers how to perform upserts (MERGE) in Delta tables on Databricks, with both hard deletes and soft deletes (using SCD1 style).


1. πŸ”Ή Introduction

In Delta Lake, the MERGE operation (also called UPSERT) lets you:

  • Update rows in the target table when keys match.
  • Insert rows when keys don’t exist.
  • Delete or mark inactive rows when they exist in target but not in source.

πŸ‘‰ This is the same concept as Slowly Changing Dimension Type 1 (SCD1) in data warehousing.


2. πŸ”Ή Setup: Create EMP (target) table

We’ll use an EMP table as the target.

CREATE TABLE IF NOT EXISTS dev.bronze.emp (
  emp_id INT,
  emp_name STRING,
  dept_code STRING,
  salary DOUBLE
);

-- Insert initial data
INSERT INTO dev.bronze.emp VALUES
(101, 'Alice', 'D101', 10000),
(102, 'Bob',   'D102', 12000),
(103, 'Charlie', 'D101', 11000),
(104, 'David', 'D103', 9000),
(105, 'Eve',   'D104', 9500);

-- Check data
SELECT * FROM dev.bronze.emp;

3. πŸ”Ή Create EMP_UPDATES (source) table

This table will hold updates and new rows.

CREATE TABLE dev.bronze.emp_updates (
  emp_id INT,
  emp_name STRING,
  dept_code STRING,
  salary DOUBLE
);

-- Insert updates (existing employees with new salary)
INSERT INTO dev.bronze.emp_updates VALUES
(101, 'Alice', 'D101', 15000),
(102, 'Bob',   'D102', 16000),
(104, 'David', 'D103', 17000);

-- Insert a new employee
INSERT INTO dev.bronze.emp_updates VALUES
(106, 'Rohan', 'D103', 10000);

-- Check source data
SELECT * FROM dev.bronze.emp_updates;

πŸ‘‰ Now we have 3 updates + 1 new record.


4. πŸ”Ή MERGE for Upserts

Update existing employees, insert new ones.

MERGE INTO dev.bronze.emp AS e
USING dev.bronze.emp_updates AS u
ON e.emp_id = u.emp_id

WHEN MATCHED THEN
  UPDATE SET e.salary = u.salary

WHEN NOT MATCHED THEN
  INSERT *;

πŸ“Œ Results:

  • Alice (101), Bob (102), David (104) β†’ salary updated.
  • Rohan (106) β†’ inserted as new row.

Check:

SELECT * FROM dev.bronze.emp;

5. πŸ”Ή Hard Deletes (Remove missing records)

Suppose source doesn’t have emp_id 103 and 105.
We want to delete them from target.

MERGE INTO dev.bronze.emp AS e
USING dev.bronze.emp_updates AS u
ON e.emp_id = u.emp_id

WHEN NOT MATCHED BY SOURCE THEN
  DELETE;

πŸ“Œ emp_id 103, 105 β†’ deleted.

Check:

SELECT * FROM dev.bronze.emp;

6. πŸ”Ή Soft Deletes (SCD1 style)

Instead of deleting, we mark records inactive.

Step 1: Add column

ALTER TABLE dev.bronze.emp ADD COLUMNS (is_active STRING);

πŸ‘‰ All existing rows get NULL by default.


Step 2: Modify MERGE for soft delete

MERGE INTO dev.bronze.emp AS e
USING dev.bronze.emp_updates AS u
ON e.emp_id = u.emp_id

WHEN MATCHED THEN
  UPDATE SET e.salary = u.salary, e.is_active = 'Y'

WHEN NOT MATCHED THEN
  INSERT (emp_id, emp_name, dept_code, salary, is_active)
  VALUES (u.emp_id, u.emp_name, u.dept_code, u.salary, 'Y')

WHEN NOT MATCHED BY SOURCE THEN
  UPDATE SET e.is_active = 'N';

πŸ“Œ Behavior:

  • Updated rows β†’ marked Y.
  • New rows β†’ inserted with Y.
  • Missing rows (e.g., 103, 105) β†’ marked as N instead of being deleted.

Check:

SELECT * FROM dev.bronze.emp;

7. πŸ”Ή Benefits of MERGE

  • Handles incremental loads seamlessly.
  • Supports real-time updates in Delta.
  • Allows hard deletes (completely remove) or soft deletes (mark inactive).
  • Enables auditable, consistent data pipelines.

βœ… Final Recap

  • MERGE (Upsert/SCD1) = Update existing + Insert new.
  • Can also delete missing records (WHEN NOT MATCHED BY SOURCE THEN DELETE).
  • For soft deletes, add a column (e.g., is_active) and mark inactive instead of removing.

Related Posts

Detailed Travel Experiences Shared by HolidayLandmark Forum Members

Introduction Embarking on a new journey is undeniably thrilling, yet the initial phase of piecing together a seamless travel plan can quickly transform into a chaotic exercise…

Read More

Transform Your Journey Using HolidayLandmark Local Travel Marketplace

Introduction The definition of a meaningful vacation is undergoing a massive shift. Modern adventurers are stepping away from rigid itineraries and crowded tourist traps, choosing instead to…

Read More

Understanding Version Control in DataOps Projects Essential Guide

Managing modern data systems feels like working on a high-speed train while laying down the tracks at the same time. Business demands shift by the hour. New…

Read More

Best Practices for Building Reliable Data Pipelines for Analytics

The data engineering team blames a modified upstream API schema, while the analytics team scrambles to fix a broken SQL script. DataOps provides a practical framework designed…

Read More

Complete DevOps Engineer Salary Roadmap for Beginners

Introduction The demand for skilled professionals who can bridge the gap between development and operations has never been higher. As businesses transition from legacy systems to cloud-native…

Read More

Complete DevOps Certification Roadmap For Strategic Career Progression

Introduction The global tech landscape is moving fast, and modern infrastructure is moving even faster. Monolithic, slow-moving systems are dead. Today, business agility depends entirely on distributed…

Read More