Databricks: Delta Tables – Deletion Vectors & Liquid Clustering


Delta Lake keeps improving with features that optimize performance and storage. Two of the most important recent features are:

  1. Deletion Vectors (DV) → Efficient deletes without rewriting files.
  2. Liquid Clustering → Automatic clustering for better query performance on high-cardinality data.

Let’s explore both in detail with examples you can run inside Databricks.


🔹 1. Introduction

  • Traditional Delta operations like partitioning and Z-ordering help optimize queries.
  • But:
    • Partitioning struggles with high-cardinality columns.
    • Z-ordering requires rewriting the data each time.
  • Two new improvements solve this:
    • Deletion Vectors for efficient deletes.
    • Liquid Clustering for adaptive clustering without heavy rewrites.

🔹 2. What are Deletion Vectors?

Normally, when you delete or update a row inside a Delta table:

  • The entire Parquet file containing that row must be rewritten.
  • If the file has millions of rows, this is slow and costly.

👉 Deletion Vectors (DV) fix this:

  • Instead of rewriting the file, Delta marks deleted rows with a flag.
  • On subsequent reads, Delta skips flagged rows.
  • Later, when you run OPTIMIZE (maintenance), those flagged rows are permanently removed.

📌 This significantly reduces I/O, time, and costs.


🔹 3. How Deletion Vectors Work (Step by Step with Example)

Step 1: Create a Delta table from CSV

-- Read a CSV into a table
CREATE TABLE dev.bronze.sales
AS SELECT * FROM read_files(
  'dbfs:/databricks-datasets/retail-org/sales_orders.csv',
  format => 'csv',
  header => 'true'
);

-- Check data
SELECT * FROM dev.bronze.sales LIMIT 5;

Step 2: Verify if Deletion Vectors are enabled

DESCRIBE EXTENDED dev.bronze.sales;

Look at Table Properties:

  • delta.enableDeletionVectors = true → DV is enabled.
  • If it’s false, Delta rewrites files on delete.

Step 3: Disable DV and delete a row

ALTER TABLE dev.bronze.sales
SET TBLPROPERTIES ('delta.enableDeletionVectors' = false);

DELETE FROM dev.bronze.sales WHERE invoiceNo = '5464';

-- Check table history
DESCRIBE HISTORY dev.bronze.sales;

📌 You’ll see:

  • Files removed + new files added → full file rewrite happened.
  • numDeletionVectorsAdded = 0.

Step 4: Enable DV and delete again

ALTER TABLE dev.bronze.sales
SET TBLPROPERTIES ('delta.enableDeletionVectors' = true);

DELETE FROM dev.bronze.sales WHERE invoiceNo = '5468';

DESCRIBE HISTORY dev.bronze.sales;

📌 You’ll see:

  • No files rewritten.
  • numDeletionVectorsAdded > 0 → rows flagged as deleted.

Step 5: Clean flagged rows (optional)

OPTIMIZE dev.bronze.sales;

👉 This physically removes flagged rows from Parquet files during compaction.


Key Takeaways – Deletion Vectors

  • Faster deletes and updates.
  • Reduced cost (no frequent rewrites).
  • Permanent cleanup via OPTIMIZE.

🔹 4. What is Liquid Clustering?

Partitioning and Z-ordering optimize queries but come with issues:

  • Partitioning → inefficient with too many/too few partitions.
  • Z-ordering → requires data rewrite for new inserts.

👉 Liquid Clustering is the solution:

  • Automatically clusters new data based on chosen columns.
  • No need for expensive rewrites.
  • Handles skew, high-cardinality, and fast-growing datasets.

📌 Scenarios where Liquid Clustering shines:

  • Queries with filters on high-cardinality columns (e.g., userId, invoiceId).
  • Data with imbalanced distributions.
  • Fast incremental loads (streaming or batch).

🔹 5. How to Enable Liquid Clustering (Examples)

Step 1: Enable on an existing table

ALTER TABLE dev.bronze.sales
CLUSTER BY (invoiceNo);

Check metadata:

DESCRIBE HISTORY dev.bronze.sales;

👉 Shows clusterBy: invoiceNo.


Step 2: Create a new clustered table

CREATE TABLE dev.bronze.sales_cd
CLUSTER BY (invoiceNo)
AS SELECT * FROM dev.bronze.sales;

Check:

DESCRIBE EXTENDED dev.bronze.sales_cd;

👉 Clustering Columns: invoiceNo.


Step 3: Run a query with clustering column

SELECT * 
FROM dev.bronze.sales
WHERE invoiceNo = '5464';

📌 Query runs much faster because Delta prunes files using clustering metadata.


Step 4: Disable clustering if not needed

ALTER TABLE dev.bronze.sales CLUSTER BY NONE;

Key Takeaways – Liquid Clustering

  • No need for manual ZORDER maintenance.
  • Handles new incoming data automatically.
  • Works best on filter/join columns.
  • Limited: clustering columns must be within the first 32 columns.

🔹 6. Final Recap

Deletion Vectors

  • Efficient deletes/updates with row-level flags.
  • Avoid rewriting large files.
  • Permanent cleanup via OPTIMIZE.

Liquid Clustering

  • Adaptive clustering → incremental-friendly.
  • Solves partitioning & Z-order limitations.
  • Best for high-cardinality columns.

✅ Tutorial Summary

  • Use Deletion Vectors when you need frequent deletes/updates.
  • Use Liquid Clustering when your queries filter/join on high-cardinality columns.
  • Together, they make Delta tables more scalable, efficient, and cost-effective.

Related Posts

Ultimate Career Guide: Best Practices for Entry-Level DataOps Professionals

Introduction Data is now one of the most important assets for modern organizations. Companies depend on data pipelines, analytics dashboards, reporting systems, cloud platforms, and automated workflows…

Read More

Understanding Fundamental Analysis of Stocks for Long Term Equity Investing

Introduction Stepping into the financial world can feel overwhelming, but securing high-quality stock market education is the ultimate way to build long-term wealth. For individuals starting their…

Read More

A Complete Review of the Top Rank Tracking Tools for Local & Global Scale

To win in the modern digital landscape, visibility is everything. Growing brands and busy agencies frequently struggle to balance keyword tracking, technical audits, content creation, creator outreach,…

Read More

Modern DevOps Consulting for Cloud and Kubernetes Success

Introduction Digital‑first businesses are under intense pressure to ship faster, stay secure, and scale reliably across complex multi‑cloud environments. Traditional ways of building and operating software cannot…

Read More

Enterprise DevOps: A Beginner Guide to Scaling IT

Introduction Modern enterprises face the monumental challenge of delivering software at breakneck speeds without sacrificing infrastructure stability. Relying on isolated development and operations teams is no longer…

Read More

Introduction to Automation Testing in DataOps: A Beginner’s Guide

Introduction In modern data engineering, building a data pipeline is only half the battle. The real challenge lies in ensuring that the data flowing through these pipelines…

Read More