Databricks: Delta Tables – Deletion Vectors & Liquid Clustering


Delta Lake keeps improving with features that optimize performance and storage. Two of the most important recent features are:

  1. Deletion Vectors (DV) → Efficient deletes without rewriting files.
  2. Liquid Clustering → Automatic clustering for better query performance on high-cardinality data.

Let’s explore both in detail with examples you can run inside Databricks.


🔹 1. Introduction

  • Traditional Delta operations like partitioning and Z-ordering help optimize queries.
  • But:
    • Partitioning struggles with high-cardinality columns.
    • Z-ordering requires rewriting the data each time.
  • Two new improvements solve this:
    • Deletion Vectors for efficient deletes.
    • Liquid Clustering for adaptive clustering without heavy rewrites.

🔹 2. What are Deletion Vectors?

Normally, when you delete or update a row inside a Delta table:

  • The entire Parquet file containing that row must be rewritten.
  • If the file has millions of rows, this is slow and costly.

👉 Deletion Vectors (DV) fix this:

  • Instead of rewriting the file, Delta marks deleted rows with a flag.
  • On subsequent reads, Delta skips flagged rows.
  • Later, when you run OPTIMIZE (maintenance), those flagged rows are permanently removed.

📌 This significantly reduces I/O, time, and costs.


🔹 3. How Deletion Vectors Work (Step by Step with Example)

Step 1: Create a Delta table from CSV

-- Read a CSV into a table
CREATE TABLE dev.bronze.sales
AS SELECT * FROM read_files(
  'dbfs:/databricks-datasets/retail-org/sales_orders.csv',
  format => 'csv',
  header => 'true'
);

-- Check data
SELECT * FROM dev.bronze.sales LIMIT 5;

Step 2: Verify if Deletion Vectors are enabled

DESCRIBE EXTENDED dev.bronze.sales;

Look at Table Properties:

  • delta.enableDeletionVectors = true → DV is enabled.
  • If it’s false, Delta rewrites files on delete.

Step 3: Disable DV and delete a row

ALTER TABLE dev.bronze.sales
SET TBLPROPERTIES ('delta.enableDeletionVectors' = false);

DELETE FROM dev.bronze.sales WHERE invoiceNo = '5464';

-- Check table history
DESCRIBE HISTORY dev.bronze.sales;

📌 You’ll see:

  • Files removed + new files added → full file rewrite happened.
  • numDeletionVectorsAdded = 0.

Step 4: Enable DV and delete again

ALTER TABLE dev.bronze.sales
SET TBLPROPERTIES ('delta.enableDeletionVectors' = true);

DELETE FROM dev.bronze.sales WHERE invoiceNo = '5468';

DESCRIBE HISTORY dev.bronze.sales;

📌 You’ll see:

  • No files rewritten.
  • numDeletionVectorsAdded > 0 → rows flagged as deleted.

Step 5: Clean flagged rows (optional)

OPTIMIZE dev.bronze.sales;

👉 This physically removes flagged rows from Parquet files during compaction.


Key Takeaways – Deletion Vectors

  • Faster deletes and updates.
  • Reduced cost (no frequent rewrites).
  • Permanent cleanup via OPTIMIZE.

🔹 4. What is Liquid Clustering?

Partitioning and Z-ordering optimize queries but come with issues:

  • Partitioning → inefficient with too many/too few partitions.
  • Z-ordering → requires data rewrite for new inserts.

👉 Liquid Clustering is the solution:

  • Automatically clusters new data based on chosen columns.
  • No need for expensive rewrites.
  • Handles skew, high-cardinality, and fast-growing datasets.

📌 Scenarios where Liquid Clustering shines:

  • Queries with filters on high-cardinality columns (e.g., userId, invoiceId).
  • Data with imbalanced distributions.
  • Fast incremental loads (streaming or batch).

🔹 5. How to Enable Liquid Clustering (Examples)

Step 1: Enable on an existing table

ALTER TABLE dev.bronze.sales
CLUSTER BY (invoiceNo);

Check metadata:

DESCRIBE HISTORY dev.bronze.sales;

👉 Shows clusterBy: invoiceNo.


Step 2: Create a new clustered table

CREATE TABLE dev.bronze.sales_cd
CLUSTER BY (invoiceNo)
AS SELECT * FROM dev.bronze.sales;

Check:

DESCRIBE EXTENDED dev.bronze.sales_cd;

👉 Clustering Columns: invoiceNo.


Step 3: Run a query with clustering column

SELECT * 
FROM dev.bronze.sales
WHERE invoiceNo = '5464';

📌 Query runs much faster because Delta prunes files using clustering metadata.


Step 4: Disable clustering if not needed

ALTER TABLE dev.bronze.sales CLUSTER BY NONE;

Key Takeaways – Liquid Clustering

  • No need for manual ZORDER maintenance.
  • Handles new incoming data automatically.
  • Works best on filter/join columns.
  • Limited: clustering columns must be within the first 32 columns.

🔹 6. Final Recap

Deletion Vectors

  • Efficient deletes/updates with row-level flags.
  • Avoid rewriting large files.
  • Permanent cleanup via OPTIMIZE.

Liquid Clustering

  • Adaptive clustering → incremental-friendly.
  • Solves partitioning & Z-order limitations.
  • Best for high-cardinality columns.

✅ Tutorial Summary

  • Use Deletion Vectors when you need frequent deletes/updates.
  • Use Liquid Clustering when your queries filter/join on high-cardinality columns.
  • Together, they make Delta tables more scalable, efficient, and cost-effective.

Related Posts

Strategic Cloud Financial Management With Certified FinOps Professional Training

Introduction The Certified FinOps Professional program is a transformative milestone for any engineer or manager looking to master the intersection of finance, technology, and business operations. This…

Read More

Professional Certified FinOps Engineer improves financial performance visibility systems

Introduction In the modern landscape of cloud infrastructure, technical expertise alone is no longer sufficient to drive enterprise success. The Certified FinOps Engineer program has emerged as…

Read More

Complete Cloud Financial Management Guide for Certified FinOps Manager

Introduction The Certified FinOps Manager program is designed to bridge the widening gap between cloud engineering and financial accountability. As cloud environments become more complex, organizations require…

Read More

Industry Ready FinOps Knowledge Through Certified FinOps Architect Program

Introduction The Certified FinOps Architect certification is designed to help professionals bridge the gap between cloud financial management and operational efficiency. This guide is tailored for working…

Read More

Advance Your Data Management Career with CDOM – Certified DataOps Manager

The CDOM – Certified DataOps Manager is a breakthrough certification designed for professionals who want to master the intersection of data engineering and operational agility. This guide…

Read More

Future focused learning with CDOA – Certified DataOps Architect certification

Introduction The CDOA – Certified DataOps Architect is a professional designed to bridge the gap between data engineering and operational excellence. This guide is written for engineers…

Read More