Databricks: Delta Tables – Deletion Vectors & Liquid Clustering


Delta Lake keeps improving with features that optimize performance and storage. Two of the most important recent features are:

  1. Deletion Vectors (DV) β†’ Efficient deletes without rewriting files.
  2. Liquid Clustering β†’ Automatic clustering for better query performance on high-cardinality data.

Let’s explore both in detail with examples you can run inside Databricks.


πŸ”Ή 1. Introduction

  • Traditional Delta operations like partitioning and Z-ordering help optimize queries.
  • But:
    • Partitioning struggles with high-cardinality columns.
    • Z-ordering requires rewriting the data each time.
  • Two new improvements solve this:
    • Deletion Vectors for efficient deletes.
    • Liquid Clustering for adaptive clustering without heavy rewrites.

πŸ”Ή 2. What are Deletion Vectors?

Normally, when you delete or update a row inside a Delta table:

  • The entire Parquet file containing that row must be rewritten.
  • If the file has millions of rows, this is slow and costly.

πŸ‘‰ Deletion Vectors (DV) fix this:

  • Instead of rewriting the file, Delta marks deleted rows with a flag.
  • On subsequent reads, Delta skips flagged rows.
  • Later, when you run OPTIMIZE (maintenance), those flagged rows are permanently removed.

πŸ“Œ This significantly reduces I/O, time, and costs.


πŸ”Ή 3. How Deletion Vectors Work (Step by Step with Example)

Step 1: Create a Delta table from CSV

-- Read a CSV into a table
CREATE TABLE dev.bronze.sales
AS SELECT * FROM read_files(
  'dbfs:/databricks-datasets/retail-org/sales_orders.csv',
  format => 'csv',
  header => 'true'
);

-- Check data
SELECT * FROM dev.bronze.sales LIMIT 5;

Step 2: Verify if Deletion Vectors are enabled

DESCRIBE EXTENDED dev.bronze.sales;

Look at Table Properties:

  • delta.enableDeletionVectors = true β†’ DV is enabled.
  • If it’s false, Delta rewrites files on delete.

Step 3: Disable DV and delete a row

ALTER TABLE dev.bronze.sales
SET TBLPROPERTIES ('delta.enableDeletionVectors' = false);

DELETE FROM dev.bronze.sales WHERE invoiceNo = '5464';

-- Check table history
DESCRIBE HISTORY dev.bronze.sales;

πŸ“Œ You’ll see:

  • Files removed + new files added β†’ full file rewrite happened.
  • numDeletionVectorsAdded = 0.

Step 4: Enable DV and delete again

ALTER TABLE dev.bronze.sales
SET TBLPROPERTIES ('delta.enableDeletionVectors' = true);

DELETE FROM dev.bronze.sales WHERE invoiceNo = '5468';

DESCRIBE HISTORY dev.bronze.sales;

πŸ“Œ You’ll see:

  • No files rewritten.
  • numDeletionVectorsAdded > 0 β†’ rows flagged as deleted.

Step 5: Clean flagged rows (optional)

OPTIMIZE dev.bronze.sales;

πŸ‘‰ This physically removes flagged rows from Parquet files during compaction.


βœ… Key Takeaways – Deletion Vectors

  • Faster deletes and updates.
  • Reduced cost (no frequent rewrites).
  • Permanent cleanup via OPTIMIZE.

πŸ”Ή 4. What is Liquid Clustering?

Partitioning and Z-ordering optimize queries but come with issues:

  • Partitioning β†’ inefficient with too many/too few partitions.
  • Z-ordering β†’ requires data rewrite for new inserts.

πŸ‘‰ Liquid Clustering is the solution:

  • Automatically clusters new data based on chosen columns.
  • No need for expensive rewrites.
  • Handles skew, high-cardinality, and fast-growing datasets.

πŸ“Œ Scenarios where Liquid Clustering shines:

  • Queries with filters on high-cardinality columns (e.g., userId, invoiceId).
  • Data with imbalanced distributions.
  • Fast incremental loads (streaming or batch).

πŸ”Ή 5. How to Enable Liquid Clustering (Examples)

Step 1: Enable on an existing table

ALTER TABLE dev.bronze.sales
CLUSTER BY (invoiceNo);

Check metadata:

DESCRIBE HISTORY dev.bronze.sales;

πŸ‘‰ Shows clusterBy: invoiceNo.


Step 2: Create a new clustered table

CREATE TABLE dev.bronze.sales_cd
CLUSTER BY (invoiceNo)
AS SELECT * FROM dev.bronze.sales;

Check:

DESCRIBE EXTENDED dev.bronze.sales_cd;

πŸ‘‰ Clustering Columns: invoiceNo.


Step 3: Run a query with clustering column

SELECT * 
FROM dev.bronze.sales
WHERE invoiceNo = '5464';

πŸ“Œ Query runs much faster because Delta prunes files using clustering metadata.


Step 4: Disable clustering if not needed

ALTER TABLE dev.bronze.sales CLUSTER BY NONE;

βœ… Key Takeaways – Liquid Clustering

  • No need for manual ZORDER maintenance.
  • Handles new incoming data automatically.
  • Works best on filter/join columns.
  • Limited: clustering columns must be within the first 32 columns.

πŸ”Ή 6. Final Recap

Deletion Vectors

  • Efficient deletes/updates with row-level flags.
  • Avoid rewriting large files.
  • Permanent cleanup via OPTIMIZE.

Liquid Clustering

  • Adaptive clustering β†’ incremental-friendly.
  • Solves partitioning & Z-order limitations.
  • Best for high-cardinality columns.

βœ… Tutorial Summary

  • Use Deletion Vectors when you need frequent deletes/updates.
  • Use Liquid Clustering when your queries filter/join on high-cardinality columns.
  • Together, they make Delta tables more scalable, efficient, and cost-effective.