Rajesh Kumar August 22, 2025 0

Delta Lake keeps improving with features that optimize performance and storage. Two of the most important recent features are:

Deletion Vectors (DV) → Efficient deletes without rewriting files.
Liquid Clustering → Automatic clustering for better query performance on high-cardinality data.

Let’s explore both in detail with examples you can run inside Databricks.

🔹 1. Introduction

Traditional Delta operations like partitioning and Z-ordering help optimize queries.
But:
- Partitioning struggles with high-cardinality columns.
- Z-ordering requires rewriting the data each time.
Two new improvements solve this:
- Deletion Vectors for efficient deletes.
- Liquid Clustering for adaptive clustering without heavy rewrites.

🔹 2. What are Deletion Vectors?

Normally, when you delete or update a row inside a Delta table:

The entire Parquet file containing that row must be rewritten.
If the file has millions of rows, this is slow and costly.

👉 Deletion Vectors (DV) fix this:

Instead of rewriting the file, Delta marks deleted rows with a flag.
On subsequent reads, Delta skips flagged rows.
Later, when you run OPTIMIZE (maintenance), those flagged rows are permanently removed.

📌 This significantly reduces I/O, time, and costs.

🔹 3. How Deletion Vectors Work (Step by Step with Example)

Step 1: Create a Delta table from CSV

-- Read a CSV into a table
CREATE TABLE dev.bronze.sales
AS SELECT * FROM read_files(
  'dbfs:/databricks-datasets/retail-org/sales_orders.csv',
  format => 'csv',
  header => 'true'
);

-- Check data
SELECT * FROM dev.bronze.sales LIMIT 5;

Step 2: Verify if Deletion Vectors are enabled

DESCRIBE EXTENDED dev.bronze.sales;

Look at Table Properties:

delta.enableDeletionVectors = true → DV is enabled.
If it’s false, Delta rewrites files on delete.

Step 3: Disable DV and delete a row

ALTER TABLE dev.bronze.sales
SET TBLPROPERTIES ('delta.enableDeletionVectors' = false);

DELETE FROM dev.bronze.sales WHERE invoiceNo = '5464';

-- Check table history
DESCRIBE HISTORY dev.bronze.sales;

📌 You’ll see:

Files removed + new files added → full file rewrite happened.
numDeletionVectorsAdded = 0.

Step 4: Enable DV and delete again

ALTER TABLE dev.bronze.sales
SET TBLPROPERTIES ('delta.enableDeletionVectors' = true);

DELETE FROM dev.bronze.sales WHERE invoiceNo = '5468';

DESCRIBE HISTORY dev.bronze.sales;

📌 You’ll see:

No files rewritten.
numDeletionVectorsAdded > 0 → rows flagged as deleted.

Step 5: Clean flagged rows (optional)

OPTIMIZE dev.bronze.sales;

👉 This physically removes flagged rows from Parquet files during compaction.

✅ Key Takeaways – Deletion Vectors

Faster deletes and updates.
Reduced cost (no frequent rewrites).
Permanent cleanup via OPTIMIZE.

🔹 4. What is Liquid Clustering?

Partitioning and Z-ordering optimize queries but come with issues:

Partitioning → inefficient with too many/too few partitions.
Z-ordering → requires data rewrite for new inserts.

👉 Liquid Clustering is the solution:

Automatically clusters new data based on chosen columns.
No need for expensive rewrites.
Handles skew, high-cardinality, and fast-growing datasets.

📌 Scenarios where Liquid Clustering shines:

Queries with filters on high-cardinality columns (e.g., userId, invoiceId).
Data with imbalanced distributions.
Fast incremental loads (streaming or batch).

🔹 5. How to Enable Liquid Clustering (Examples)

Step 1: Enable on an existing table

ALTER TABLE dev.bronze.sales
CLUSTER BY (invoiceNo);

Check metadata:

DESCRIBE HISTORY dev.bronze.sales;

👉 Shows clusterBy: invoiceNo.

Step 2: Create a new clustered table

CREATE TABLE dev.bronze.sales_cd
CLUSTER BY (invoiceNo)
AS SELECT * FROM dev.bronze.sales;

Check:

DESCRIBE EXTENDED dev.bronze.sales_cd;

👉 Clustering Columns: invoiceNo.

Step 3: Run a query with clustering column

SELECT * 
FROM dev.bronze.sales
WHERE invoiceNo = '5464';

📌 Query runs much faster because Delta prunes files using clustering metadata.

Step 4: Disable clustering if not needed

ALTER TABLE dev.bronze.sales CLUSTER BY NONE;

✅ Key Takeaways – Liquid Clustering

No need for manual ZORDER maintenance.
Handles new incoming data automatically.
Works best on filter/join columns.
Limited: clustering columns must be within the first 32 columns.

🔹 6. Final Recap

Deletion Vectors

Efficient deletes/updates with row-level flags.
Avoid rewriting large files.
Permanent cleanup via OPTIMIZE.

Liquid Clustering

Adaptive clustering → incremental-friendly.
Solves partitioning & Z-order limitations.
Best for high-cardinality columns.

✅ Tutorial Summary

Use Deletion Vectors when you need frequent deletes/updates.
Use Liquid Clustering when your queries filter/join on high-cardinality columns.
Together, they make Delta tables more scalable, efficient, and cost-effective.

Category:

Uncategorized

Databricks: Delta Tables – Deletion Vectors & Liquid Clustering