Delta Lake keeps improving with features that optimize performance and storage. Two of the most important recent features are:
- Deletion Vectors (DV) → Efficient deletes without rewriting files.
- Liquid Clustering → Automatic clustering for better query performance on high-cardinality data.
Let’s explore both in detail with examples you can run inside Databricks.
🔹 1. Introduction
- Traditional Delta operations like partitioning and Z-ordering help optimize queries.
- But:
- Partitioning struggles with high-cardinality columns.
- Z-ordering requires rewriting the data each time.
- Two new improvements solve this:
- Deletion Vectors for efficient deletes.
- Liquid Clustering for adaptive clustering without heavy rewrites.
🔹 2. What are Deletion Vectors?
Normally, when you delete or update a row inside a Delta table:
- The entire Parquet file containing that row must be rewritten.
- If the file has millions of rows, this is slow and costly.
👉 Deletion Vectors (DV) fix this:
- Instead of rewriting the file, Delta marks deleted rows with a flag.
- On subsequent reads, Delta skips flagged rows.
- Later, when you run OPTIMIZE (maintenance), those flagged rows are permanently removed.
📌 This significantly reduces I/O, time, and costs.
🔹 3. How Deletion Vectors Work (Step by Step with Example)
Step 1: Create a Delta table from CSV
-- Read a CSV into a table
CREATE TABLE dev.bronze.sales
AS SELECT * FROM read_files(
'dbfs:/databricks-datasets/retail-org/sales_orders.csv',
format => 'csv',
header => 'true'
);
-- Check data
SELECT * FROM dev.bronze.sales LIMIT 5;
Step 2: Verify if Deletion Vectors are enabled
DESCRIBE EXTENDED dev.bronze.sales;
Look at Table Properties:
delta.enableDeletionVectors = true→ DV is enabled.- If it’s
false, Delta rewrites files on delete.
Step 3: Disable DV and delete a row
ALTER TABLE dev.bronze.sales
SET TBLPROPERTIES ('delta.enableDeletionVectors' = false);
DELETE FROM dev.bronze.sales WHERE invoiceNo = '5464';
-- Check table history
DESCRIBE HISTORY dev.bronze.sales;
📌 You’ll see:
- Files removed + new files added → full file rewrite happened.
numDeletionVectorsAdded = 0.
Step 4: Enable DV and delete again
ALTER TABLE dev.bronze.sales
SET TBLPROPERTIES ('delta.enableDeletionVectors' = true);
DELETE FROM dev.bronze.sales WHERE invoiceNo = '5468';
DESCRIBE HISTORY dev.bronze.sales;
📌 You’ll see:
- No files rewritten.
numDeletionVectorsAdded > 0→ rows flagged as deleted.
Step 5: Clean flagged rows (optional)
OPTIMIZE dev.bronze.sales;
👉 This physically removes flagged rows from Parquet files during compaction.
✅ Key Takeaways – Deletion Vectors
- Faster deletes and updates.
- Reduced cost (no frequent rewrites).
- Permanent cleanup via
OPTIMIZE.
🔹 4. What is Liquid Clustering?
Partitioning and Z-ordering optimize queries but come with issues:
- Partitioning → inefficient with too many/too few partitions.
- Z-ordering → requires data rewrite for new inserts.
👉 Liquid Clustering is the solution:
- Automatically clusters new data based on chosen columns.
- No need for expensive rewrites.
- Handles skew, high-cardinality, and fast-growing datasets.
📌 Scenarios where Liquid Clustering shines:
- Queries with filters on high-cardinality columns (e.g., userId, invoiceId).
- Data with imbalanced distributions.
- Fast incremental loads (streaming or batch).
🔹 5. How to Enable Liquid Clustering (Examples)
Step 1: Enable on an existing table
ALTER TABLE dev.bronze.sales
CLUSTER BY (invoiceNo);
Check metadata:
DESCRIBE HISTORY dev.bronze.sales;
👉 Shows clusterBy: invoiceNo.
Step 2: Create a new clustered table
CREATE TABLE dev.bronze.sales_cd
CLUSTER BY (invoiceNo)
AS SELECT * FROM dev.bronze.sales;
Check:
DESCRIBE EXTENDED dev.bronze.sales_cd;
👉 Clustering Columns: invoiceNo.
Step 3: Run a query with clustering column
SELECT *
FROM dev.bronze.sales
WHERE invoiceNo = '5464';
📌 Query runs much faster because Delta prunes files using clustering metadata.
Step 4: Disable clustering if not needed
ALTER TABLE dev.bronze.sales CLUSTER BY NONE;
✅ Key Takeaways – Liquid Clustering
- No need for manual
ZORDERmaintenance. - Handles new incoming data automatically.
- Works best on filter/join columns.
- Limited: clustering columns must be within the first 32 columns.
🔹 6. Final Recap
Deletion Vectors
- Efficient deletes/updates with row-level flags.
- Avoid rewriting large files.
- Permanent cleanup via
OPTIMIZE.
Liquid Clustering
- Adaptive clustering → incremental-friendly.
- Solves partitioning & Z-order limitations.
- Best for high-cardinality columns.
✅ Tutorial Summary
- Use Deletion Vectors when you need frequent deletes/updates.
- Use Liquid Clustering when your queries filter/join on high-cardinality columns.
- Together, they make Delta tables more scalable, efficient, and cost-effective.