Delta Lake keeps improving with features that optimize performance and storage. Two of the most important recent features are:
- Deletion Vectors (DV) β Efficient deletes without rewriting files.
- Liquid Clustering β Automatic clustering for better query performance on high-cardinality data.
Letβs explore both in detail with examples you can run inside Databricks.
πΉ 1. Introduction
- Traditional Delta operations like partitioning and Z-ordering help optimize queries.
- But:
- Partitioning struggles with high-cardinality columns.
- Z-ordering requires rewriting the data each time.
- Two new improvements solve this:
- Deletion Vectors for efficient deletes.
- Liquid Clustering for adaptive clustering without heavy rewrites.
πΉ 2. What are Deletion Vectors?
Normally, when you delete or update a row inside a Delta table:
- The entire Parquet file containing that row must be rewritten.
- If the file has millions of rows, this is slow and costly.
π Deletion Vectors (DV) fix this:
- Instead of rewriting the file, Delta marks deleted rows with a flag.
- On subsequent reads, Delta skips flagged rows.
- Later, when you run OPTIMIZE (maintenance), those flagged rows are permanently removed.
π This significantly reduces I/O, time, and costs.
πΉ 3. How Deletion Vectors Work (Step by Step with Example)
Step 1: Create a Delta table from CSV
-- Read a CSV into a table
CREATE TABLE dev.bronze.sales
AS SELECT * FROM read_files(
'dbfs:/databricks-datasets/retail-org/sales_orders.csv',
format => 'csv',
header => 'true'
);
-- Check data
SELECT * FROM dev.bronze.sales LIMIT 5;
Step 2: Verify if Deletion Vectors are enabled
DESCRIBE EXTENDED dev.bronze.sales;
Look at Table Properties:
delta.enableDeletionVectors = true
β DV is enabled.- If itβs
false
, Delta rewrites files on delete.
Step 3: Disable DV and delete a row
ALTER TABLE dev.bronze.sales
SET TBLPROPERTIES ('delta.enableDeletionVectors' = false);
DELETE FROM dev.bronze.sales WHERE invoiceNo = '5464';
-- Check table history
DESCRIBE HISTORY dev.bronze.sales;
π Youβll see:
- Files removed + new files added β full file rewrite happened.
numDeletionVectorsAdded = 0
.
Step 4: Enable DV and delete again
ALTER TABLE dev.bronze.sales
SET TBLPROPERTIES ('delta.enableDeletionVectors' = true);
DELETE FROM dev.bronze.sales WHERE invoiceNo = '5468';
DESCRIBE HISTORY dev.bronze.sales;
π Youβll see:
- No files rewritten.
numDeletionVectorsAdded > 0
β rows flagged as deleted.
Step 5: Clean flagged rows (optional)
OPTIMIZE dev.bronze.sales;
π This physically removes flagged rows from Parquet files during compaction.
β Key Takeaways β Deletion Vectors
- Faster deletes and updates.
- Reduced cost (no frequent rewrites).
- Permanent cleanup via
OPTIMIZE
.
πΉ 4. What is Liquid Clustering?
Partitioning and Z-ordering optimize queries but come with issues:
- Partitioning β inefficient with too many/too few partitions.
- Z-ordering β requires data rewrite for new inserts.
π Liquid Clustering is the solution:
- Automatically clusters new data based on chosen columns.
- No need for expensive rewrites.
- Handles skew, high-cardinality, and fast-growing datasets.
π Scenarios where Liquid Clustering shines:
- Queries with filters on high-cardinality columns (e.g., userId, invoiceId).
- Data with imbalanced distributions.
- Fast incremental loads (streaming or batch).
πΉ 5. How to Enable Liquid Clustering (Examples)
Step 1: Enable on an existing table
ALTER TABLE dev.bronze.sales
CLUSTER BY (invoiceNo);
Check metadata:
DESCRIBE HISTORY dev.bronze.sales;
π Shows clusterBy: invoiceNo
.
Step 2: Create a new clustered table
CREATE TABLE dev.bronze.sales_cd
CLUSTER BY (invoiceNo)
AS SELECT * FROM dev.bronze.sales;
Check:
DESCRIBE EXTENDED dev.bronze.sales_cd;
π Clustering Columns: invoiceNo
.
Step 3: Run a query with clustering column
SELECT *
FROM dev.bronze.sales
WHERE invoiceNo = '5464';
π Query runs much faster because Delta prunes files using clustering metadata.
Step 4: Disable clustering if not needed
ALTER TABLE dev.bronze.sales CLUSTER BY NONE;
β Key Takeaways β Liquid Clustering
- No need for manual
ZORDER
maintenance. - Handles new incoming data automatically.
- Works best on filter/join columns.
- Limited: clustering columns must be within the first 32 columns.
πΉ 6. Final Recap
Deletion Vectors
- Efficient deletes/updates with row-level flags.
- Avoid rewriting large files.
- Permanent cleanup via
OPTIMIZE
.
Liquid Clustering
- Adaptive clustering β incremental-friendly.
- Solves partitioning & Z-order limitations.
- Best for high-cardinality columns.
β Tutorial Summary
- Use Deletion Vectors when you need frequent deletes/updates.
- Use Liquid Clustering when your queries filter/join on high-cardinality columns.
- Together, they make Delta tables more scalable, efficient, and cost-effective.