{"id":785,"date":"2025-08-22T13:33:54","date_gmt":"2025-08-22T13:33:54","guid":{"rendered":"https:\/\/dataopsschool.com\/blog\/?p=785"},"modified":"2025-08-22T13:33:55","modified_gmt":"2025-08-22T13:33:55","slug":"databricks-delta-tables-deletion-vectors-liquid-clustering","status":"publish","type":"post","link":"https:\/\/dataopsschool.com\/blog\/databricks-delta-tables-deletion-vectors-liquid-clustering\/","title":{"rendered":"Databricks: Delta Tables \u2013 Deletion Vectors &amp; Liquid Clustering"},"content":{"rendered":"\n<p><\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\"><\/h1>\n\n\n\n<p>Delta Lake keeps improving with features that <strong>optimize performance and storage<\/strong>. Two of the most important recent features are:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Deletion Vectors (DV)<\/strong> \u2192 Efficient deletes without rewriting files.<\/li>\n\n\n\n<li><strong>Liquid Clustering<\/strong> \u2192 Automatic clustering for better query performance on high-cardinality data.<\/li>\n<\/ol>\n\n\n\n<p>Let\u2019s explore both in detail with <strong>examples you can run inside Databricks<\/strong>.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83d\udd39 1. Introduction<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Traditional Delta operations like <strong>partitioning<\/strong> and <strong>Z-ordering<\/strong> help optimize queries.<\/li>\n\n\n\n<li>But:\n<ul class=\"wp-block-list\">\n<li>Partitioning struggles with <strong>high-cardinality columns<\/strong>.<\/li>\n\n\n\n<li>Z-ordering requires <strong>rewriting the data<\/strong> each time.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Two new improvements solve this:\n<ul class=\"wp-block-list\">\n<li><strong>Deletion Vectors<\/strong> for efficient deletes.<\/li>\n\n\n\n<li><strong>Liquid Clustering<\/strong> for adaptive clustering without heavy rewrites.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83d\udd39 2. What are Deletion Vectors?<\/h2>\n\n\n\n<p>Normally, when you <strong>delete or update<\/strong> a row inside a Delta table:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The entire <strong>Parquet file<\/strong> containing that row must be rewritten.<\/li>\n\n\n\n<li>If the file has <strong>millions of rows<\/strong>, this is slow and costly.<\/li>\n<\/ul>\n\n\n\n<p>\ud83d\udc49 <strong>Deletion Vectors (DV)<\/strong> fix this:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Instead of rewriting the file, Delta <strong>marks deleted rows with a flag<\/strong>.<\/li>\n\n\n\n<li>On subsequent reads, Delta <strong>skips flagged rows<\/strong>.<\/li>\n\n\n\n<li>Later, when you run <strong>OPTIMIZE<\/strong> (maintenance), those flagged rows are permanently removed.<\/li>\n<\/ul>\n\n\n\n<p>\ud83d\udccc This significantly reduces <strong>I\/O, time, and costs<\/strong>.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83d\udd39 3. How Deletion Vectors Work (Step by Step with Example)<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Step 1: Create a Delta table from CSV<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>-- Read a CSV into a table\nCREATE TABLE dev.bronze.sales\nAS SELECT * FROM read_files(\n  'dbfs:\/databricks-datasets\/retail-org\/sales_orders.csv',\n  format =&gt; 'csv',\n  header =&gt; 'true'\n);\n\n-- Check data\nSELECT * FROM dev.bronze.sales LIMIT 5;\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Step 2: Verify if Deletion Vectors are enabled<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>DESCRIBE EXTENDED dev.bronze.sales;\n<\/code><\/pre>\n\n\n\n<p>Look at <strong>Table Properties<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>delta.enableDeletionVectors = true<\/code> \u2192 DV is enabled.<\/li>\n\n\n\n<li>If it\u2019s <code>false<\/code>, Delta rewrites files on delete.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 3: Disable DV and delete a row<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>ALTER TABLE dev.bronze.sales\nSET TBLPROPERTIES ('delta.enableDeletionVectors' = false);\n\nDELETE FROM dev.bronze.sales WHERE invoiceNo = '5464';\n\n-- Check table history\nDESCRIBE HISTORY dev.bronze.sales;\n<\/code><\/pre>\n\n\n\n<p>\ud83d\udccc You\u2019ll see:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Files removed + new files added<\/strong> \u2192 full file rewrite happened.<\/li>\n\n\n\n<li><code>numDeletionVectorsAdded = 0<\/code>.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 4: Enable DV and delete again<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>ALTER TABLE dev.bronze.sales\nSET TBLPROPERTIES ('delta.enableDeletionVectors' = true);\n\nDELETE FROM dev.bronze.sales WHERE invoiceNo = '5468';\n\nDESCRIBE HISTORY dev.bronze.sales;\n<\/code><\/pre>\n\n\n\n<p>\ud83d\udccc You\u2019ll see:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>No files rewritten<\/strong>.<\/li>\n\n\n\n<li><code>numDeletionVectorsAdded > 0<\/code> \u2192 rows flagged as deleted.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 5: Clean flagged rows (optional)<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>OPTIMIZE dev.bronze.sales;\n<\/code><\/pre>\n\n\n\n<p>\ud83d\udc49 This physically removes flagged rows from Parquet files during compaction.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>\u2705 <strong>Key Takeaways \u2013 Deletion Vectors<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Faster deletes and updates.<\/li>\n\n\n\n<li>Reduced cost (no frequent rewrites).<\/li>\n\n\n\n<li>Permanent cleanup via <code>OPTIMIZE<\/code>.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83d\udd39 4. What is Liquid Clustering?<\/h2>\n\n\n\n<p>Partitioning and Z-ordering optimize queries but come with issues:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Partitioning \u2192 inefficient with <strong>too many\/too few partitions<\/strong>.<\/li>\n\n\n\n<li>Z-ordering \u2192 requires <strong>data rewrite<\/strong> for new inserts.<\/li>\n<\/ul>\n\n\n\n<p>\ud83d\udc49 <strong>Liquid Clustering<\/strong> is the solution:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Automatically <strong>clusters new data<\/strong> based on chosen columns.<\/li>\n\n\n\n<li>No need for expensive rewrites.<\/li>\n\n\n\n<li>Handles <strong>skew, high-cardinality, and fast-growing datasets<\/strong>.<\/li>\n<\/ul>\n\n\n\n<p>\ud83d\udccc Scenarios where Liquid Clustering shines:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Queries with <strong>filters on high-cardinality columns<\/strong> (e.g., userId, invoiceId).<\/li>\n\n\n\n<li>Data with <strong>imbalanced distributions<\/strong>.<\/li>\n\n\n\n<li>Fast incremental loads (streaming or batch).<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83d\udd39 5. How to Enable Liquid Clustering (Examples)<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Step 1: Enable on an existing table<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>ALTER TABLE dev.bronze.sales\nCLUSTER BY (invoiceNo);\n<\/code><\/pre>\n\n\n\n<p>Check metadata:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>DESCRIBE HISTORY dev.bronze.sales;\n<\/code><\/pre>\n\n\n\n<p>\ud83d\udc49 Shows <code>clusterBy: invoiceNo<\/code>.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 2: Create a new clustered table<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>CREATE TABLE dev.bronze.sales_cd\nCLUSTER BY (invoiceNo)\nAS SELECT * FROM dev.bronze.sales;\n<\/code><\/pre>\n\n\n\n<p>Check:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>DESCRIBE EXTENDED dev.bronze.sales_cd;\n<\/code><\/pre>\n\n\n\n<p>\ud83d\udc49 <code>Clustering Columns: invoiceNo<\/code>.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 3: Run a query with clustering column<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>SELECT * \nFROM dev.bronze.sales\nWHERE invoiceNo = '5464';\n<\/code><\/pre>\n\n\n\n<p>\ud83d\udccc Query runs <strong>much faster<\/strong> because Delta prunes files using clustering metadata.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 4: Disable clustering if not needed<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>ALTER TABLE dev.bronze.sales CLUSTER BY NONE;\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>\u2705 <strong>Key Takeaways \u2013 Liquid Clustering<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>No need for manual <code>ZORDER<\/code> maintenance.<\/li>\n\n\n\n<li>Handles new incoming data <strong>automatically<\/strong>.<\/li>\n\n\n\n<li>Works best on <strong>filter\/join columns<\/strong>.<\/li>\n\n\n\n<li>Limited: clustering columns must be within the <strong>first 32 columns<\/strong>.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83d\udd39 6. Final Recap<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Deletion Vectors<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Efficient deletes\/updates with <strong>row-level flags<\/strong>.<\/li>\n\n\n\n<li>Avoid rewriting large files.<\/li>\n\n\n\n<li>Permanent cleanup via <code>OPTIMIZE<\/code>.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Liquid Clustering<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Adaptive clustering \u2192 <strong>incremental-friendly<\/strong>.<\/li>\n\n\n\n<li>Solves partitioning &amp; Z-order limitations.<\/li>\n\n\n\n<li>Best for <strong>high-cardinality columns<\/strong>.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">\u2705 Tutorial Summary<\/h1>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use <strong>Deletion Vectors<\/strong> when you need frequent deletes\/updates.<\/li>\n\n\n\n<li>Use <strong>Liquid Clustering<\/strong> when your queries filter\/join on high-cardinality columns.<\/li>\n\n\n\n<li>Together, they make Delta tables more <strong>scalable, efficient, and cost-effective<\/strong>.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p><\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Delta Lake keeps improving with features that optimize performance and storage. Two of the most important recent features are: Let\u2019s explore both in detail with examples you&#8230; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-785","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/785","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/comments?post=785"}],"version-history":[{"count":1,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/785\/revisions"}],"predecessor-version":[{"id":786,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/785\/revisions\/786"}],"wp:attachment":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/media?parent=785"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/categories?post=785"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/tags?post=785"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}