Delta Live Tables (DLT) Internals & Incremental Load
Part 2: Add/Modify Columns | Rename Tables | Data Lineage
This tutorial walks step by step through advanced Delta Live Tables (DLT) features in Databricks, organized as a structured learning guide with examples you can try.
Introduction
In the previous tutorial, you built your first DLT pipeline with:
- Bronze tables: Raw ingest (orders, customers).
- Silver tables: Join + add audit column.
- Gold tables: Aggregation by market segment.
In this tutorial, you will:
- Explore DLT pipeline internals.
- Run an incremental load using streaming tables.
- Add or modify columns in a Materialized View.
- Use DLT debugging mode.
- Learn how to rename DLT tables.
- Peek into DLT dataset internals.
- Understand how streaming tables handle incremental data.
- Explore lineage in Unity Catalog.
DLT Pipeline Internals
- DLT is a declarative framework. You define transformations; Databricks handles execution.
- Every dataset created (streaming table, materialized view, or view) is tied to the pipeline via a Pipeline ID.
- If you delete the pipeline, all managed datasets are also deleted.
- Each run is called an update, with full logs, metrics, and lineage.
👉 You can inspect pipeline details in the Databricks Workflows → Delta Live Tables UI.
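You can also check the link between a dataset and its pipeline from a notebook. A minimal sketch (run in a regular notebook, not inside the pipeline; the table name follows this tutorial, and the `pipelines.*` property keys are how DLT tags managed tables):

```python
# Sketch: DLT-managed tables carry pipeline-related table properties,
# including the ID of the pipeline that owns them.
props = spark.sql("SHOW TBLPROPERTIES dev.etl.orders_aggregated_gold")
props.filter("key LIKE 'pipelines%'").show(truncate=False)
```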
Incremental Load Using DLT
Streaming tables in DLT process only new data since the last checkpoint.
Example — Append incremental data
```sql
-- Insert 10,000 new rows into the raw orders table
INSERT INTO dev.bronze.orders_raw
SELECT * FROM dev.samples_tpch.orders TABLESAMPLE (1 PERCENT) LIMIT 10000;
```
When you rerun the pipeline:
- orders_bronze (streaming) reads only 10k new rows.
- orders_silver and orders_aggregated_gold update accordingly.
- Old data is not reprocessed.
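A quick way to verify the incremental behavior (a sketch; table names assume the `dev.etl` target schema used later in this tutorial):

```python
# Sketch: compare row counts around a pipeline update to confirm that only
# the newly inserted rows were processed.
before = spark.table("dev.etl.orders_bronze").count()
# ...insert ~10,000 rows into dev.bronze.orders_raw and trigger an update...
after = spark.table("dev.etl.orders_bronze").count()
print(f"Rows added by this update: {after - before}")  # expect ~10,000
```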
Add/Modify Columns in Materialized Views
Suppose the business asks for the sum of total price in addition to the order count.
Python DLT
```python
import dlt
from pyspark.sql.functions import count as f_count, sum as f_sum

@dlt.table(
    name="orders_aggregated_gold",
    table_properties={"quality": "gold"},
    comment="Aggregation by market segment with sum of total price"
)
def orders_aggregated_gold():
    df = dlt.read("orders_silver")
    return (df.groupBy("c_mktsegment")
              .agg(
                  f_count("o_orderkey").alias("count_orders"),
                  f_sum("o_totalprice").alias("sum_totalprice")
              ))
```
👉 DLT automatically alters the schema of the table. No manual DDL needed.
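You can confirm the evolved schema after the next update. A one-line sketch (the full table name assumes this tutorial's catalog and schema):

```python
# Sketch: the new sum_totalprice column should appear after the update.
spark.table("dev.etl.orders_aggregated_gold").printSchema()
```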
Debugging Mode in DLT
- Use Development Mode for debugging.
- Attach the pipeline to the notebook (`Connect to Pipeline`).
- Run Validate to catch errors (e.g., missing imports).
- Run Start directly from the notebook to see the DLT graph inline.
Example
If you forget to import `sum`, DLT will fail validation and show the error in the event logs. Fix it by adding the import:

```python
from pyspark.sql.functions import sum as f_sum
```
Rename DLT Tables
DLT manages the table lifecycle automatically. To rename a table, change its `name` in code:
```python
@dlt.table(
    name="orders_silver",  # changed from "joined_silver"
    table_properties={"quality": "silver"},
    comment="Renamed from joined_silver"
)
def orders_silver():
    return dlt.read("join_v")
```
- Update all downstream references (e.g., Gold); see the sketch below.
- Re-run the pipeline → DLT creates `orders_silver` and removes `joined_silver`.
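For the downstream update, only the `dlt.read()` argument changes in the Gold definition. A minimal sketch:

```python
@dlt.table(
    name="orders_aggregated_gold",
    table_properties={"quality": "gold"}
)
def orders_aggregated_gold():
    # was: dlt.read("joined_silver")
    return dlt.read("orders_silver")
```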
Internals of DLT Datasets
Behind the scenes:
- Managed datasets are Delta tables tied to your pipeline.
- Each dataset has a Pipeline ID in Unity Catalog → Details tab.
- Databricks creates internal catalogs and schemas (hidden) to store underlying physical tables and checkpoints.
- Streaming tables store state in checkpoint directories, ensuring exactly-once semantics.
How Streaming Tables Handle Incremental Data
- Streaming tables = Structured Streaming under the hood.
- Checkpoints track what has been processed.
- Only new rows since the last checkpoint are ingested.
- Ensures exactly-once processing.
👉 You can inspect the checkpoint location in your storage container (`_delta_log` + `checkpoints` directories).
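The same mechanism is visible in plain Structured Streaming (a sketch, not DLT code; the checkpoint path and target table here are hypothetical):

```python
# What DLT manages for you, written by hand: the checkpointLocation records
# processed offsets, so each run ingests only rows added since the last
# checkpoint, giving exactly-once delivery into the Delta sink.
(spark.readStream.table("dev.bronze.orders_raw")
    .writeStream
    .option("checkpointLocation", "/tmp/demo/orders_ckpt")  # hypothetical path
    .trigger(availableNow=True)   # process all new data, then stop
    .toTable("dev.etl.orders_demo_copy"))  # hypothetical target table
```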
Lineage in Unity Catalog
Unity Catalog tracks table and column lineage automatically.
Example
- Open `dev.etl.orders_aggregated_gold` in Unity Catalog.
- Click Lineage → see the lineage graph.
- Expand upstream: `orders_silver` ← `orders_bronze` + `customer_bronze` ← `orders_raw`, `customer_raw`.
- Column-level lineage: click on `count_orders` to trace back to `o_orderkey`.
👉 Works for DLT and non-DLT tables.
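Lineage can also be queried programmatically via Unity Catalog system tables. A sketch (assumes the `system.access` schema is enabled on your workspace and you have permission to read it):

```python
# Sketch: table-level lineage events for the Gold table, most recent first.
spark.sql("""
    SELECT source_table_full_name, target_table_full_name, event_time
    FROM system.access.table_lineage
    WHERE target_table_full_name = 'dev.etl.orders_aggregated_gold'
    ORDER BY event_time DESC
""").show(truncate=False)
```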
✅ Key Takeaways
- DLT internals: Pipeline ID ties datasets to pipelines.
- Incremental load: Streaming tables read only new data via checkpoints.
- Schema evolution: Add/modify columns declaratively in code.
- Debugging: Use Development mode and Validate.
- Renaming: Update `name` in code → lifecycle handled automatically.
- Internals: DLT uses hidden internal schemas and checkpointing.
- Lineage: Unity Catalog provides visual lineage across tables and columns.