Databricks: hands-on tutorial for DLT Data Quality & Expectations

This is a complete, hands-on tutorial for DLT Data Quality & Expectations: how to define rules, use warning / fail / drop actions, and monitor a DLT pipeline with SQL for observability. The examples are organized as a step-by-step walkthrough you can paste into docs, slides, or a blog.


Introduction

In Delta Live Tables (DLT), Expectations are declarative data-quality rules you apply to each record as it flows through your pipeline. You can validate schemas, ranges, enumerations, nullability, referential integrity, and more. DLT then:

  • Counts how many rows passed/failed.
  • Decides per rule whether to allow, drop, or fail the pipeline.
  • Surfaces metrics and event logs you can query and visualize.

We’ll build a small pipeline with Orders and Customers to demonstrate:

  • Defining rules once (as a Python dict) and reusing them.
  • Applying rules in warning, fail, and drop modes.
  • Monitoring DLT event logs with SQL to track DQ failures over time.

Pre-reqs (quick):

  1. Create a DLT pipeline in Databricks; set Product edition to Advanced (Expectations need Advanced).
  2. Attach to a cluster policy that allows DLT (or use serverless DLT, if enabled).
  3. Use a Notebook language: Python for the examples below (SQL equivalents included later).

What are Expectations in Databricks DLT?

Expectation = named predicate + action.

  • Name: unique within the dataset (table/view)
  • Predicate: a boolean SQL expression (e.g., order_price > 0)
  • Action (per rule):
    • Warning (default): log failures; do not block or drop rows
    • Drop: exclude failing rows from the output
    • Fail: stop the pipeline (task fails)

Expectations can be attached to:

  • DLT Tables (@dlt.table)
  • DLT Views (@dlt.view)
  • Joined/derived datasets (so you can validate post-join quality, too)
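Before wiring these into a pipeline, the three actions can be illustrated with a small framework-free Python sketch. Plain dicts stand in for rows and the behavior mirrors DLT's warn/drop/fail semantics; this is not DLT code, just an illustration:

```python
# Framework-free sketch of DLT's three expectation actions (not DLT code).
# Each rule is a name plus a predicate over a row (here a Python dict).
rules = {
    "valid_order_status": lambda r: r["order_status"] in ("O", "F", "P"),
    "valid_order_price":  lambda r: r["order_price"] > 0,
}

rows = [
    {"order_status": "O",  "order_price": 125.0},   # passes both rules
    {"order_status": "NA", "order_price": 30.0},    # fails status
]

def apply_expectations(rows, rules, action="warn"):
    kept, failures = [], 0
    for row in rows:
        ok = all(pred(row) for pred in rules.values())
        if not ok:
            failures += 1
            if action == "fail":
                raise RuntimeError("expectation violated; update stops")
            if action == "drop":
                continue                  # exclude the failing row from output
        kept.append(row)                  # warn keeps every row
    return kept, failures

kept_warn, fails = apply_expectations(rows, rules, "warn")   # keeps both rows, logs 1 failure
kept_drop, _     = apply_expectations(rows, rules, "drop")   # keeps only the valid row
```

In real DLT, the counting and the warn/drop/fail decision happen inside the runtime; you only declare the rules and the action.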

How to define rules for Expectations in DLT?

Below, we’ll define reusable rule sets as Python dictionaries, then attach them with DLT decorators.

Example datasets (conceptual)

  • Orders: order_key, order_status in ('O','F','P'), order_price > 0
  • Customers: cust_id, mkt_segment NOT NULL

You can source data from Auto Loader, streams, or tables. For clarity, we’ll read from bronze tables (or views) you maintain in your setup notebook.

Python: rule dictionaries

import dlt
from pyspark.sql import functions as F

# Reusable rule sets
order_rules = {
  "valid_order_status": "order_status IN ('O','F','P')",
  "valid_order_price":  "order_price > 0"
}

customer_rules = {
  "valid_mkt_segment": "mkt_segment IS NOT NULL"
}

Tips

  • Keep rule names snake_case and descriptive.
  • Expressions are SQL strings (not Python boolean expressions).
  • You can define one or many rules per dataset.
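Because rule names must be unique per dataset, a small helper can merge reusable rule sets safely. Note that combine_rules is a hypothetical convenience function, not a DLT API:

```python
# Hypothetical helper (not a DLT API): merge reusable rule dicts into one
# per-dataset set, refusing conflicting definitions for the same rule name.
def combine_rules(*rule_sets):
    combined = {}
    for rules in rule_sets:
        for name, predicate in rules.items():
            if name in combined and combined[name] != predicate:
                raise ValueError(f"conflicting predicates for rule: {name}")
            combined[name] = predicate
    return combined

order_rules = {
    "valid_order_status": "order_status IN ('O','F','P')",
    "valid_order_price":  "order_price > 0",
}
shared_rules = {"valid_order_key": "order_key IS NOT NULL"}

all_order_rules = combine_rules(order_rules, shared_rules)
# The merged dict can be attached directly: @dlt.expect_all(all_order_rules)
```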

Expectation action: Warning

Warning is the default behavior when you attach expectations with @dlt.expect or @dlt.expect_all. DLT records failures in metrics and event logs, but keeps the rows.

Python: apply rules in warning mode

@dlt.table(name="orders_bronze")
@dlt.expect_all(order_rules)         # default = WARNING (aka "allow")
def orders_bronze():
    # Replace this with your real source (e.g., cloud_files or a table)
    return spark.table("source.orders_raw")

@dlt.view(name="customers_v")
@dlt.expect_all(customer_rules)      # default = WARNING
def customers_v():
    return spark.table("source.customers_raw")

What happens at runtime?

  • DLT counts failures for each rule and shows them under the dataset’s Data quality tab.
  • Rows still land in orders_bronze / are visible in downstream datasets.

Use case: Profiling a new source without disrupting downstream consumers. Warning helps you discover issues first.


Expectation action: Fail

Use fail when any failing row should stop your pipeline. Great for strict contracts and mission-critical pipelines.

Python: fail per rule set

@dlt.table(name="orders_bronze_fail_stop")
@dlt.expect_all_or_fail(order_rules)  # if any rule fails => task fails
def orders_bronze_fail_stop():
    return spark.table("source.orders_raw")

Alternative: per-rule fail

@dlt.table(name="orders_bronze_fail_rule")
@dlt.expect_or_fail("valid_order_status", "order_status IN ('O','F','P')")
@dlt.expect("valid_order_price", "order_price > 0")  # warning for this one
def orders_bronze_fail_rule():
    return spark.table("source.orders_raw")

Use case: SLA-backed pipelines where invalid data must never progress.


Expectation action: Drop

Use drop when you want to filter out bad rows but keep the pipeline running. This is common when handling schema drift or known edge cases.

Python: drop per rule set

@dlt.table(name="orders_bronze_dropped")
@dlt.expect_all_or_drop(order_rules)   # failing rows are excluded
def orders_bronze_dropped():
    return spark.table("source.orders_raw")

Alternative: per-rule drop

@dlt.table(name="orders_bronze_mixed")
@dlt.expect_or_drop("valid_order_status", "order_status IN ('O','F','P')")
@dlt.expect("valid_order_price", "order_price > 0")  # warning for price
def orders_bronze_mixed():
    return spark.table("source.orders_raw")

Use case: allow the pipeline to continue while quarantining bad records (e.g., route them later for remediation).
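One way to implement the quarantine idea is to negate the conjunction of your rule predicates and route matching rows to a side table. quarantine_predicate below is a hypothetical helper, not a DLT API; the string it builds can be used in a filter/WHERE for a quarantine dataset:

```python
# Hypothetical helper (not a DLT API): build a SQL predicate matching rows
# that fail AT LEAST ONE rule, by negating the conjunction of all predicates.
def quarantine_predicate(rules):
    conjunction = " AND ".join(f"({p})" for p in rules.values())
    return f"NOT ({conjunction})"

order_rules = {
    "valid_order_status": "order_status IN ('O','F','P')",
    "valid_order_price":  "order_price > 0",
}

pred = quarantine_predicate(order_rules)
# NOT ((order_status IN ('O','F','P')) AND (order_price > 0))
```

In a DLT notebook, a quarantine table could then select from the same bronze source with `df.where(pred)`, mirroring the dropped rows for later remediation.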


Applying expectations on joins / views

You can attach expectations after transformations (e.g., joins), which validates post-join quality:

@dlt.view(name="orders_customers_join_v")
@dlt.expect_all({
  "post_join_status": "order_status IN ('O','F','P')",
  "post_join_price":  "order_price > 0",
  "post_join_segment": "mkt_segment IS NOT NULL"
})
def orders_customers_join_v():
    orders = dlt.read("orders_bronze_dropped")
    custs  = dlt.read("customers_v")          # batch read; a static left join avoids unsupported stream-static join shapes
    return (orders.alias("o")
                  .join(custs.alias("c"), F.col("o.cust_id")==F.col("c.cust_id"), "left"))

SQL equivalents (DLT SQL)

If you prefer DLT SQL in the same pipeline:

-- Warning (default)
CREATE OR REFRESH LIVE TABLE orders_bronze_sql
CONSTRAINT valid_order_status EXPECT (order_status IN ('O','F','P'))
CONSTRAINT valid_order_price  EXPECT (order_price > 0)
AS SELECT * FROM source.orders_raw;

-- Fail
CREATE OR REFRESH LIVE TABLE orders_bronze_fail_sql
CONSTRAINT valid_order_status EXPECT (order_status IN ('O','F','P')) ON VIOLATION FAIL UPDATE
CONSTRAINT valid_order_price  EXPECT (order_price > 0)               ON VIOLATION FAIL UPDATE
AS SELECT * FROM source.orders_raw;

-- Drop
CREATE OR REFRESH LIVE TABLE orders_bronze_drop_sql
CONSTRAINT valid_order_status EXPECT (order_status IN ('O','F','P')) ON VIOLATION DROP ROW
CONSTRAINT valid_order_price  EXPECT (order_price > 0)               ON VIOLATION DROP ROW
AS SELECT * FROM source.orders_raw;

Notes:

  • The SQL syntax uses CONSTRAINT name EXPECT (predicate) with ON VIOLATION DROP ROW | FAIL UPDATE.
  • If you omit ON VIOLATION, it behaves like warning (allow).

Inserting test records (quick setup idea)

In a separate setup notebook, you can seed edge cases to see each action:

-- example seed data; in practice, your bronze comes from Autoloader or ingestion jobs
CREATE TABLE IF NOT EXISTS source.orders_raw AS
SELECT * FROM VALUES
  (1001, 'O',  125.00,  1),
  (1002, 'NA',  30.00,  2),     -- invalid status
  (1003, 'P',  -10.00,  3),     -- invalid price
  (1004, NULL,  99.99,  4)      -- invalid status (NULL)
AS T(order_key, order_status, order_price, cust_id);

CREATE TABLE IF NOT EXISTS source.customers_raw AS
SELECT * FROM VALUES
  (1, 'SMALL BIZ'),
  (2, NULL),                    -- invalid segment
  (3, 'ENTERPRISE'),
  (4, 'CONSUMER')
AS T(cust_id, mkt_segment);

Re-run the pipeline to observe warning/fail/drop effects.
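With these seeds and the order_rules above, you can predict what each action does before running anything. A quick framework-free check (plain tuples, not Spark rows):

```python
# The four seeded orders as plain tuples: (order_key, order_status, order_price, cust_id).
orders = [
    (1001, "O",  125.00, 1),
    (1002, "NA",  30.00, 2),   # invalid status
    (1003, "P",  -10.00, 3),   # invalid price
    (1004, None,  99.99, 4),   # invalid status (NULL)
]

def passes(order):
    _, status, price, _ = order
    return status in ("O", "F", "P") and price > 0

kept = [o for o in orders if passes(o)]
# WARNING keeps all 4 rows (3 failures logged); DROP keeps only order 1001;
# FAIL stops the update as soon as a violating row is seen.
```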


How to Monitor DLT pipelines (Observability)

DLT emits rich event logs you can query with SQL to build dashboards (SQL Warehouse, Databricks Dashboards, Lakeview, etc.).

Find the Pipeline ID: open your DLT pipeline → copy the ID from the UI.

1) Raw event log (table-valued function)

-- Replace with your pipeline id
SELECT * FROM event_log('<PIPELINE_ID>');

2) Create helpful views for reuse

-- Raw view (persisted in your schema)
CREATE OR REPLACE VIEW dlt_event_log_raw AS
SELECT * FROM event_log('<PIPELINE_ID>');

-- Latest update per dataset
CREATE OR REPLACE TEMP VIEW dlt_latest_updates AS
SELECT
  origin.flow_name AS dataset_name,
  MAX(timestamp)   AS last_update_ts
FROM dlt_event_log_raw
WHERE event_type = 'flow_progress'
GROUP BY 1;

3) Data Quality (DQ) metrics over time

DLT records expectation results as a JSON array inside flow_progress events, under details:flow_progress:data_quality:expectations. This query summarizes pass/fail counts by dataset and rule name:

SELECT
  row_expectations.dataset             AS dataset,
  row_expectations.name                AS rule_name,
  SUM(row_expectations.passed_records) AS total_passed,
  SUM(row_expectations.failed_records) AS total_failed,
  MAX(timestamp)                       AS last_seen
FROM (
  SELECT
    timestamp,
    explode(
      from_json(
        details:flow_progress:data_quality:expectations,
        "array<struct<name: string, dataset: string, passed_records: bigint, failed_records: bigint>>"
      )
    ) AS row_expectations
  FROM dlt_event_log_raw
  WHERE event_type = 'flow_progress'
    AND details:flow_progress:data_quality:expectations IS NOT NULL
)
GROUP BY row_expectations.dataset, row_expectations.name
ORDER BY dataset, rule_name;

What you’ll see:

  • passed/failed counts per dataset and rule, with a last_seen timestamp
  • Each rule’s action (warning, drop, fail) is determined by how you attached it in the pipeline code; pair these counts with that knowledge for triage
  • Use this to power SLO dashboards (e.g., “<1% failure over last 7 days”)
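An SLO check over the aggregated counts is simple enough to express directly. This is a hedged sketch: it assumes you feed it the total_passed/total_failed values from the DQ summary query, and the function names are illustrative, not any Databricks API:

```python
# Hedged sketch: evaluate a "<1% failures over the window" SLO from the
# aggregated (total_passed, total_failed) counts the DQ summary returns.
def failure_rate(passed, failed):
    total = passed + failed
    return 0.0 if total == 0 else failed / total

def meets_slo(passed, failed, threshold=0.01):
    return failure_rate(passed, failed) < threshold

meets_slo(9991, 9)      # True  (0.09% failures)
meets_slo(9900, 100)    # False (exactly 1%, not strictly below)
```

The same condition can back a SQL alert (failed/total >= 0.01) instead of Python, if you prefer to keep everything in the warehouse.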

4) Pull the latest failure samples (optional)

DQ metrics report counts, not the failing rows themselves. To peek at failing rows, design your bronze with a quarantine table, or combine DROP with an audit sink:

  • If using DROP, write failing rows to a side sink (custom logic in bronze).
  • Or, add trace columns (ingest time, source file, etc.) to speed triage.

5) Operational health

Summarize task status, update durations, and throughput:

-- Flow runs and statuses
SELECT
  timestamp,
  origin.flow_name                                     AS dataset,
  details:flow_progress:status::string                 AS status,
  details:flow_progress:metrics:num_output_rows::long  AS out_rows
FROM dlt_event_log_raw
WHERE event_type = 'flow_progress'
ORDER BY timestamp DESC;

Build visuals:

  • Bar: failed_records by rule_name over time
  • Line: throughput (rows/sec) vs update
  • KPI: last update latency / time since last success

Practical guidance & patterns

  • Start in Warning → learn the data → tighten rules over time → promote to Drop or Fail where appropriate.
  • Name rules predictably: valid_<col>_<constraint> (valid_order_price_gt_zero).
  • Centralize rules (dicts or SQL macros) to reuse across datasets.
  • Post-join validations catch issues that only appear after enrichment.
  • Quarantine patterns: When using DROP, optionally write failures to a side table for remediation (e.g., with a stream that matches the inverse predicate).
  • Alerting: Schedule SQL alerts (e.g., failed_records > 0) or use Jobs to run health checks and send notifications (email/Webhook/Slack).

Troubleshooting

  • Wrong column name in a rule? DLT will fail early and show it in Event log. Fix the expression and re-run.
  • Nothing shows up in metrics? Ensure the dataset with expectations actually runs (new data or full refresh), and you’re viewing the right pipeline’s event log / catalog.

One-page checklist

  • Edition set to Advanced
  • Rule dicts defined and unit-tested on small slices
  • Expectations applied at bronze, post-join, silver as needed
  • Warning for discovery, Drop to quarantine, Fail for contracts
  • Event log views created; DQ summary query saved
  • Dashboards and alerts wired to your SLOs