Databricks Workflows: Jobs, Tasks, Passing Values, If/Else, Re-runs, and Loops


1. 🔹 Introduction to Workflows

  • Databricks Workflow (Job) = a pipeline of tasks (notebooks, scripts, SQL, pipelines, etc.).
  • Use cases: ETL orchestration, data quality checks, ML pipelines, conditional branching.
  • Each job = multiple tasks with dependencies, parameters, retries, schedules, etc.

2. 🔹 Jobs UI Overview

When creating a job:

  • Task types: Notebook, Python script, Wheel, JAR, SQL, dbt, Spark submit, If/Else, For Each.
  • Cluster: Use job clusters (terminate after run) or all-purpose clusters.
  • Parameters: Pass values via widgets (dbutils.widgets.get() in notebooks).
  • Notifications: Configure success/failure emails or alerts.
  • Retries & Timeouts: Control job resiliency.
  • Schedule/Trigger: Run once, on schedule, or triggered by events.
  • Permissions: Control who can run/edit/manage jobs.
  • Advanced: Queueing, max concurrent runs.
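
Most of these settings map directly onto fields in the Jobs API. Below is a minimal sketch of creating such a job programmatically; the workspace URL, token, notebook path, and cluster spec are placeholder assumptions, and the field names follow the Jobs 2.1 API:

import requests

HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder workspace URL
TOKEN = "<personal-access-token>"                       # placeholder token

job_spec = {
    "name": "process_employee_data",
    "max_concurrent_runs": 1,
    "schedule": {  # run daily at 06:00 UTC
        "quartz_cron_expression": "0 0 6 * * ?",
        "timezone_id": "UTC",
    },
    "email_notifications": {"on_failure": ["data-team@example.com"]},
    "tasks": [
        {
            "task_key": "01_set_day",
            "notebook_task": {
                "notebook_path": "/Workflows/01_get_day",  # hypothetical path
                "base_parameters": {"input_date": "2024-10-27T13:00:00"},
            },
            "new_cluster": {  # job cluster: created for the run, terminated afterwards
                "spark_version": "14.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 1,
            },
            "max_retries": 1,
            "timeout_seconds": 3600,
        }
    ],
}

resp = requests.post(f"{HOST}/api/2.1/jobs/create",
                     headers={"Authorization": f"Bearer {TOKEN}"},
                     json=job_spec)
print(resp.json())  # returns the new job_id on success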

3. 🔹 Creating a Job (Example: Process Employee Data)

Workflow:

  1. Get Day (extract current day name from date).
  2. Check if Sunday (If/Else branch).
  3. If True → Process data by department.
  4. If False → Print “Not Sunday”.

Notebook Setup

  • Notebook 1: Get Day (extracts the day name from the input date and sets a task value):

    dbutils.widgets.text("input_date", "")
    input_date = dbutils.widgets.get("input_date")

    # Get the full day name (e.g. "Sunday") from the input timestamp
    input_day = spark.sql(f"""
        SELECT date_format(to_timestamp('{input_date}', "yyyy-MM-dd'T'HH:mm:ss"), 'EEEE') AS day
    """).collect()[0]["day"]

    # Publish the value for downstream tasks
    dbutils.jobs.taskValues.set(key="input_day", value=input_day)
  • Notebook 2: Process Data (department-based ETL; a sketch follows this list).
  • Notebook 3: Else branch (just prints the day):

    input_day = dbutils.jobs.taskValues.get(taskKey="01_set_day", key="input_day")
    print(f"Today is {input_day}, skipping processing.")
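
A minimal sketch of Notebook 2, the department-based ETL referenced above. It runs inside a Databricks notebook (so spark and dbutils are predefined); the table names employees and employees_by_department_<dept> are hypothetical placeholders, and the department widget matches the For Each example in section 8:

# Notebook 2: department-based ETL (table names are placeholders)
dbutils.widgets.text("department", "")
department = dbutils.widgets.get("department")

# Filter the source table to one department and write an aggregate table
df = spark.table("employees").filter(f"department = '{department}'")

(df.groupBy("department")
   .count()
   .write.mode("overwrite")
   .saveAsTable(f"employees_by_department_{department}"))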

4. 🔹 Passing Values Between Tasks

  • Use dbutils.jobs.taskValues.set() in producer task.
  • Retrieve with dbutils.jobs.taskValues.get(taskKey, key) in consumer task.

✅ Example: Pass the day name (e.g. “Sunday”) from Notebook 1 → the If/Else task.
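
Both sides in one place. The task key 01_set_day matches the example above; default and debugValue are optional arguments that make the consumer notebook usable outside a job run:

# Producer task (e.g. 01_set_day)
dbutils.jobs.taskValues.set(key="input_day", value="Sunday")

# Consumer task: default is returned if the key was never set,
# debugValue is used when the notebook runs interactively (outside a job)
input_day = dbutils.jobs.taskValues.get(
    taskKey="01_set_day",
    key="input_day",
    default="Unknown",
    debugValue="Sunday",
)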


5. 🔹 Conditional (If/Else) Tasks

  • Add If/Else task in Workflow.
  • Condition:
    • Left value = the input_day task value set by Notebook 1.
    • Operator = equals, right value = “Sunday”.
  • True branch → Run process notebook.
  • False branch → Run else notebook.
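
The same branch can be expressed in the Jobs API as a condition_task plus outcome-specific dependencies. A sketch in Python dict form; the task keys and notebook paths are assumptions, and {{tasks.01_set_day.values.input_day}} is the dynamic reference to the value set by Notebook 1:

# Fragment of a Jobs 2.1 "tasks" list (task keys and paths assumed)
condition_and_branches = [
    {
        "task_key": "02_check_sunday",
        "depends_on": [{"task_key": "01_set_day"}],
        "condition_task": {
            "op": "EQUAL_TO",
            "left": "{{tasks.01_set_day.values.input_day}}",
            "right": "Sunday",
        },
    },
    {   # True branch
        "task_key": "03_process_data",
        "depends_on": [{"task_key": "02_check_sunday", "outcome": "true"}],
        "notebook_task": {"notebook_path": "/Workflows/02_process_data"},
    },
    {   # False branch
        "task_key": "04_not_sunday",
        "depends_on": [{"task_key": "02_check_sunday", "outcome": "false"}],
        "notebook_task": {"notebook_path": "/Workflows/03_else_branch"},
    },
]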

6. 🔹 Re-run Failed Jobs

  • Go to Run history → Repair run.
  • Select failed tasks → re-execute only those.
  • Saves time (no need to rerun entire pipeline).
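
Repair runs can also be triggered through the Jobs API. A sketch assuming a placeholder workspace, run ID, and task key:

import requests

HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                       # placeholder

resp = requests.post(
    f"{HOST}/api/2.1/jobs/runs/repair",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "run_id": 123456,                    # failed run to repair (placeholder)
        "rerun_tasks": ["03_process_data"],  # only the failed task keys
    },
)
print(resp.json())  # returns a repair_id on success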

7. 🔹 Override Parameters at Runtime

  • Use Run with different parameters in UI.
  • Example: Override input_date to "2024-10-27T13:00:00" → the day evaluates to Sunday (2024-10-27 falls on a Sunday), so the True branch runs.
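
The same override works outside the UI via the run-now endpoint. A sketch with placeholder workspace and job ID; notebook_params overrides the widget value for this run only:

import requests

HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                       # placeholder

resp = requests.post(
    f"{HOST}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "job_id": 987654,  # placeholder job ID
        "notebook_params": {"input_date": "2024-10-27T13:00:00"},  # override for this run
    },
)
print(resp.json())  # returns the triggered run_id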

8. 🔹 For Each Loop in Workflows

  • Wrap a task (e.g., Process Department Data) inside a For Each loop.
  • Provide an array of values (static or dynamic), e.g. ["sales", "office"].
  • Each loop iteration passes one value → notebook parameter.

Example notebook parameter setup:

dbutils.widgets.text("department", "")
department = dbutils.widgets.get("department")

print(f"Processing department: {department}")

💡 This runs the same task once per department, either sequentially or in parallel up to the configured concurrency.
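
In the Jobs API the loop is a for_each_task that wraps a nested task; the current array element is referenced with {{input}}. A sketch in Python dict form with assumed task keys and notebook path:

# Fragment of a Jobs 2.1 task definition (names assumed)
loop_task = {
    "task_key": "process_departments",
    "for_each_task": {
        "inputs": '["sales", "office"]',   # static JSON array; can also reference a task value
        "concurrency": 2,                  # run up to 2 iterations in parallel
        "task": {
            "task_key": "process_one_department",
            "notebook_task": {
                "notebook_path": "/Workflows/02_process_data",
                "base_parameters": {"department": "{{input}}"},  # current element
            },
        },
    },
}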


9. 🔹 Best Practices

  • ✅ Always use job clusters (auto-terminate) → cost saving.
  • ✅ Centralize parameters at job level, override at task level when needed.
  • ✅ Use taskValues for cross-task communication.
  • ✅ Use If/Else for conditional ETL or SLA workflows.
  • ✅ Use For Each for department-wise ETL, multi-source ingestion, or model training per dataset.
  • ✅ Leverage repair runs instead of restarting full pipelines.

10. 🔹 Summary

  • Jobs orchestrate pipelines.
  • Tasks define execution units (Notebook, Python, SQL, etc.).
  • Parameters & TaskValues allow passing dynamic values.
  • If/Else = branch logic.
  • For Each = loop logic.
  • Repair Runs = selective reruns.
  • Override Params = test/debug flexibility.

This makes Databricks Workflows a lightweight orchestrator (similar to Airflow but native inside Databricks).