Databricks Workflows: Jobs, Tasks, Passing Values, If/Else, Re-runs, and Loops


1. πŸ”Ή Introduction to Workflows

  • Databricks Workflow (Job) = a pipeline of tasks (notebooks, scripts, SQL, pipelines, etc.).
  • Use cases: ETL orchestration, data quality checks, ML pipelines, conditional branching.
  • Each job = multiple tasks with dependencies, parameters, retries, schedules, etc.

2. πŸ”Ή Jobs UI Overview

When creating a job:

  • Task types: Notebook, Python script, Wheel, JAR, SQL, dbt, Spark submit, If/Else, For Each.
  • Cluster: Use job clusters (terminate after run) or all-purpose clusters.
  • Parameters: Pass values via widgets (dbutils.widgets.get() in notebooks).
  • Notifications: Configure success/failure emails or alerts.
  • Retries & Timeouts: Control job resiliency.
  • Schedule/Trigger: Run once, on schedule, or triggered by events.
  • Permissions: Control who can run/edit/manage jobs.
  • Advanced: Queueing, max concurrent runs.
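
The same configuration can also be expressed through the Jobs REST API. Below is a minimal sketch (Jobs API 2.1, called with the requests library) that creates a two-task job with a dependency; the host/token environment variables, notebook paths, task keys, and date value are illustrative assumptions, not values from this example job:

import os
import requests

# Minimal two-task job: 02_process runs only after 01_set_day succeeds
payload = {
    "name": "process_employee_data",
    "tasks": [
        {
            "task_key": "01_set_day",
            "notebook_task": {
                "notebook_path": "/Workspace/etl/01_set_day",
                "base_parameters": {"input_date": "2024-10-27T13:00:00"},
            },
        },
        {
            "task_key": "02_process",
            "depends_on": [{"task_key": "01_set_day"}],
            "notebook_task": {"notebook_path": "/Workspace/etl/02_process"},
        },
    ],
}

resp = requests.post(
    f"{os.environ['DATABRICKS_HOST']}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())  # returns the new job_id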

3. πŸ”Ή Creating a Job (Example: Process Employee Data)

Workflow:

  1. Get Day (extract current day name from date).
  2. Check if Sunday (If/Else branch).
  3. If True β†’ Process data by department.
  4. If False β†’ Print the day and skip processing.

Notebook Setup

  • Notebook 1: Get Day (read the date parameter, derive the day name, and publish it as a task value):

dbutils.widgets.text("input_date", "")
input_date = dbutils.widgets.get("input_date")

# Get day of week; 'EEEE' returns the full name (e.g. "Sunday"),
# which is what the If/Else condition compares against
input_day = spark.sql(f"""
    SELECT date_format(to_timestamp('{input_date}', "yyyy-MM-dd'T'HH:mm:ss"), 'EEEE') AS day
""").collect()[0].day

# Set task value for downstream tasks
dbutils.jobs.taskValues.set(key="input_day", value=input_day)

  • Notebook 2: Process Data (department-based ETL; a sketch follows this list).
  • Notebook 3: Else branch (just print the day):

input_day = dbutils.jobs.taskValues.get(taskKey="01_set_day", key="input_day")
print(f"Today is {input_day}, skipping processing.")
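
Notebook 2 is only named above; here is a minimal sketch of a department-based ETL step, assuming an employees Delta table with a department column (table, column, and target names are illustrative):

dbutils.widgets.text("department", "")
department = dbutils.widgets.get("department")

# Select this department's rows (table/column names are assumptions)
df = spark.table("employees").where(f"department = '{department}'")

# Example transformation: per-department headcount, appended to a summary table
summary = df.groupBy("department").count()
summary.write.mode("append").saveAsTable("department_summary")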

4. πŸ”Ή Passing Values Between Tasks

  • Use dbutils.jobs.taskValues.set() in producer task.
  • Retrieve with dbutils.jobs.taskValues.get(taskKey, key) in consumer task.

βœ… Example: Pass the “Sunday” check from Notebook 1 β†’ If/Else task.
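
A detail worth knowing for development: taskValues.get() also accepts default and debugValue arguments, and debugValue is what the call returns when the notebook runs outside a job, so the consumer notebook can be tested interactively. A minimal sketch:

# Inside a job run, reads the value set by task "01_set_day";
# outside a job (interactive run), returns debugValue instead
input_day = dbutils.jobs.taskValues.get(
    taskKey="01_set_day",
    key="input_day",
    default="Monday",
    debugValue="Sunday",
)
print(input_day)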


5. πŸ”Ή Conditional (If/Else) Tasks

  • Add If/Else task in Workflow.
  • Condition:
    • Value from task output = “Sunday”.
    • Operator = equals.
  • True branch β†’ Run process notebook.
  • False branch β†’ Run else notebook.
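
In the condition, the task value can be referenced with a dynamic value reference (assuming the producer task's key is 01_set_day, as in this example):

{{tasks.01_set_day.values.input_day}} == "Sunday"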

6. πŸ”Ή Re-run Failed Jobs

  • Go to Run history β†’ Repair run.
  • Select failed tasks β†’ re-execute only those.
  • Saves time (no need to rerun entire pipeline).
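
Repair runs can also be triggered through the Jobs REST API (POST /api/2.1/jobs/runs/repair). A minimal sketch, assuming the same host/token environment variables as above and an illustrative run ID and task key:

import os
import requests

# Re-run only the listed failed task(s) within an existing job run
resp = requests.post(
    f"{os.environ['DATABRICKS_HOST']}/api/2.1/jobs/runs/repair",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={"run_id": 123456, "rerun_tasks": ["02_process"]},  # IDs are illustrative
)
resp.raise_for_status()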

7. πŸ”Ή Override Parameters at Runtime

  • Use Run with different parameters in UI.
  • Example: Override input_date to "2024-10-27T13:00:00" β†’ forces the workflow to take the Sunday branch (2024-10-27 is a Sunday).
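
The same override works programmatically via run-now with notebook_params. A minimal sketch (the job ID and credentials are illustrative assumptions):

import os
import requests

# Trigger the job, overriding the input_date widget for this run only
resp = requests.post(
    f"{os.environ['DATABRICKS_HOST']}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={
        "job_id": 123456,
        "notebook_params": {"input_date": "2024-10-27T13:00:00"},
    },
)
resp.raise_for_status()
print(resp.json())  # returns the triggered run_id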

8. πŸ”Ή For Each Loop in Workflows

  • Wrap a task (e.g., Process Department Data) inside a For Each loop.
  • Provide an array of values (static or dynamic), e.g. ["sales", "office"].
  • Each loop iteration passes one value β†’ notebook parameter.

Example notebook parameter setup:

dbutils.widgets.text("department", "")
department = dbutils.widgets.get("department")

print(f"Processing department: {department}")

πŸ’‘ This runs the same task once per value, in parallel or sequentially depending on the configured concurrency. The underlying task definition is sketched below.
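
In the Jobs API, the loop appears as a for_each_task wrapping a nested task, with {{input}} resolving to the current item. A sketch of the task definition (task keys, notebook path, and concurrency are illustrative; inputs is a JSON-encoded string per the API):

# Fragment of a job's task list: run the nested notebook once per department
for_each_task_def = {
    "task_key": "process_departments",
    "for_each_task": {
        "inputs": '["sales", "office"]',
        "concurrency": 2,  # up to 2 iterations in parallel
        "task": {
            "task_key": "process_one_department",
            "notebook_task": {
                "notebook_path": "/Workspace/etl/02_process",
                "base_parameters": {"department": "{{input}}"},
            },
        },
    },
}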


9. πŸ”Ή Best Practices

  • βœ… Always use job clusters (auto-terminate) β†’ cost saving.
  • βœ… Centralize parameters at job level, override at task level when needed.
  • βœ… Use taskValues for cross-task communication.
  • βœ… Use If/Else for conditional ETL or SLA workflows.
  • βœ… Use For Each for department-wise ETL, multi-source ingestion, or model training per dataset.
  • βœ… Leverage repair runs instead of restarting full pipelines.

10. πŸ”Ή Summary

  • Jobs orchestrate pipelines.
  • Tasks define execution units (Notebook, Python, SQL, etc.).
  • Parameters & TaskValues allow passing dynamic values.
  • If/Else = branch logic.
  • For Each = loop logic.
  • Repair Runs = selective reruns.
  • Override Params = test/debug flexibility.

This makes Databricks Workflows a lightweight orchestrator (similar to Airflow but native inside Databricks).

