1. 🔹 Introduction to Workflows
- Databricks Workflow (Job) = a pipeline of tasks (notebooks, scripts, SQL, pipelines, etc.).
- Use cases: ETL orchestration, data quality checks, ML pipelines, conditional branching.
- Each job = multiple tasks with dependencies, parameters, retries, schedules, etc.
2. 🔹 Jobs UI Overview
When creating a job:
- Task types: Notebook, Python script, Wheel, JAR, SQL, dbt, Spark submit, If/Else, For Each.
- Cluster: Use job clusters (terminate after run) or all-purpose clusters.
- Parameters: Pass values via widgets (`dbutils.widgets.get()` in notebooks).
- Notifications: Configure success/failure emails or alerts.
- Retries & Timeouts: Control job resiliency.
- Schedule/Trigger: Run once, on schedule, or triggered by events.
- Permissions: Control who can run/edit/manage jobs.
- Advanced: Queueing, max concurrent runs.
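Most of these settings can also be defined programmatically. Below is a minimal sketch using the Databricks SDK for Python (`databricks-sdk`); the job name, notebook path, cluster spec, cron expression, and notification address are illustrative assumptions, not values from the walkthrough.

```python
# Minimal sketch, assuming the databricks-sdk package is installed and workspace
# authentication is configured. All names and sizes below are placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute, jobs

w = WorkspaceClient()

created = w.jobs.create(
    name="process_employee_data",
    max_concurrent_runs=1,
    job_clusters=[
        jobs.JobCluster(
            job_cluster_key="etl_cluster",
            new_cluster=compute.ClusterSpec(
                spark_version="15.4.x-scala2.12",
                node_type_id="i3.xlarge",
                num_workers=2,
            ),
        )
    ],
    tasks=[
        jobs.Task(
            task_key="01_set_day",
            job_cluster_key="etl_cluster",
            notebook_task=jobs.NotebookTask(
                notebook_path="/Workspace/jobs/01_get_day",
                base_parameters={"input_date": ""},
            ),
            max_retries=2,          # retries
            timeout_seconds=3600,   # timeout
        )
    ],
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 6 * * ?",  # daily at 06:00
        timezone_id="UTC",
    ),
    email_notifications=jobs.JobEmailNotifications(on_failure=["you@example.com"]),
)
print(created.job_id)
```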
3. 🔹 Creating a Job (Example: Process Employee Data)
Workflow:
- Get Day (extract current day name from date).
- Check if Sunday (If/Else branch).
- If True → Process data by department.
- If False → Print “Not Sunday”.
Notebook Setup
- Notebook 1: Get Day
```python
dbutils.widgets.text("input_date", "")
input_date = dbutils.widgets.get("input_date")

# Get the day of week as a full name, e.g. "Sunday"
# ('EEEE' gives the full name; 'E' would return the abbreviation "Sun")
input_day = spark.sql(f"""
    SELECT date_format(to_timestamp('{input_date}', "yyyy-MM-dd'T'HH:mm:ss"), 'EEEE') as day
""").collect()[0].day

# Publish the value for downstream tasks
dbutils.jobs.taskValues.set(key="input_day", value=input_day)
```
- Notebook 2: Process Data (department-based ETL; a sketch follows after this list).
- Notebook 3: Else branch (just print the day).

```python
input_day = dbutils.jobs.taskValues.get(taskKey="01_set_day", key="input_day")
print(f"Today is {input_day}, skipping processing.")
```
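A minimal sketch of Notebook 2 (the department-based ETL); the `employees` table, its columns, and the output table name are illustrative assumptions:

```python
# Sketch of Notebook 2: Process Data. The `employees` table, the `department`,
# `job_title`, and `salary` columns, and the output table name are placeholders.
dbutils.widgets.text("department", "")
department = dbutils.widgets.get("department")

df = spark.table("employees").filter(f"department = '{department}'")

# Example transformation: average salary per job title within the department
summary = df.groupBy("job_title").avg("salary")

# One output table per department
summary.write.mode("overwrite").saveAsTable(f"employee_summary_{department}")
```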
4. 🔹 Passing Values Between Tasks
- Use `dbutils.jobs.taskValues.set()` in the producer task.
- Retrieve with `dbutils.jobs.taskValues.get(taskKey, key)` in the consumer task.
✅ Example: Pass “Sunday” check from Notebook 1 → If/Else task.
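A minimal producer/consumer pair, reusing the task key and value name from the example above:

```python
# Producer task (task key "01_set_day" in the example job)
dbutils.jobs.taskValues.set(key="input_day", value="Sunday")

# Consumer task: read the value by the producer's task key.
# `default` is returned if the key is missing; `debugValue` is used when the
# notebook is run interactively outside of a job.
input_day = dbutils.jobs.taskValues.get(
    taskKey="01_set_day",
    key="input_day",
    default="unknown",
    debugValue="Sunday",
)
```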
5. 🔹 Conditional (If/Else) Tasks
- Add If/Else task in Workflow.
- Condition:
- Value from task output = “Sunday”.
- Operator = equals.
- True branch → Run process notebook.
- False branch → Run else notebook.
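In the condition, the upstream value is referenced with a dynamic value reference such as `{{tasks.01_set_day.values.input_day}}` (task key from the example above) and compared against the literal `Sunday` with the equals operator.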
6. 🔹 Re-run Failed Jobs
- Go to Run history → Repair run.
- Select failed tasks → re-execute only those.
- Saves time (no need to rerun entire pipeline).
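Repair runs can also be triggered through the Jobs API; a minimal sketch with the Databricks SDK for Python, where the run ID and task key are placeholders:

```python
# Sketch: repair a failed run, re-executing only the selected tasks.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
w.jobs.repair_run(
    run_id=123456789,                  # placeholder run ID from the run history
    rerun_tasks=["02_process_data"],   # only the failed task(s)
)
```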
7. 🔹 Override Parameters at Runtime
- Use Run with different parameters in UI.
- Example: Override `input_date` to `"2024-10-27T13:00:00"` → forces the workflow to evaluate as Sunday.
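The same override works through the API; a minimal sketch with the Databricks SDK for Python (the job ID is a placeholder):

```python
# Sketch: trigger the job with an overridden notebook parameter.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
waiter = w.jobs.run_now(
    job_id=987654321,  # placeholder job ID
    notebook_params={"input_date": "2024-10-27T13:00:00"},
)
# waiter.result() blocks until the run finishes, if needed
```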
8. 🔹 For Each Loop in Workflows
- Wrap a task (e.g., Process Department Data) inside a For Each loop.
- Provide an array of values (static or dynamic), e.g. `["sales", "office"]`.
- Each loop iteration passes one value → notebook parameter.
Example notebook parameter setup:

```python
dbutils.widgets.text("department", "")
department = dbutils.widgets.get("department")
print(f"Processing department: {department}")
```
💡 This runs the same task once per department, either in parallel or sequentially depending on the loop's concurrency setting.
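The input array can also come from an upstream task instead of being hardcoded; a sketch assuming a hypothetical `get_departments` task and the illustrative `employees` table:

```python
# Hypothetical upstream task (task key "get_departments"): publish the list of
# departments as a task value for the For Each loop to consume.
departments = [
    row.department
    for row in spark.table("employees").select("department").distinct().collect()
]
dbutils.jobs.taskValues.set(key="departments", value=departments)
```

The For Each task's input field can then reference it dynamically, e.g. `{{tasks.get_departments.values.departments}}`.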
9. 🔹 Best Practices
- ✅ Always use job clusters (auto-terminate) → cost saving.
- ✅ Centralize parameters at job level, override at task level when needed.
- ✅ Use taskValues for cross-task communication.
- ✅ Use If/Else for conditional ETL or SLA workflows.
- ✅ Use For Each for department-wise ETL, multi-source ingestion, or model training per dataset.
- ✅ Leverage repair runs instead of restarting full pipelines.
10. 🔹 Summary
- Jobs orchestrate pipelines.
- Tasks define execution units (Notebook, Python, SQL, etc.).
- Parameters & TaskValues allow passing dynamic values.
- If/Else = branch logic.
- For Each = loop logic.
- Repair Runs = selective reruns.
- Override Params = test/debug flexibility.
This makes Databricks Workflows a lightweight orchestrator (similar to Airflow, but native to Databricks).