1. 🔹 Introduction to Workflows
- Databricks Workflow (Job) = a pipeline of tasks (notebooks, scripts, SQL, pipelines, etc.).
- Use cases: ETL orchestration, data quality checks, ML pipelines, conditional branching.
- Each job = multiple tasks with dependencies, parameters, retries, schedules, etc.
2. 🔹 Jobs UI Overview
When creating a job:
- Task types: Notebook, Python script, Wheel, JAR, SQL, dbt, Spark submit, If/Else, For Each.
- Cluster: Use job clusters (terminate after run) or all-purpose clusters.
- Parameters: Pass values via widgets (`dbutils.widgets.get()` in notebooks).
- Notifications: Configure success/failure emails or alerts.
- Retries & Timeouts: Control job resiliency.
- Schedule/Trigger: Run once, on schedule, or triggered by events.
- Permissions: Control who can run/edit/manage jobs.
- Advanced: Queueing, max concurrent runs.
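Most of these settings can also be defined programmatically. Below is a minimal sketch using the Databricks SDK for Python (`databricks-sdk`); the job name, notebook path, cluster spec, cron expression, and notification address are illustrative assumptions, not values from the walkthrough.

```python
# Minimal sketch, assuming the databricks-sdk package is installed and workspace
# authentication is configured. All names and sizes below are placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute, jobs

w = WorkspaceClient()

created = w.jobs.create(
    name="process_employee_data",
    max_concurrent_runs=1,
    job_clusters=[
        jobs.JobCluster(
            job_cluster_key="etl_cluster",
            new_cluster=compute.ClusterSpec(
                spark_version="15.4.x-scala2.12",
                node_type_id="i3.xlarge",
                num_workers=2,
            ),
        )
    ],
    tasks=[
        jobs.Task(
            task_key="01_set_day",
            job_cluster_key="etl_cluster",
            notebook_task=jobs.NotebookTask(
                notebook_path="/Workspace/jobs/01_get_day",
                base_parameters={"input_date": ""},
            ),
            max_retries=2,          # retries
            timeout_seconds=3600,   # timeout
        )
    ],
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 6 * * ?",  # daily at 06:00
        timezone_id="UTC",
    ),
    email_notifications=jobs.JobEmailNotifications(on_failure=["you@example.com"]),
)
print(created.job_id)
```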
3. 🔹 Creating a Job (Example: Process Employee Data)
Workflow:
- Get Day (extract current day name from date).
- Check if Sunday (If/Else branch).
- If True → Process data by department.
- If False → Print “Not Sunday”.
Notebook Setup
- Notebook 1: Get Day
```python
dbutils.widgets.text("input_date", "")
input_date = dbutils.widgets.get("input_date")

# Get the day of week as a full name, e.g. "Sunday"
# ('EEEE' gives the full name; 'E' would return the abbreviation "Sun")
input_day = spark.sql(f"""
    SELECT date_format(to_timestamp('{input_date}', "yyyy-MM-dd'T'HH:mm:ss"), 'EEEE') as day
""").collect()[0].day

# Publish the value for downstream tasks
dbutils.jobs.taskValues.set(key="input_day", value=input_day)
```
- Notebook 2: Process Data (department-based ETL; a sketch follows after this list).
- Notebook 3: Else branch (just print the day).

```python
input_day = dbutils.jobs.taskValues.get(taskKey="01_set_day", key="input_day")
print(f"Today is {input_day}, skipping processing.")
```
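A minimal sketch of Notebook 2 (the department-based ETL); the `employees` table, its columns, and the output table name are illustrative assumptions:

```python
# Sketch of Notebook 2: Process Data. The `employees` table, the `department`,
# `job_title`, and `salary` columns, and the output table name are placeholders.
dbutils.widgets.text("department", "")
department = dbutils.widgets.get("department")

df = spark.table("employees").filter(f"department = '{department}'")

# Example transformation: average salary per job title within the department
summary = df.groupBy("job_title").avg("salary")

# One output table per department
summary.write.mode("overwrite").saveAsTable(f"employee_summary_{department}")
```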
4. 🔹 Passing Values Between Tasks
- Use `dbutils.jobs.taskValues.set()` in the producer task.
- Retrieve with `dbutils.jobs.taskValues.get(taskKey, key)` in the consumer task.
✅ Example: Pass “Sunday” check from Notebook 1 → If/Else task.
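A minimal producer/consumer pair, reusing the task key and value name from the example above:

```python
# Producer task (task key "01_set_day" in the example job)
dbutils.jobs.taskValues.set(key="input_day", value="Sunday")

# Consumer task: read the value by the producer's task key.
# `default` is returned if the key is missing; `debugValue` is used when the
# notebook is run interactively outside of a job.
input_day = dbutils.jobs.taskValues.get(
    taskKey="01_set_day",
    key="input_day",
    default="unknown",
    debugValue="Sunday",
)
```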
5. 🔹 Conditional (If/Else) Tasks
- Add If/Else task in Workflow.
- Condition:
- Value from task output = “Sunday”.
- Operator = equals.
- True branch → Run process notebook.
- False branch → Run else notebook.
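In the condition, the upstream value is referenced with a dynamic value reference such as `{{tasks.01_set_day.values.input_day}}` (task key from the example above) and compared against the literal `Sunday` with the equals operator.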
6. 🔹 Re-run Failed Jobs
- Go to Run history → Repair run.
- Select failed tasks → re-execute only those.
- Saves time (no need to rerun entire pipeline).
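Repair runs can also be triggered through the Jobs API; a minimal sketch with the Databricks SDK for Python, where the run ID and task key are placeholders:

```python
# Sketch: repair a failed run, re-executing only the selected tasks.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
w.jobs.repair_run(
    run_id=123456789,                  # placeholder run ID from the run history
    rerun_tasks=["02_process_data"],   # only the failed task(s)
)
```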
7. 🔹 Override Parameters at Runtime
- Use Run with different parameters in UI.
- Example: Override `input_date` to `"2024-10-27T13:00:00"` → forces the workflow to evaluate as Sunday.
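The same override works through the API; a minimal sketch with the Databricks SDK for Python (the job ID is a placeholder):

```python
# Sketch: trigger the job with an overridden notebook parameter.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
waiter = w.jobs.run_now(
    job_id=987654321,  # placeholder job ID
    notebook_params={"input_date": "2024-10-27T13:00:00"},
)
# waiter.result() blocks until the run finishes, if needed
```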
8. 🔹 For Each Loop in Workflows
- Wrap a task (e.g., Process Department Data) inside a For Each loop.
- Provide an array of values (static or dynamic), e.g. `["sales", "office"]`.
- Each loop iteration passes one value → notebook parameter.
Example notebook parameter setup:

```python
dbutils.widgets.text("department", "")
department = dbutils.widgets.get("department")
print(f"Processing department: {department}")
```
💡 This runs the same task once per department, either in parallel or sequentially depending on the loop's concurrency setting.
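The input array can also come from an upstream task instead of being hardcoded; a sketch assuming a hypothetical `get_departments` task and the illustrative `employees` table:

```python
# Hypothetical upstream task (task key "get_departments"): publish the list of
# departments as a task value for the For Each loop to consume.
departments = [
    row.department
    for row in spark.table("employees").select("department").distinct().collect()
]
dbutils.jobs.taskValues.set(key="departments", value=departments)
```

The For Each task's input field can then reference it dynamically, e.g. `{{tasks.get_departments.values.departments}}`.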
9. 🔹 Best Practices
- ✅ Always use job clusters (auto-terminate) → cost saving.
- ✅ Centralize parameters at job level, override at task level when needed.
- ✅ Use taskValues for cross-task communication.
- ✅ Use If/Else for conditional ETL or SLA workflows.
- ✅ Use For Each for department-wise ETL, multi-source ingestion, or model training per dataset.
- ✅ Leverage repair runs instead of restarting full pipelines.
10. 🔹 Summary
- Jobs orchestrate pipelines.
- Tasks define execution units (Notebook, Python, SQL, etc.).
- Parameters & TaskValues allow passing dynamic values.
- If/Else = branch logic.
- For Each = loop logic.
- Repair Runs = selective reruns.
- Override Params = test/debug flexibility.
This makes Databricks Workflows a lightweight orchestrator (similar to Airflow, but native to Databricks).