1. 🔹 Introduction to Workflows
- Databricks Workflow (Job) = a pipeline of tasks (notebooks, scripts, SQL, pipelines, etc.).
- Use cases: ETL orchestration, data quality checks, ML pipelines, conditional branching.
- Each job = multiple tasks with dependencies, parameters, retries, schedules, etc.
2. 🔹 Jobs UI Overview
When creating a job:
- Task types: Notebook, Python script, Wheel, JAR, SQL, dbt, Spark submit, If/Else, For Each.
- Cluster: Use job clusters (terminate after run) or all-purpose clusters.
- Parameters: Pass values via widgets (`dbutils.widgets.get()` in notebooks).
- Notifications: Configure success/failure emails or alerts.
- Retries & Timeouts: Control job resiliency.
- Schedule/Trigger: Run once, on schedule, or triggered by events.
- Permissions: Control who can run/edit/manage jobs.
- Advanced: Queueing, max concurrent runs.
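The parameters bullet above can be sketched in a notebook cell. This is a minimal sketch: `dbutils` exists only inside a Databricks notebook, so the snippet falls back to a default value elsewhere; the parameter name and default are illustrative.

```python
# Read a job/task parameter inside a notebook via widgets.
# `dbutils` is only defined in a Databricks notebook; the fallback
# lets this sketch run outside Databricks for illustration.
try:
    dbutils.widgets.text("input_date", "2024-10-27T13:00:00")  # name + default
    input_date = dbutils.widgets.get("input_date")
except NameError:
    input_date = "2024-10-27T13:00:00"  # default used outside Databricks

print(f"Running with input_date={input_date}")
```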
3. 🔹 Creating a Job (Example: Process Employee Data)
Workflow:
- Get Day (extract current day name from date).
- Check if Sunday (If/Else branch).
- If True → Process data by department.
- If False → Print “Not Sunday”.
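The Get Day step can be checked in plain Python before wiring it into a notebook (a minimal sketch; `%a` produces the abbreviated day name, matching Spark's `'E'` pattern, and the timestamp format is assumed to match the job's `input_date`):

```python
from datetime import datetime

def day_name(input_date: str) -> str:
    # Parse the job's input_date (assumed "yyyy-MM-dd'T'HH:mm:ss" format)
    # and return the abbreviated day name, matching Spark's 'E' pattern.
    return datetime.strptime(input_date, "%Y-%m-%dT%H:%M:%S").strftime("%a")

print(day_name("2024-10-27T13:00:00"))  # → Sun (2024-10-27 is a Sunday)
```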
Notebook Setup
- Notebook 1: Get Day

  ```python
  dbutils.widgets.text("input_date", "")
  input_date = dbutils.widgets.get("input_date")

  # Get day of week (abbreviated name, e.g. "Sun", via the 'E' pattern)
  input_day = spark.sql(f"""
      SELECT date_format(to_timestamp('{input_date}', "yyyy-MM-dd'T'HH:mm:ss"), 'E') as day
  """).collect()[0].day

  # Publish the value for downstream tasks
  dbutils.jobs.taskValues.set(key="input_day", value=input_day)
  ```

- Notebook 2: Process Data (department-based ETL).
- Notebook 3: Else branch (just print the day).

  ```python
  input_day = dbutils.jobs.taskValues.get(taskKey="01_set_day", key="input_day")
  print(f"Today is {input_day}, skipping processing.")
  ```
4. 🔹 Passing Values Between Tasks
- Use `dbutils.jobs.taskValues.set()` in the producer task.
- Retrieve with `dbutils.jobs.taskValues.get(taskKey, key)` in the consumer task.
➡️ Example: Pass the “Sunday” check from Notebook 1 → the If/Else task.
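As a minimal, self-contained sketch of that producer/consumer flow (a plain dict stands in for the real `dbutils.jobs.taskValues` store; task and key names follow the example above):

```python
# Toy stand-in for dbutils.jobs.taskValues, to illustrate the flow only.
_task_values = {}

def set_task_value(key, value, task_key):
    # Producer side: mirrors dbutils.jobs.taskValues.set(key=..., value=...)
    _task_values[(task_key, key)] = value

def get_task_value(task_key, key):
    # Consumer side: mirrors dbutils.jobs.taskValues.get(taskKey=..., key=...)
    return _task_values[(task_key, key)]

set_task_value("input_day", "Sun", task_key="01_set_day")  # in Notebook 1
print(get_task_value("01_set_day", "input_day"))           # → Sun
```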
5. 🔹 Conditional (If/Else) Tasks
- Add If/Else task in Workflow.
- Condition:
  - Value from task output = “Sun” (note: `date_format(..., 'E')` yields the abbreviated day name, not “Sunday”).
  - Operator = equals.
- True branch β Run process notebook.
- False branch β Run else notebook.
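The branch logic can be sketched in plain Python (the task names returned here are hypothetical; compare against whatever string the producer task actually emits):

```python
def choose_branch(input_day: str) -> str:
    # Mirrors the If/Else task: an equals comparison on the task value.
    # "Sun" assumes the producer used date_format(..., 'E'); use "Sunday"
    # instead if the producer emits the full day name.
    if input_day == "Sun":
        return "process_notebook"  # True branch (hypothetical task name)
    return "else_notebook"         # False branch (hypothetical task name)

print(choose_branch("Sun"))  # → process_notebook
```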
6. 🔹 Re-run Failed Jobs
- Go to Run history → Repair run.
- Select failed tasks → re-execute only those.
- Saves time (no need to rerun entire pipeline).
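Repair runs can also be triggered programmatically via the Jobs REST API (`POST /api/2.1/jobs/runs/repair`). The sketch below only builds the request payload; the run ID and task name are illustrative:

```python
import json

def repair_payload(run_id: int, failed_tasks: list) -> str:
    # Payload for POST /api/2.1/jobs/runs/repair:
    # rerun only the named failed tasks of an existing run.
    return json.dumps({"run_id": run_id, "rerun_tasks": failed_tasks})

print(repair_payload(12345, ["02_process_data"]))  # illustrative values
```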
7. 🔹 Override Parameters at Runtime
- Use Run with different parameters in UI.
- Example: Override `input_date` to "2024-10-27T13:00:00" → forces the workflow to evaluate as Sunday.
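The same override works programmatically via `POST /api/2.1/jobs/run-now` with `notebook_params`; again only the payload is sketched, and the job ID is hypothetical:

```python
import json

def run_now_payload(job_id: int, notebook_params: dict) -> str:
    # Payload for POST /api/2.1/jobs/run-now: notebook_params
    # override the job-level defaults for this run only.
    return json.dumps({"job_id": job_id, "notebook_params": notebook_params})

payload = run_now_payload(987, {"input_date": "2024-10-27T13:00:00"})
print(payload)
```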
8. 🔹 For Each Loop in Workflows
- Wrap a task (e.g., Process Department Data) inside a For Each loop.
- Provide an array of values (static or dynamic), e.g. `["sales", "office"]`.
- Each loop iteration passes one value → notebook parameter.
Example notebook parameter setup:

```python
dbutils.widgets.text("department", "")
department = dbutils.widgets.get("department")
print(f"Processing department: {department}")
```
💡 This runs the same task multiple times (in parallel or sequentially) for each department.
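A plain-Python sketch of the loop semantics (the function stands in for the looped notebook; in the real For Each task, concurrency is set in the task configuration):

```python
from concurrent.futures import ThreadPoolExecutor

def process_department(department: str) -> str:
    # Stand-in for the looped "Process Department Data" notebook.
    return f"Processing department: {department}"

departments = ["sales", "office"]

# Sequential iterations (concurrency = 1)
sequential = [process_department(d) for d in departments]

# Parallel iterations (concurrency > 1)
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = list(pool.map(process_department, departments))

print(sequential == parallel)  # → True: same results either way
```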
9. 🔹 Best Practices
- ✅ Always use job clusters (auto-terminate) → cost saving.
- ✅ Centralize parameters at the job level; override at the task level when needed.
- ✅ Use taskValues for cross-task communication.
- ✅ Use If/Else for conditional ETL or SLA workflows.
- ✅ Use For Each for department-wise ETL, multi-source ingestion, or model training per dataset.
- ✅ Leverage repair runs instead of restarting full pipelines.
10. 🔹 Summary
- Jobs orchestrate pipelines.
- Tasks define execution units (Notebook, Python, SQL, etc.).
- Parameters & TaskValues allow passing dynamic values.
- If/Else = branch logic.
- For Each = loop logic.
- Repair Runs = selective reruns.
- Override Params = test/debug flexibility.
This makes Databricks Workflows a lightweight orchestrator (similar to Airflow but native inside Databricks).