Databricks Compute (Clusters, Access Modes, Policies, and Permissions)


1. What is Compute in Databricks?

  • Compute = processing power in Databricks.
  • In practice, compute means a cluster: a group of virtual machines.
  • A (multi-node) cluster has:
    • Driver node → coordinates the work and runs your notebook/job code.
    • Worker nodes → perform the actual distributed data processing (see the sketch below).
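
A quick way to see the split from a notebook (a minimal sketch; spark is the SparkSession that Databricks pre-creates in every notebook):

```python
# Plain Python like this runs only on the driver node.
local_total = sum(range(1_000_000))

# A Spark DataFrame is split into partitions that the worker nodes process in parallel.
df = spark.range(1_000_000)                  # 'spark' is pre-created in Databricks notebooks
print(df.rdd.getNumPartitions())             # partitions are spread across worker cores
print(df.selectExpr("sum(id)").collect())    # workers compute partial sums; driver collects them
```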

2. Types of Compute in Databricks

🔹 All-Purpose Compute

  • Interactive clusters used for notebooks, SQL queries, or ad-hoc jobs.
  • Stay running until manually terminated or auto-terminated (see the creation sketch after this list).
  • Good for:
    • Exploratory data analysis
    • Development
    • Testing
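
A minimal sketch of creating an all-purpose cluster with the Databricks SDK for Python (assumes the databricks-sdk package is installed and authentication is configured; the runtime version and node type are placeholders to replace with ones available in your workspace):

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # picks up auth from env vars or ~/.databrickscfg

cluster = w.clusters.create(
    cluster_name="dev-exploration",        # hypothetical name
    spark_version="15.4.x-scala2.12",      # example LTS runtime; list options via w.clusters.spark_versions()
    node_type_id="i3.xlarge",              # example AWS node type; varies by cloud
    num_workers=2,                         # 1 driver + 2 workers
    autotermination_minutes=60,            # shut down after 60 minutes of inactivity
).result()                                 # block until the cluster is running

print(cluster.cluster_id, cluster.state)
```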

🔹 Job Compute

  • Ephemeral clusters created automatically when you run a scheduled job/workflow.
  • Start when the job runs → terminate immediately after.
  • Good for:
    • Production workloads
    • Automated pipelines
  • Saves cost since the cluster exists only while the job runs (see the job sketch below).
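
A sketch of a scheduled job that brings up its own job cluster, again via the Databricks SDK for Python (the job name, notebook path, cluster size, and cron schedule are illustrative assumptions):

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute, jobs

w = WorkspaceClient()

job = w.jobs.create(
    name="nightly-etl",                                   # hypothetical job name
    tasks=[
        jobs.Task(
            task_key="ingest",
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/pipelines/ingest"),
            new_cluster=compute.ClusterSpec(              # job cluster: created per run, terminated after
                spark_version="15.4.x-scala2.12",
                node_type_id="i3.xlarge",
                num_workers=4,
            ),
        )
    ],
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 2 * * ?",             # run daily at 02:00
        timezone_id="UTC",
    ),
)
print(job.job_id)
```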

🔹 Serverless Compute (availability varies by cloud and region)

  • Fully managed, no need to configure cluster size/type.
  • Databricks decides resources behind the scenes.

3. Access Modes in Compute

Access modes determine how users and Unity Catalog interact with clusters; in the Clusters API this is the data_security_mode setting (see the sketch at the end of this section):

  • Single User → Cluster tied to one user; good for personal work.
  • Shared → Multiple users can attach notebooks; Unity Catalog enabled.
  • No Isolation Shared → Legacy option for Hive metastore, not supported by Unity Catalog.

💡 Best practice:

  • Use Shared clusters with Unity Catalog for team projects.
  • Use Single User clusters for development.
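
A hedged sketch of both modes with the Python SDK (the user name, cluster names, and sizes are placeholders):

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import DataSecurityMode

w = WorkspaceClient()

# Single User: tied to one principal, Unity Catalog enabled.
w.clusters.create(
    cluster_name="alice-dev",
    spark_version="15.4.x-scala2.12",
    node_type_id="i3.xlarge",
    num_workers=1,
    data_security_mode=DataSecurityMode.SINGLE_USER,
    single_user_name="alice@example.com",          # placeholder principal
    autotermination_minutes=30,
)

# Shared: multiple users, isolated from each other, Unity Catalog enabled.
w.clusters.create(
    cluster_name="team-shared",
    spark_version="15.4.x-scala2.12",
    node_type_id="i3.xlarge",
    num_workers=2,
    data_security_mode=DataSecurityMode.USER_ISOLATION,
    autotermination_minutes=60,
)
```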

4. Cluster Permissions

You can assign access at the cluster level, either in the UI or via the Permissions API (see the sketch after this list):

  • Can Manage → Full rights (edit, restart, delete, and manage permissions).
  • Can Restart → Start, restart, and terminate the cluster (includes Can Attach To rights).
  • Can Attach To → Attach notebooks or SQL queries, but cannot start/stop or modify the cluster.
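
A minimal sketch of granting cluster permissions with the Python SDK's Permissions API (the group names and cluster ID are placeholders):

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import iam

w = WorkspaceClient()

w.permissions.set(
    request_object_type="clusters",
    request_object_id="0801-123456-abcd123",       # placeholder cluster ID
    access_control_list=[
        iam.AccessControlRequest(group_name="data-engineers",
                                 permission_level=iam.PermissionLevel.CAN_RESTART),
        iam.AccessControlRequest(group_name="analysts",
                                 permission_level=iam.PermissionLevel.CAN_ATTACH_TO),
    ],
)
```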

5. Cluster Policies

  • A policy = template + restrictions for cluster creation.
  • Unrestricted = full freedom (default).
  • Predefined Policies:
    • Personal Compute → single node, single user.
    • Shared Compute → multi-node, shared mode.
    • Power User Compute → allows larger, autoscaling clusters.
    • Legacy Shared → for non-Unity Catalog workloads.
  • You can also create custom policies (sketched below) to enforce:
    • Allowed VM types
    • Auto-termination rules
    • Worker/driver size
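
A sketch of a custom policy created with the Python SDK; the definition uses the standard policy-definition syntax (fixed values, allowlists, ranges), and the name and limits are illustrative:

```python
import json
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

policy_definition = {
    # Restrict which VM types users may pick.
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},
    # Force auto-termination and hide the field so users cannot change it.
    "autotermination_minutes": {"type": "fixed", "value": 60, "hidden": True},
    # Cap cluster size.
    "autoscale.max_workers": {"type": "range", "maxValue": 8},
}

policy = w.cluster_policies.create(
    name="team-etl-policy",                        # hypothetical policy name
    definition=json.dumps(policy_definition),
)
print(policy.policy_id)
```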

6. Important Cluster Settings

  • Databricks Runtime (DBR) → Pre-packaged Spark + Scala + Python + libraries.
    • Always pick the latest LTS (Long-Term Support) version.
  • Photon → Databricks' native vectorized C++ engine; accelerates Spark SQL and DataFrame workloads at a higher DBU rate.
  • Autoscaling → Define min/max workers; cluster grows/shrinks automatically.
  • Auto-Termination → Saves cost by shutting cluster after X mins of inactivity.
  • VM Types → Choose compute-optimized vs. memory-optimized instances based on workload (all of these settings appear in the sketch below).
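
A sketch that pulls these settings together; the SDK's select_spark_version and select_node_type helpers pick an LTS runtime and a node type, and the remaining values are assumptions to adjust for your workspace:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import AutoScale, RuntimeEngine

w = WorkspaceClient()

latest_lts = w.clusters.select_spark_version(latest=True, long_term_support=True)
node_type = w.clusters.select_node_type(local_disk=True)

w.clusters.create(
    cluster_name="tuned-cluster",                       # hypothetical name
    spark_version=latest_lts,                           # latest LTS Databricks Runtime
    node_type_id=node_type,
    autoscale=AutoScale(min_workers=2, max_workers=8),  # grows/shrinks with load
    autotermination_minutes=45,                         # shut down after 45 idle minutes
    runtime_engine=RuntimeEngine.PHOTON,                # enable the Photon engine
)
```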

7. Monitoring & Debugging

Clusters provide:

  • Event Logs → track cluster lifecycle events (creation, termination, autoscaling up/down); the sketch below pulls recent events via the API.
  • Spark UI → debug jobs and see DAG execution.
  • Metrics tab → monitor CPU/memory usage.
  • Driver Logs → check stdout, stderr for errors.
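
A minimal sketch of reading a cluster's recent events through the API (the cluster ID is a placeholder):

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Iterate over recent lifecycle events (creation, resizing, termination, ...).
for event in w.clusters.events(cluster_id="0801-123456-abcd123", limit=20):
    print(event.timestamp, event.type, event.details)
```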

8. Key Differences: All Purpose vs Job Compute

Feature         | All Purpose Compute             | Job Compute
Usage           | Interactive (notebooks, SQL)    | Scheduled jobs
Lifecycle       | Manual start/stop               | Auto-create, auto-kill
Cost efficiency | Less efficient if left running  | More efficient
Best for        | Dev & exploration               | Production workloads

✅ Conclusion:

  • Use All Purpose Compute for dev/test.
  • Use Job Compute for scheduled production pipelines.
  • Always enable auto-termination and policies to save cost.
  • Prefer Unity Catalog enabled clusters (Single User / Shared) for governance.