Rajesh Kumar | August 22, 2025


1. What is Compute in Databricks?

  • Compute = processing power in Databricks.
  • In practice, compute means clusters (a group of virtual machines).
  • A cluster always has:
    • Driver node → coordinates the job and schedules work across the cluster.
    • Worker nodes → perform the actual data processing.

2. Types of Compute in Databricks

🔹 All-Purpose Compute

  • Interactive clusters used for notebooks, SQL queries, or ad-hoc jobs.
  • Stay running until manually terminated or auto-terminated.
  • Good for:
    • Exploratory data analysis
    • Development
    • Testing
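
For reference, here is a minimal sketch of creating an all-purpose cluster through the Databricks Clusters REST API. The workspace URL, token, runtime version, and node type are placeholder assumptions; adjust them for your cloud and workspace.

```python
# Sketch: create an all-purpose (interactive) cluster via the Clusters REST API.
# HOST, TOKEN, runtime, and node type below are placeholders, not real values.
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
TOKEN = "<personal-access-token>"                        # placeholder

cluster_spec = {
    "cluster_name": "dev-exploration",
    "spark_version": "15.4.x-scala2.12",   # pick the latest LTS runtime in your workspace
    "node_type_id": "i3.xlarge",           # example AWS node type; differs per cloud
    "num_workers": 2,
    "autotermination_minutes": 30,         # shut down after 30 minutes of inactivity
}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
print(resp.json())   # returns the new cluster_id on success
```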

🔹 Job Compute

  • Ephemeral clusters created automatically when you run a scheduled job/workflow.
  • Start when the job runs → terminate immediately after the job finishes.
  • Good for:
    • Production workloads
    • Automated pipelines
  • Saves cost since cluster exists only while the job runs.
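
A rough sketch of how this looks in practice: a scheduled job defined through the Jobs REST API with an inline new_cluster, so the cluster exists only for the run. The notebook path, schedule, and cluster values are illustrative assumptions.

```python
# Sketch: a scheduled job that runs on an ephemeral "job compute" cluster.
# The cluster is declared inline under new_cluster and is destroyed after the run.
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
TOKEN = "<personal-access-token>"                        # placeholder

job_spec = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "load",
            "notebook_task": {"notebook_path": "/Repos/etl/load"},  # hypothetical path
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 4,
            },
        }
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",   # every day at 02:00
        "timezone_id": "UTC",
    },
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
print(resp.json())   # returns the job_id
```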

🔹 Serverless Compute (rolling out in preview/GA by region)

  • Fully managed, no need to configure cluster size/type.
  • Databricks decides resources behind the scenes.

3. Access Modes in Compute

Access modes determine how users and Unity Catalog interact with clusters:

  • Single User → Cluster tied to one user; good for personal work.
  • Shared → Multiple users can attach notebooks; Unity Catalog enabled.
  • No Isolation Shared → Legacy option for the Hive metastore; not supported by Unity Catalog.

💡 Best practice:

  • Use Shared clusters with Unity Catalog for team projects.
  • Use Single User clusters for development.
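
In the Clusters API, the access mode surfaces as the data_security_mode field. A minimal sketch, assuming the classic value names (SINGLE_USER for Single User, USER_ISOLATION for Shared, NONE for No Isolation Shared); the user email and node values are hypothetical:

```python
# Sketch: access mode expressed as data_security_mode in a cluster spec.
single_user_cluster = {
    "cluster_name": "alice-dev",
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 1,
    "data_security_mode": "SINGLE_USER",      # dedicated to one user, Unity Catalog enabled
    "single_user_name": "alice@example.com",  # hypothetical user
}

shared_cluster = {
    "cluster_name": "team-shared",
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 4,
    "data_security_mode": "USER_ISOLATION",   # "Shared" access mode, Unity Catalog enabled
}
# "NONE" corresponds to the legacy No Isolation Shared mode (no Unity Catalog).
```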

4. Cluster Permissions

You can assign access at the cluster level:

  • Can Manage → Full rights (edit, delete, restart, set permissions).
  • Can Restart → Start, restart, and terminate the cluster (includes attach rights).
  • Can Attach To → Attach notebooks or SQL queries, but cannot start, stop, or modify the cluster.
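
A sketch of assigning these levels with the Permissions REST API; the group names and cluster ID are placeholders:

```python
# Sketch: grant cluster-level permissions via the Permissions REST API.
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
TOKEN = "<personal-access-token>"                        # placeholder
CLUSTER_ID = "<cluster-id>"                              # placeholder

acl = {
    "access_control_list": [
        {"group_name": "data-engineers", "permission_level": "CAN_MANAGE"},
        {"group_name": "analysts",       "permission_level": "CAN_ATTACH_TO"},
        {"group_name": "schedulers",     "permission_level": "CAN_RESTART"},
    ]
}

# PATCH adds or updates the listed grants without replacing existing ones.
resp = requests.patch(
    f"{HOST}/api/2.0/permissions/clusters/{CLUSTER_ID}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=acl,
)
print(resp.status_code)
```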

5. Cluster Policies

  • A policy = template + restrictions for cluster creation.
  • Unrestricted = full freedom (default).
  • Predefined Policies:
    • Personal Compute → single node, single user.
    • Shared Compute → multi-node, shared access mode.
    • Power User Compute → allows scaling.
    • Legacy Shared → for non-Unity Catalog workloads.
  • You can also create custom policies to enforce:
    • Allowed VM types
    • Auto-termination rules
    • Worker/driver size
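
As a sketch of such a custom policy: the definition is a JSON document of attribute rules (for example allowlist and range types) passed as a string to the Cluster Policies API. The attribute paths and limits below are illustrative assumptions.

```python
# Sketch: a custom cluster policy that pins allowed node types, caps autoscaling,
# and forces auto-termination.
import json
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
TOKEN = "<personal-access-token>"                        # placeholder

definition = {
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},
    "autotermination_minutes": {"type": "range", "maxValue": 60, "defaultValue": 30},
    "autoscale.max_workers": {"type": "range", "maxValue": 8},
}

resp = requests.post(
    f"{HOST}/api/2.0/policies/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"name": "team-etl-policy", "definition": json.dumps(definition)},
)
print(resp.json())   # returns the policy_id to reference when creating clusters
```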

6. Important Cluster Settings

  • Databricks Runtime (DBR) → Pre-packaged Spark + Scala + Python + libraries.
    • Always pick the latest LTS (Long-Term Support) version.
  • Photon → Native C++ vectorized engine; speeds up Spark SQL and DataFrame workloads at a slightly higher cost.
  • Autoscaling → Define min/max workers; the cluster grows/shrinks automatically with load.
  • Auto-Termination → Saves cost by shutting the cluster down after X minutes of inactivity.
  • VM Types → Choose compute-optimized vs memory-optimized instances based on workload.
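
Putting these settings together, a sketch of a single cluster spec (field names follow the Clusters API; the runtime, node type, and policy ID are example placeholders):

```python
# Sketch: one cluster spec combining LTS runtime, Photon, autoscaling,
# auto-termination, and an optional policy.
cluster_spec = {
    "cluster_name": "analytics-autoscale",
    "spark_version": "15.4.x-scala2.12",     # an LTS runtime (example value)
    "runtime_engine": "PHOTON",              # enable Photon (higher DBU rate)
    "node_type_id": "i3.xlarge",             # pick compute- vs memory-optimized here
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 45,           # terminate after 45 minutes idle
    "policy_id": "<policy-id>",              # optional: attach a cluster policy
}
```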

7. Monitoring & Debugging

Clusters provide:

  • Event Logs → track cluster lifecycle events such as autoscaling up/down, restarts, and terminations.
  • Spark UI → debug jobs and inspect DAG execution, stages, and tasks.
  • Metrics tab → monitor CPU/memory usage across driver and workers.
  • Driver Logs → check stdout and stderr for errors.
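
The event log can also be read programmatically. A sketch using the Clusters events endpoint (cluster ID and workspace values are placeholders):

```python
# Sketch: pull recent cluster events (e.g. autoscaling up/down) from the
# Clusters events REST endpoint.
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
TOKEN = "<personal-access-token>"                        # placeholder

resp = requests.post(
    f"{HOST}/api/2.0/clusters/events",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"cluster_id": "<cluster-id>", "order": "DESC", "limit": 25},
)
for event in resp.json().get("events", []):
    print(event["timestamp"], event["type"])
```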

8. Key Differences: All Purpose vs Job Compute

Feature | All-Purpose Compute | Job Compute
--- | --- | ---
Usage | Interactive (notebooks, SQL) | Scheduled jobs
Lifecycle | Manual start/stop | Auto-created, auto-terminated
Cost efficiency | Less efficient if left running | More efficient
Best for | Dev & exploration | Production workloads

✅ Conclusion:

  • Use All Purpose Compute for dev/test.
  • Use Job Compute for scheduled production pipelines.
  • Always enable auto-termination and policies to save cost.
  • Prefer Unity Catalog enabled clusters (Single User / Shared) for governance.
