1. What is Compute in Databricks?
- Compute = processing power in Databricks.
- In practice, compute means clusters (a group of virtual machines).
- A cluster has:
- Driver node → runs the Spark driver; coordinates the work and collects results.
- Worker nodes → run Spark executors that perform the actual data processing.
- (A single-node cluster runs everything on the driver.)
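A minimal PySpark sketch of that split, runnable as-is in a Databricks notebook (where a `spark` session is predefined): the driver only plans the job and receives the result, while the executors on the workers scan and aggregate the partitions.

```python
# Runs in a Databricks notebook, where a SparkSession named `spark` already exists.
# The driver builds the query plan and receives the final result;
# the row generation and summation run as executor tasks on the worker nodes.
df = spark.range(0, 100_000_000)               # distributed dataset, split into partitions
total = df.selectExpr("sum(id)").first()[0]    # workers aggregate, driver collects one row
print(total)
```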
2. Types of Compute in Databricks
🔹 All-Purpose Compute
- Interactive clusters used for notebooks, SQL queries, or ad-hoc jobs.
- Stay running until manually terminated or auto-terminated.
- Good for:
- Exploratory data analysis
- Development
- Testing
🔹 Job Compute
- Ephemeral clusters created automatically when you run a scheduled job/workflow.
- Start when the job runs → terminate immediately after.
- Good for:
- Production workloads
- Automated pipelines
- Saves cost since cluster exists only while the job runs.
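A hedged sketch of how an ephemeral job cluster is declared, using the Jobs REST API (field names follow the Jobs 2.1 API; the host, token, notebook path, runtime, and node type below are placeholders/examples):

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
TOKEN = "<personal-access-token>"                          # placeholder

# Jobs API 2.1: the `new_cluster` block defines an ephemeral job cluster that is
# created when the run starts and terminated as soon as the run finishes.
job_spec = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "etl",
            "notebook_task": {"notebook_path": "/Repos/team/etl/nightly"},  # placeholder path
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",   # example LTS runtime; adjust to your workspace
                "node_type_id": "i3.xlarge",           # example node type
                "num_workers": 2,
            },
        }
    ],
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
print(resp.json())   # returns the new job_id on success
```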
🔹 Serverless Compute (availability varies by region and workload type)
- Fully managed; no cluster size or VM type to configure.
- Databricks provisions and scales the resources behind the scenes.
3. Access Modes in Compute
Access modes determine how users and Unity Catalog interact with clusters:
- Single User → cluster tied to one user; supports Unity Catalog; good for personal work.
- Shared → multiple users can attach notebooks; Unity Catalog enabled with user isolation.
- No Isolation Shared → legacy option for the Hive metastore; not supported by Unity Catalog.
💡 Best practice:
- Use Shared clusters with Unity Catalog for team projects.
- Use Single User clusters for development.
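In the Clusters API, the access mode corresponds to the `data_security_mode` field. A sketch of the relevant part of a cluster spec (runtime, node type, and user name are placeholders):

```python
# Fragment of a cluster spec (posted to POST /api/2.1/clusters/create).
# data_security_mode selects the access mode:
#   "SINGLE_USER"    -> Single User (also set single_user_name)
#   "USER_ISOLATION" -> Shared (Unity Catalog with user isolation)
#   "NONE"           -> No Isolation Shared (legacy, no Unity Catalog)
single_user_spec = {
    "cluster_name": "alice-dev",
    "spark_version": "15.4.x-scala2.12",     # example LTS runtime
    "node_type_id": "i3.xlarge",              # example node type
    "num_workers": 1,
    "data_security_mode": "SINGLE_USER",
    "single_user_name": "alice@example.com",  # placeholder user
}
```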
4. Cluster Permissions
You can assign access at the cluster level:
- Can Manage → full rights (edit, delete, restart, change permissions).
- Can Restart → attach notebooks plus start/restart/terminate; cannot edit the configuration.
- Can Attach To → attach notebooks or SQL queries only; cannot start/stop or modify.
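These levels can be granted through the Permissions REST API as well as the UI. A hedged sketch (cluster ID, group, and user names are placeholders):

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
TOKEN = "<personal-access-token>"                          # placeholder
CLUSTER_ID = "0801-123456-abcd123"                         # placeholder

# Permissions API: PATCH adds/updates entries without replacing the whole ACL.
acl = {
    "access_control_list": [
        {"group_name": "data-engineers", "permission_level": "CAN_RESTART"},        # placeholder group
        {"user_name": "analyst@example.com", "permission_level": "CAN_ATTACH_TO"},  # placeholder user
    ]
}

resp = requests.patch(
    f"{HOST}/api/2.0/permissions/clusters/{CLUSTER_ID}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=acl,
)
print(resp.status_code)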
5. Cluster Policies
- A policy = template + restrictions for cluster creation.
- Unrestricted = full freedom (default).
- Predefined Policies:
- Personal Compute → single-node, single-user clusters.
- Shared Compute → multi-node clusters in shared access mode.
- Power User Compute → larger multi-node clusters with autoscaling for heavier individual workloads.
- Legacy Shared Compute → for workloads not yet on Unity Catalog.
- You can also create custom policies to enforce:
- Allowed VM types
- Auto-termination rules
- Worker/driver size
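A hedged sketch of a custom policy enforcing exactly those constraints, using the Cluster Policies REST API (host, token, policy name, and VM types are placeholders; the `definition` is sent as a JSON string):

```python
import json
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
TOKEN = "<personal-access-token>"                          # placeholder

# Policy definition: each key constrains one cluster attribute.
definition = {
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},      # allowed VM types
    "autotermination_minutes": {"type": "range", "maxValue": 60, "defaultValue": 30},  # force auto-termination
    "num_workers": {"type": "range", "maxValue": 8},                                   # cap worker count
    "driver_node_type_id": {"type": "fixed", "value": "i3.xlarge"},                    # pin the driver size
}

resp = requests.post(
    f"{HOST}/api/2.0/policies/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"name": "team-cost-guardrails", "definition": json.dumps(definition)},
)
print(resp.json())   # returns the policy_id on success
```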
6. Important Cluster Settings
- Databricks Runtime (DBR) → Pre-packaged Spark + Scala + Python + libraries.
- Always pick the latest LTS (Long-Term Support) version.
- Photon → native vectorized query engine written in C++; speeds up SQL and DataFrame workloads but consumes DBUs at a higher rate.
- Autoscaling → Define min/max workers; cluster grows/shrinks automatically.
- Auto-Termination → saves cost by shutting the cluster down after N minutes of inactivity.
- VM Types → Choose compute optimized vs memory optimized based on workload.
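A hedged sketch pulling these settings together into one clusters/create call (host, token, runtime version, and node types are placeholders/examples; check what your workspace actually offers):

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
TOKEN = "<personal-access-token>"                          # placeholder

cluster_spec = {
    "cluster_name": "team-etl-dev",
    "spark_version": "15.4.x-scala2.12",                # example LTS Databricks Runtime
    "runtime_engine": "PHOTON",                          # enable Photon ("STANDARD" to disable)
    "autoscale": {"min_workers": 2, "max_workers": 8},   # autoscaling bounds
    "autotermination_minutes": 30,                       # stop after 30 min of inactivity
    "node_type_id": "i3.xlarge",                         # worker VM type (example)
    "driver_node_type_id": "i3.xlarge",                  # driver VM type (example)
    "data_security_mode": "USER_ISOLATION",              # Shared access mode (Unity Catalog)
}

resp = requests.post(
    f"{HOST}/api/2.1/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
print(resp.json())   # returns the cluster_id on success
```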
7. Monitoring & Debugging
Clusters provide:
- Event Logs → track cluster lifecycle events (creation, autoscaling up/down, termination).
- Spark UI → debug jobs and see DAG execution.
- Metrics tab → monitor CPU/memory usage.
- Driver Logs → check stdout, stderr for errors.
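Event logs can also be pulled programmatically. A hedged sketch using the Clusters events endpoint (host, token, cluster ID are placeholders; the event types listed are examples):

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
TOKEN = "<personal-access-token>"                          # placeholder
CLUSTER_ID = "0801-123456-abcd123"                         # placeholder

# Pull the most recent resize/termination events for a cluster.
resp = requests.post(
    f"{HOST}/api/2.1/clusters/events",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "cluster_id": CLUSTER_ID,
        "event_types": ["RESIZING", "UPSIZE_COMPLETED", "TERMINATING"],
        "limit": 25,
    },
)
for event in resp.json().get("events", []):
    print(event["timestamp"], event["type"], event.get("details", {}))
```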
8. Key Differences: All Purpose vs Job Compute
| Feature | All Purpose Compute | Job Compute |
|---|---|---|
| Usage | Interactive (notebooks, SQL) | Scheduled jobs |
| Lifecycle | Manual start/stop | Auto-create, auto-terminate |
| Cost efficiency | Less efficient if left running | More efficient |
| Best for | Dev & exploration | Production workloads |
✅ Conclusion:
- Use All Purpose Compute for dev/test.
- Use Job Compute for scheduled production pipelines.
- Always enable auto-termination and policies to save cost.
- Prefer Unity Catalog enabled clusters (Single User / Shared) for governance.