1. What is Compute in Databricks?
- Compute = processing power in Databricks.
- In practice, compute means clusters (groups of virtual machines).
- A cluster typically has:
  - Driver node – coordinates the job.
  - Worker nodes – perform the actual data processing (see the listing sketch below).
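To make the driver/worker split concrete, here is a minimal sketch using the Databricks Python SDK (`databricks-sdk`) to list the clusters in a workspace. Authentication via environment variables or a CLI profile is an assumption, and the fields shown follow the SDK's `ClusterDetails` object.

```python
# Minimal sketch: list workspace clusters with the Databricks Python SDK.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # picks up DATABRICKS_HOST / DATABRICKS_TOKEN or a config profile

for c in w.clusters.list():
    # Every cluster has one driver; num_workers (or autoscale) controls the workers.
    print(f"{c.cluster_name}: state={c.state}, workers={c.num_workers}")
```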
2. Types of Compute in Databricks
🔹 All-Purpose Compute
- Interactive clusters used for notebooks, SQL queries, or ad-hoc work.
- Stay running until manually terminated or auto-terminated.
- Good for:
  - Exploratory data analysis
  - Development
  - Testing
🔹 Job Compute
- Ephemeral clusters created automatically when you run a scheduled job/workflow.
- Start when the job runs and terminate as soon as it finishes.
- Good for:
  - Production workloads
  - Automated pipelines
- Saves cost, since the cluster exists only while the job runs (see the job-definition sketch below).
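A hedged sketch of a scheduled job running on ephemeral Job Compute, using the Databricks Python SDK. The job name, notebook path, node type, and runtime version are placeholder assumptions; adjust them to your workspace and cloud.

```python
# Sketch: define a job whose task runs on an ephemeral job cluster (Job Compute).
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs, compute

w = WorkspaceClient()

job = w.jobs.create(
    name="nightly-etl",  # hypothetical job name
    tasks=[
        jobs.Task(
            task_key="etl",
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/etl/nightly"),  # placeholder path
            # new_cluster = Job Compute: created when the run starts, terminated when it ends
            new_cluster=compute.ClusterSpec(
                spark_version="15.4.x-scala2.12",  # an LTS runtime (assumed available)
                node_type_id="i3.xlarge",          # cloud-specific; placeholder
                num_workers=2,
            ),
        )
    ],
)
print(f"Created job {job.job_id}")
```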
🔹 Serverless Compute (availability varies by region; preview or GA)
- Fully managed; no need to configure cluster size or type.
- Databricks decides the resources behind the scenes.
3. Access Modes in Compute
Access modes determine how users and Unity Catalog interact with clusters:
- Single User – cluster tied to one user; good for personal work.
- Shared – multiple users can attach notebooks; Unity Catalog enabled.
- No Isolation Shared – legacy option for the Hive metastore; not supported by Unity Catalog.
💡 Best practice:
- Use Shared clusters with Unity Catalog for team projects.
- Use Single User clusters for development (the sketch below shows how the access mode appears in a cluster spec).
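A sketch of how the access mode surfaces in a cluster spec through the `data_security_mode` field of the Databricks Python SDK. The cluster name, node type, and user email are placeholders; `SINGLE_USER` and `USER_ISOLATION` (the "Shared" mode) are the Unity Catalog-capable options.

```python
# Sketch: create a Single User cluster; swap in USER_ISOLATION for a Shared cluster.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

w = WorkspaceClient()

cluster = w.clusters.create(
    cluster_name="dev-single-user",            # hypothetical name
    spark_version="15.4.x-scala2.12",          # assumed LTS runtime
    node_type_id="i3.xlarge",                  # placeholder node type
    num_workers=1,
    data_security_mode=compute.DataSecurityMode.SINGLE_USER,  # access mode
    single_user_name="some.user@example.com",  # placeholder; the user the cluster is tied to
    autotermination_minutes=30,
).result()  # clusters.create returns a waiter; .result() blocks until the cluster is running
```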
4. Cluster Permissions
You can assign access at the cluster level:
- Can Manage – full rights: edit, delete, restart, and manage permissions.
- Can Restart – start, restart, and terminate the cluster (includes attach rights).
- Can Attach To – attach notebooks or SQL queries, but cannot start/stop or modify the cluster (see the permissions sketch below).
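A sketch of assigning these permission levels programmatically with the Databricks Python SDK's permissions API. The cluster ID, group name, and user email are placeholder assumptions; note that `permissions.set` replaces the full access list, while `permissions.update` merges into it.

```python
# Sketch: grant cluster-level permissions to a group and a user.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import iam

w = WorkspaceClient()

w.permissions.set(
    request_object_type="clusters",
    request_object_id="0901-123456-abcdefgh",  # hypothetical cluster ID
    access_control_list=[
        iam.AccessControlRequest(
            group_name="data-engineers",                       # placeholder group
            permission_level=iam.PermissionLevel.CAN_RESTART,  # attach + start/restart/terminate
        ),
        iam.AccessControlRequest(
            user_name="analyst@example.com",                    # placeholder user
            permission_level=iam.PermissionLevel.CAN_ATTACH_TO, # attach only
        ),
    ],
)
```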
5. Cluster Policies
- A policy = a template plus restrictions for cluster creation.
- Unrestricted = full freedom (the default).
- Predefined policies:
  - Personal Compute – single node, single user.
  - Shared Compute – multi-node, shared access mode.
  - Power User Compute – allows scaling to larger clusters.
  - Legacy Shared Compute – for non-Unity Catalog workloads.
- You can also create custom policies (sketch below) to enforce:
  - Allowed VM types
  - Auto-termination rules
  - Worker/driver size
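A minimal sketch of a custom cluster policy, written as the JSON policy definition and registered via the Databricks Python SDK. The allowed node types and limits are illustrative assumptions, not recommended values.

```python
# Sketch: define and register a custom cluster policy.
import json
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

policy_definition = {
    # restrict VM types to an allowlist
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},
    # force auto-termination to at most 60 minutes, defaulting to 30
    "autotermination_minutes": {"type": "range", "maxValue": 60, "defaultValue": 30},
    # cap the number of workers
    "num_workers": {"type": "range", "maxValue": 8},
}

policy = w.cluster_policies.create(
    name="team-standard-policy",  # hypothetical policy name
    definition=json.dumps(policy_definition),
)
print(f"Created policy {policy.policy_id}")
```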
6. Important Cluster Settings
- Databricks Runtime (DBR) – pre-packaged Spark + Scala + Python + common libraries.
  - Always pick the latest LTS (Long-Term Support) version.
- Photon – native C++ vectorized engine; speeds up Spark SQL and DataFrame workloads at a higher DBU cost.
- Autoscaling – define min/max workers; the cluster grows and shrinks automatically with load.
- Auto-Termination – saves cost by shutting the cluster down after X minutes of inactivity.
- VM Types – choose compute-optimized vs memory-optimized instances based on the workload (the sketch below combines these settings).
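A sketch pulling these settings together via the Databricks Python SDK: an LTS runtime, Photon, autoscaling, and auto-termination. The cluster name and node type are placeholders, and the runtime string is assumed; pick whatever LTS version your workspace offers.

```python
# Sketch: create an autoscaling, Photon-enabled cluster with auto-termination.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

w = WorkspaceClient()

cluster = w.clusters.create(
    cluster_name="analytics-autoscaling",                        # hypothetical name
    spark_version="15.4.x-scala2.12",                            # assumed LTS runtime
    node_type_id="i3.xlarge",                                    # cloud-specific placeholder
    autoscale=compute.AutoScale(min_workers=2, max_workers=8),   # grows/shrinks with load
    runtime_engine=compute.RuntimeEngine.PHOTON,                 # enable Photon
    autotermination_minutes=30,                                  # shut down after 30 min idle
).result()
print(f"Cluster {cluster.cluster_id} is {cluster.state}")
```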
7. Monitoring & Debugging
Clusters provide:
- Event Logs – track cluster lifecycle events such as autoscaling up/down (see the sketch below).
- Spark UI – debug jobs and inspect DAG execution.
- Metrics tab – monitor CPU/memory usage.
- Driver Logs – check stdout/stderr for errors.
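A sketch of reading a cluster's event log programmatically with the Databricks Python SDK, for example to spot autoscaling (RESIZING) events. The cluster ID is a placeholder assumption.

```python
# Sketch: print a cluster's event log (type, timestamp, details).
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

for event in w.clusters.events(cluster_id="0901-123456-abcdefgh"):  # hypothetical ID
    # Each event has a type (e.g. RESIZING, TERMINATING) and a millisecond timestamp.
    print(event.timestamp, event.type, event.details)
```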
8. Key Differences: All-Purpose vs Job Compute
| Feature | All-Purpose Compute | Job Compute |
|---|---|---|
| Usage | Interactive (notebooks, SQL) | Scheduled Jobs |
| Lifecycle | Manual start/stop | Auto-create, auto-kill |
| Cost Efficiency | Less efficient if left running | More efficient |
| Best for | Dev & exploration | Production workloads |
✅ Conclusion:
- Use All-Purpose Compute for dev/test.
- Use Job Compute for scheduled production pipelines.
- Always enable auto-termination and policies to save cost.
- Prefer Unity Catalog enabled clusters (Single User / Shared) for governance.