1. What is Compute in Databricks?
- Compute = processing power in Databricks.
- In practice, compute means clusters (a group of virtual machines).
- A cluster has:
- Driver node → runs the Spark driver; coordinates the work and collects results.
- Worker nodes → run Spark executors that perform the actual data processing.
- (A single-node cluster runs everything on the driver.)
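A minimal PySpark sketch of that split, runnable as-is in a Databricks notebook (where a `spark` session is predefined): the driver only plans the job and receives the result, while the executors on the workers scan and aggregate the partitions.

```python
# Runs in a Databricks notebook, where a SparkSession named `spark` already exists.
# The driver builds the query plan and receives the final result;
# the row generation and summation run as executor tasks on the worker nodes.
df = spark.range(0, 100_000_000)               # distributed dataset, split into partitions
total = df.selectExpr("sum(id)").first()[0]    # workers aggregate, driver collects one row
print(total)
```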
2. Types of Compute in Databricks
🔹 All-Purpose Compute
- Interactive clusters used for notebooks, SQL queries, or ad-hoc jobs.
- Stay running until manually terminated or auto-terminated.
- Good for:
- Exploratory data analysis
- Development
- Testing
🔹 Job Compute
- Ephemeral clusters created automatically when you run a scheduled job/workflow.
- Start when the job runs → terminate immediately after.
- Good for:
- Production workloads
- Automated pipelines
- Saves cost since cluster exists only while the job runs.
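A hedged sketch of how an ephemeral job cluster is declared, using the Jobs REST API (field names follow the Jobs 2.1 API; the host, token, notebook path, runtime, and node type below are placeholders/examples):

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
TOKEN = "<personal-access-token>"                          # placeholder

# Jobs API 2.1: the `new_cluster` block defines an ephemeral job cluster that is
# created when the run starts and terminated as soon as the run finishes.
job_spec = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "etl",
            "notebook_task": {"notebook_path": "/Repos/team/etl/nightly"},  # placeholder path
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",   # example LTS runtime; adjust to your workspace
                "node_type_id": "i3.xlarge",           # example node type
                "num_workers": 2,
            },
        }
    ],
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
print(resp.json())   # returns the new job_id on success
```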
🔹 Serverless Compute (availability varies by region and workload type)
- Fully managed; no cluster size or VM type to configure.
- Databricks provisions and scales the resources behind the scenes.
3. Access Modes in Compute
Access modes determine how users and Unity Catalog interact with clusters:
- Single User → cluster tied to one user; supports Unity Catalog; good for personal work.
- Shared → multiple users can attach notebooks; Unity Catalog enabled with user isolation.
- No Isolation Shared → legacy option for the Hive metastore; not supported by Unity Catalog.
💡 Best practice:
- Use Shared clusters with Unity Catalog for team projects.
- Use Single User clusters for development.
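In the Clusters API, the access mode corresponds to the `data_security_mode` field. A sketch of the relevant part of a cluster spec (runtime, node type, and user name are placeholders):

```python
# Fragment of a cluster spec (posted to POST /api/2.1/clusters/create).
# data_security_mode selects the access mode:
#   "SINGLE_USER"    -> Single User (also set single_user_name)
#   "USER_ISOLATION" -> Shared (Unity Catalog with user isolation)
#   "NONE"           -> No Isolation Shared (legacy, no Unity Catalog)
single_user_spec = {
    "cluster_name": "alice-dev",
    "spark_version": "15.4.x-scala2.12",     # example LTS runtime
    "node_type_id": "i3.xlarge",              # example node type
    "num_workers": 1,
    "data_security_mode": "SINGLE_USER",
    "single_user_name": "alice@example.com",  # placeholder user
}
```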
4. Cluster Permissions
You can assign access at the cluster level:
- Can Manage → full rights (edit, delete, restart, change permissions).
- Can Restart → attach notebooks plus start/restart/terminate; cannot edit the configuration.
- Can Attach To → attach notebooks or SQL queries only; cannot start/stop or modify.
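These levels can be granted through the Permissions REST API as well as the UI. A hedged sketch (cluster ID, group, and user names are placeholders):

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
TOKEN = "<personal-access-token>"                          # placeholder
CLUSTER_ID = "0801-123456-abcd123"                         # placeholder

# Permissions API: PATCH adds/updates entries without replacing the whole ACL.
acl = {
    "access_control_list": [
        {"group_name": "data-engineers", "permission_level": "CAN_RESTART"},        # placeholder group
        {"user_name": "analyst@example.com", "permission_level": "CAN_ATTACH_TO"},  # placeholder user
    ]
}

resp = requests.patch(
    f"{HOST}/api/2.0/permissions/clusters/{CLUSTER_ID}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=acl,
)
print(resp.status_code)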
5. Cluster Policies
- A policy = template + restrictions for cluster creation.
- Unrestricted = full freedom (default).
- Predefined Policies:
- Personal Compute → single-node, single-user clusters.
- Shared Compute → multi-node clusters in shared access mode.
- Power User Compute → larger multi-node clusters with autoscaling for heavier individual workloads.
- Legacy Shared Compute → for workloads not yet on Unity Catalog.
- You can also create custom policies to enforce:
- Allowed VM types
- Auto-termination rules
- Worker/driver size
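A hedged sketch of a custom policy enforcing exactly those constraints, using the Cluster Policies REST API (host, token, policy name, and VM types are placeholders; the `definition` is sent as a JSON string):

```python
import json
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
TOKEN = "<personal-access-token>"                          # placeholder

# Policy definition: each key constrains one cluster attribute.
definition = {
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},      # allowed VM types
    "autotermination_minutes": {"type": "range", "maxValue": 60, "defaultValue": 30},  # force auto-termination
    "num_workers": {"type": "range", "maxValue": 8},                                   # cap worker count
    "driver_node_type_id": {"type": "fixed", "value": "i3.xlarge"},                    # pin the driver size
}

resp = requests.post(
    f"{HOST}/api/2.0/policies/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"name": "team-cost-guardrails", "definition": json.dumps(definition)},
)
print(resp.json())   # returns the policy_id on success
```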
6. Important Cluster Settings
- Databricks Runtime (DBR) → Pre-packaged Spark + Scala + Python + libraries.
- Always pick the latest LTS (Long-Term Support) version.
- Photon → native vectorized query engine written in C++; speeds up SQL and DataFrame workloads but consumes DBUs at a higher rate.
- Autoscaling → Define min/max workers; cluster grows/shrinks automatically.
- Auto-Termination → saves cost by shutting the cluster down after N minutes of inactivity.
- VM Types → Choose compute optimized vs memory optimized based on workload.
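A hedged sketch pulling these settings together into one clusters/create call (host, token, runtime version, and node types are placeholders/examples; check what your workspace actually offers):

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
TOKEN = "<personal-access-token>"                          # placeholder

cluster_spec = {
    "cluster_name": "team-etl-dev",
    "spark_version": "15.4.x-scala2.12",                # example LTS Databricks Runtime
    "runtime_engine": "PHOTON",                          # enable Photon ("STANDARD" to disable)
    "autoscale": {"min_workers": 2, "max_workers": 8},   # autoscaling bounds
    "autotermination_minutes": 30,                       # stop after 30 min of inactivity
    "node_type_id": "i3.xlarge",                         # worker VM type (example)
    "driver_node_type_id": "i3.xlarge",                  # driver VM type (example)
    "data_security_mode": "USER_ISOLATION",              # Shared access mode (Unity Catalog)
}

resp = requests.post(
    f"{HOST}/api/2.1/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
print(resp.json())   # returns the cluster_id on success
```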
7. Monitoring & Debugging
Clusters provide:
- Event Logs → track cluster lifecycle events (creation, autoscaling up/down, termination).
- Spark UI → debug jobs and see DAG execution.
- Metrics tab → monitor CPU/memory usage.
- Driver Logs → check stdout, stderr for errors.
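Event logs can also be pulled programmatically. A hedged sketch using the Clusters events endpoint (host, token, cluster ID are placeholders; the event types listed are examples):

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
TOKEN = "<personal-access-token>"                          # placeholder
CLUSTER_ID = "0801-123456-abcd123"                         # placeholder

# Pull the most recent resize/termination events for a cluster.
resp = requests.post(
    f"{HOST}/api/2.1/clusters/events",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "cluster_id": CLUSTER_ID,
        "event_types": ["RESIZING", "UPSIZE_COMPLETED", "TERMINATING"],
        "limit": 25,
    },
)
for event in resp.json().get("events", []):
    print(event["timestamp"], event["type"], event.get("details", {}))
```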
8. Key Differences: All Purpose vs Job Compute
| Feature | All Purpose Compute | Job Compute |
|---|---|---|
| Usage | Interactive (notebooks, SQL) | Scheduled jobs |
| Lifecycle | Manual start/stop | Auto-create, auto-terminate |
| Cost efficiency | Less efficient if left running | More efficient |
| Best for | Dev & exploration | Production workloads |
✅ Conclusion:
- Use All Purpose Compute for dev/test.
- Use Job Compute for scheduled production pipelines.
- Always enable auto-termination and policies to save cost.
- Prefer Unity Catalog enabled clusters (Single User / Shared) for governance.