1. What is Compute in Databricks?
- Compute = the processing power Databricks uses to run your workloads.
- In practice, compute means clusters (groups of virtual machines).
- A cluster always has (see the spec sketch below):
  - Driver node → coordinates the job and schedules work across workers.
  - Worker nodes → perform the actual data processing.
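
This anatomy shows up directly in a cluster's configuration. A minimal sketch of a cluster spec (field names are from the public Clusters API; the runtime version, node types, and counts are illustrative assumptions, not recommendations):

```python
# Minimal cluster spec: one driver node plus a pool of workers.
cluster_spec = {
    "cluster_name": "demo-cluster",           # hypothetical name
    "spark_version": "15.4.x-scala2.12",      # assumed LTS runtime; check your workspace
    "driver_node_type_id": "i3.xlarge",       # VM type for the single driver (coordinator)
    "node_type_id": "i3.xlarge",              # VM type for each worker
    "num_workers": 2,                         # workers do the actual processing
}
print(cluster_spec)
```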
 
 
2. Types of Compute in Databricks
🔹 All-Purpose Compute
- Interactive clusters used for notebooks, SQL queries, or ad-hoc jobs.
- Stay running until manually terminated or auto-terminated.
- Good for (see the creation sketch below):
  - Exploratory data analysis
  - Development
  - Testing
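
As a concrete sketch, creating an all-purpose cluster through the Clusters REST API (the endpoint and field names are from the public API; the host, token, name, and sizing are placeholder assumptions):

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # assumed placeholder
TOKEN = "<personal-access-token>"                        # assumed placeholder

# All-purpose cluster: stays up until terminated (or auto-terminates when idle).
resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "cluster_name": "dev-exploration",     # hypothetical name
        "spark_version": "15.4.x-scala2.12",   # pick the latest LTS in your workspace
        "node_type_id": "i3.xlarge",           # illustrative worker VM type
        "num_workers": 2,
        "autotermination_minutes": 30,         # shut down after 30 idle minutes
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```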
 
 
🔹 Job Compute
- Ephemeral clusters created automatically when you run a scheduled job/workflow.
- Start when the job runs → terminate as soon as it finishes.
- Good for:
  - Production workloads
  - Automated pipelines
- Saves cost, since the cluster exists only while the job runs (see the job sketch below).
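
Job compute isn't created directly; you declare a `new_cluster` inside the job definition and Databricks spins it up per run. A minimal Jobs API 2.1 sketch (job name, notebook path, and sizing are illustrative assumptions):

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # assumed placeholder
TOKEN = "<personal-access-token>"                        # assumed placeholder

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "name": "nightly-etl",                # hypothetical job name
        "tasks": [{
            "task_key": "main",
            "notebook_task": {"notebook_path": "/Repos/etl/main"},  # illustrative path
            "new_cluster": {                  # job compute: created for the run,
                "spark_version": "15.4.x-scala2.12",  # destroyed when the run ends
                "node_type_id": "i3.xlarge",
                "num_workers": 4,
            },
        }],
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["job_id"])
```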
 
🔹 Serverless Compute (preview/GA availability varies by region)
- Fully managed; no cluster size or type to configure.
- Databricks provisions and scales resources behind the scenes.
 
3. Access Modes in Compute
Access modes determine how users and Unity Catalog interact with clusters:
- Single User → cluster tied to one user; good for personal work. Supports Unity Catalog.
- Shared → multiple users can attach notebooks; Unity Catalog enabled, with isolation between users.
- No Isolation Shared → legacy option for the Hive metastore; not supported by Unity Catalog.

💡 Best practice:
- Use Shared clusters with Unity Catalog for team projects.
- Use Single User clusters for development (see the sketch below).
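
In the Clusters API, the access mode maps to the `data_security_mode` field. The mode values below are from the public API; the names and user are illustrative assumptions:

```python
# Access mode is set at creation time via data_security_mode.
single_user_cluster = {
    "cluster_name": "alice-dev",              # hypothetical name
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 1,
    "data_security_mode": "SINGLE_USER",      # Unity Catalog, tied to one user
    "single_user_name": "alice@example.com",  # assumed user
}

shared_cluster = {
    **single_user_cluster,
    "cluster_name": "team-shared",
    "data_security_mode": "USER_ISOLATION",   # Unity Catalog "Shared" mode
}
del shared_cluster["single_user_name"]        # not used in shared mode
# "NONE" corresponds to the legacy No Isolation Shared mode.
```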
 
4. Cluster Permissions
You can assign access at the cluster level (see the API sketch below):
- Can Manage → full rights (edit, delete, restart, set permissions).
- Can Restart → everything in Can Attach To, plus start/restart/terminate the cluster.
- Can Attach To → attach notebooks or SQL queries, but cannot start, stop, or modify the cluster.
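
These levels appear verbatim in the Permissions REST API (`CAN_MANAGE`, `CAN_RESTART`, `CAN_ATTACH_TO`); the cluster ID, user, and group below are illustrative assumptions:

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # assumed placeholder
TOKEN = "<personal-access-token>"                        # assumed placeholder
CLUSTER_ID = "0123-456789-abcdefgh"                      # hypothetical cluster ID

# PATCH adds/updates ACL entries without replacing the whole list.
resp = requests.patch(
    f"{HOST}/api/2.0/permissions/clusters/{CLUSTER_ID}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"access_control_list": [
        {"user_name": "alice@example.com", "permission_level": "CAN_MANAGE"},
        {"group_name": "analysts", "permission_level": "CAN_ATTACH_TO"},
    ]},
    timeout=30,
)
resp.raise_for_status()
```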
 
5. Cluster Policies
- A policy = template + restrictions for cluster creation.
- Unrestricted = full freedom (the default).
- Predefined policies:
  - Personal Compute → single node, single user.
  - Shared Compute → multi-node, shared access mode.
  - Power User Compute → allows scaling.
  - Legacy Shared → for non-Unity Catalog workloads.
- You can also create custom policies to enforce (see the policy sketch below):
  - Allowed VM types
  - Auto-termination rules
  - Worker/driver size
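
A policy definition is a JSON document of per-attribute rules. A minimal sketch using rule types from the public policy reference (`allowlist`, `range`); the specific VM types, limits, and policy name are illustrative assumptions:

```python
import json
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # assumed placeholder
TOKEN = "<personal-access-token>"                        # assumed placeholder

# Each key constrains one cluster attribute at creation time.
policy_definition = {
    "node_type_id": {                        # restrict allowed VM types
        "type": "allowlist",
        "values": ["i3.xlarge", "i3.2xlarge"],   # illustrative choices
    },
    "autotermination_minutes": {             # force auto-termination
        "type": "range",
        "minValue": 10,
        "maxValue": 60,
        "defaultValue": 30,
    },
    "autoscale.max_workers": {               # cap cluster size
        "type": "range",
        "maxValue": 8,
    },
}

resp = requests.post(
    f"{HOST}/api/2.0/policies/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"name": "team-cost-guardrails",              # hypothetical policy name
          "definition": json.dumps(policy_definition)},  # definition is a JSON string
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["policy_id"])
```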
 
 
6. Important Cluster Settings
- Databricks Runtime (DBR) → pre-packaged Spark + Scala + Python + libraries.
  - Prefer the latest LTS (Long-Term Support) version.
- Photon → native C++ engine that speeds up Spark SQL workloads, at a slightly higher cost.
- Autoscaling → define min/max workers; the cluster grows and shrinks automatically.
- Auto-Termination → saves cost by shutting the cluster down after X minutes of inactivity.
- VM Types → choose compute-optimized vs memory-optimized instances based on workload (see the combined sketch below).
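
These settings all live in the same create payload. A combined sketch (field names are from the Clusters API; the runtime version, sizes, and thresholds are illustrative assumptions):

```python
# All of the settings above, in one create-cluster payload.
tuned_cluster = {
    "cluster_name": "tuned-etl",             # hypothetical name
    "spark_version": "15.4.x-scala2.12",     # assumed current LTS; verify in your workspace
    "runtime_engine": "PHOTON",              # enable Photon (vs "STANDARD")
    "autoscale": {                           # autoscaling instead of a fixed num_workers
        "min_workers": 2,
        "max_workers": 8,
    },
    "autotermination_minutes": 30,           # auto-terminate after 30 idle minutes
    "node_type_id": "i3.xlarge",             # pick compute- vs memory-optimized per workload
}
```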
 
7. Monitoring & Debugging
Clusters provide:
- Event Logs → track events such as autoscaling up/down (see the events sketch below).
- Spark UI → debug jobs and inspect DAG execution.
- Metrics tab → monitor CPU/memory usage.
- Driver Logs → check stdout/stderr for errors.
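
The event log is also queryable programmatically. A sketch against the cluster events endpoint (endpoint and response shape are from the public Clusters API; host, token, and cluster ID are placeholder assumptions):

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # assumed placeholder
TOKEN = "<personal-access-token>"                        # assumed placeholder

# Pull recent cluster events (e.g. resize events emitted during autoscaling).
resp = requests.post(
    f"{HOST}/api/2.0/clusters/events",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"cluster_id": "0123-456789-abcdefgh",   # hypothetical cluster ID
          "limit": 25},
    timeout=30,
)
resp.raise_for_status()
for event in resp.json().get("events", []):
    print(event["timestamp"], event["type"])
```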
 
8. Key Differences: All-Purpose vs Job Compute
| Feature | All-Purpose Compute | Job Compute |
|---|---|---|
| Usage | Interactive (notebooks, SQL) | Scheduled jobs |
| Lifecycle | Manual start/stop | Auto-create, auto-terminate |
| Cost Efficiency | Less efficient if left running | More efficient |
| Best for | Dev & exploration | Production workloads |
✅ Conclusion:
- Use All-Purpose Compute for dev/test.
- Use Job Compute for scheduled production pipelines.
- Always enable auto-termination and policies to save cost.
- Prefer Unity Catalog-enabled clusters (Single User / Shared) for governance.