Databricks Compute: Clusters, Access Modes, Policies, and Permissions


1. What is Compute in Databricks?

  • Compute = processing power in Databricks.
  • In practice, compute usually means clusters (groups of virtual machines).
  • A multi-node cluster has:
    • Driver node → coordinates the job.
    • Worker nodes → perform the actual data processing.
  • A single-node cluster has only a driver, which also runs the processing.

2. Types of Compute in Databricks

🔹 All-Purpose Compute

  • Interactive clusters used for notebooks, SQL queries, or ad-hoc jobs.
  • Stay running until manually terminated or auto-terminated.
  • Good for:
    • Exploratory data analysis
    • Development
    • Testing

🔹 Job Compute

  • Ephemeral clusters created automatically when a job/workflow run starts.
  • Start when the job runs → terminate as soon as the run finishes.
  • Good for:
    • Production workloads
    • Automated pipelines
  • Saves cost since cluster exists only while the job runs.
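
As a rough sketch of what "created automatically" looks like in practice, a job task can carry its own `new_cluster` block instead of pointing at an existing cluster. The helper name `build_job_payload`, the job name, the notebook path, the runtime string, and the node type below are all placeholders chosen for illustration; check the field names against the Jobs API version you use.

```python
import json


def build_job_payload(job_name: str, notebook_path: str, num_workers: int) -> dict:
    """Build a Jobs-API-style payload whose task runs on a fresh job cluster."""
    if num_workers < 0:
        raise ValueError("num_workers must be zero or positive")
    return {
        "name": job_name,
        "tasks": [
            {
                "task_key": "main",
                "notebook_task": {"notebook_path": notebook_path},
                # new_cluster requests an ephemeral cluster for this run,
                # as opposed to attaching to an already running cluster.
                "new_cluster": {
                    "spark_version": "15.4.x-scala2.12",  # placeholder runtime
                    "node_type_id": "i3.xlarge",  # placeholder VM type
                    "num_workers": num_workers,
                },
            }
        ],
    }


# Hypothetical job name and notebook path, for illustration only.
payload = build_job_payload("nightly-etl", "/Workspace/etl/nightly", 2)
print(json.dumps(payload, indent=2))
```

Because the cluster definition lives inside the task, nothing is left running once the run ends.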

🔹 Serverless Compute (availability varies by cloud and region)

  • Fully managed, no need to configure cluster size/type.
  • Databricks decides resources behind the scenes.

3. Access Modes in Compute

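Access modes determine who can use a cluster and whether it can read data governed by Unity Catalog:

  • Single User (labelled Dedicated in newer workspaces) → Cluster assigned to one user; supports Unity Catalog; good for personal work.
  • Shared (labelled Standard in newer workspaces) → Multiple users can attach notebooks, isolated from each other; supports Unity Catalog.
  • No Isolation Shared → Legacy option for Hive metastore workloads; not supported by Unity Catalog.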

💡 Best practice:

  • Use Shared clusters with Unity Catalog for team projects.
  • Use Single User clusters for development.
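
When clusters are created through the API rather than the UI, the access mode is, to my understanding, carried by the `data_security_mode` field, with `single_user_name` naming the owner of a Single User cluster. The sketch below maps the UI labels used above to those values. The helper `access_mode_fields` and the example user are hypothetical, and the mapping should be checked against the Clusters API reference.

```python
# Assumed mapping from the UI labels to Clusters API values.
ACCESS_MODE_TO_API = {
    "Single User": "SINGLE_USER",
    "Shared": "USER_ISOLATION",
    "No Isolation Shared": "NONE",
}

# Modes that can be used with Unity Catalog, per the list above.
UNITY_CATALOG_MODES = {"Single User", "Shared"}


def access_mode_fields(mode, single_user_name=None):
    """Return the cluster-spec fields that select the given access mode."""
    if mode not in ACCESS_MODE_TO_API:
        raise ValueError(f"unknown access mode: {mode}")
    fields = {"data_security_mode": ACCESS_MODE_TO_API[mode]}
    if mode == "Single User":
        if not single_user_name:
            raise ValueError("Single User mode needs the owning user's name")
        fields["single_user_name"] = single_user_name
    return fields


# Hypothetical user, for illustration only.
print(access_mode_fields("Single User", "dev@example.com"))
print(access_mode_fields("Shared"))
print("No Isolation Shared" in UNITY_CATALOG_MODES)
```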

4. Cluster Permissions

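You can assign access at the cluster level. Each level includes the rights of the levels listed after it:

  • Can Manage → Full rights (edit, delete, restart, change permissions).
  • Can Restart → Start, restart, and terminate the cluster.
  • Can Attach To → Attach notebooks or run SQL queries, but cannot start, stop, or modify the cluster.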
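
The three levels form a ladder: a higher level can do everything a lower one can. Here is a minimal sketch of that ordering, plus an access-control list in the shape the Permissions API expects as far as I know (`access_control_list` entries with `group_name` and `permission_level`). The action names, the helper functions, and the group names are illustrative assumptions, not Databricks terms.

```python
# Ordered from least to most privileged; each level includes the ones before it.
LEVELS = ["CAN_ATTACH_TO", "CAN_RESTART", "CAN_MANAGE"]

# Minimum level needed per action (action names are illustrative only).
MIN_LEVEL = {
    "attach": "CAN_ATTACH_TO",
    "start": "CAN_RESTART",
    "restart": "CAN_RESTART",
    "terminate": "CAN_RESTART",
    "edit": "CAN_MANAGE",
    "delete": "CAN_MANAGE",
}


def is_allowed(level: str, action: str) -> bool:
    """True when the given permission level covers the action."""
    return LEVELS.index(level) >= LEVELS.index(MIN_LEVEL[action])


def build_acl(grants: dict) -> dict:
    """Turn {group_name: permission_level} into an access-control-list payload."""
    for level in grants.values():
        if level not in LEVELS:
            raise ValueError(f"unknown permission level: {level}")
    return {
        "access_control_list": [
            {"group_name": group, "permission_level": level}
            for group, level in grants.items()
        ]
    }


# Hypothetical groups, for illustration only.
acl = build_acl({"data-engineers": "CAN_RESTART", "analysts": "CAN_ATTACH_TO"})
print(acl)
print(is_allowed("CAN_ATTACH_TO", "restart"))
print(is_allowed("CAN_MANAGE", "attach"))
```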

5. Cluster Policies

  • A policy = template + restrictions for cluster creation.
  • Unrestricted = no limits on configuration; available only to users allowed to create unrestricted clusters (workspace admins by default).
  • Predefined Policies:
    • Personal Compute → single node, single user.
    • Shared Compute → multi-node, shared mode.
    • Power User Compute → larger multi-node clusters for a single user.
    • Legacy Shared → for non-Unity Catalog workloads.
  • You can also create custom policies to enforce:
    • Allowed VM types
    • Auto-termination rules
    • Worker/driver size
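
A custom policy is written as a JSON definition that maps cluster attributes to rules. The sketch below covers the three restrictions listed above using the `allowlist`, `range`, and `fixed` rule types. The node types and limits are made-up values, and `find_violations` is a hypothetical local checker written for this post; it only imitates the idea and is not how Databricks enforces policies.

```python
import json

# Illustrative policy definition; node types and limits are placeholders.
policy_definition = {
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},
    "driver_node_type_id": {"type": "fixed", "value": "i3.xlarge"},
    "autotermination_minutes": {"type": "range", "minValue": 10, "maxValue": 60},
    "autoscale.max_workers": {"type": "range", "maxValue": 8},
}


def lookup(spec: dict, dotted_key: str):
    """Resolve a dotted attribute path such as 'autoscale.max_workers'."""
    value = spec
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def find_violations(spec: dict, definition: dict) -> list:
    """List the attributes in spec that break a rule (missing ones are skipped)."""
    problems = []
    for key, rule in definition.items():
        value = lookup(spec, key)
        if value is None:
            continue
        kind = rule["type"]
        if kind == "allowlist" and value not in rule["values"]:
            problems.append(f"{key}: {value} is not in the allowlist")
        elif kind == "fixed" and value != rule["value"]:
            problems.append(f"{key}: must be {rule['value']}")
        elif kind == "range":
            if "minValue" in rule and value < rule["minValue"]:
                problems.append(f"{key}: {value} is below {rule['minValue']}")
            if "maxValue" in rule and value > rule["maxValue"]:
                problems.append(f"{key}: {value} is above {rule['maxValue']}")
    return problems


requested = {
    "node_type_id": "m5.large",
    "driver_node_type_id": "i3.xlarge",
    "autotermination_minutes": 120,
    "autoscale": {"min_workers": 1, "max_workers": 4},
}
print(json.dumps(policy_definition, indent=2))
print(find_violations(requested, policy_definition))
```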

6. Important Cluster Settings

  • Databricks Runtime (DBR) → Pre-packaged Spark + Scala + Python + libraries.
    • Prefer the latest LTS (Long-Term Support) version for production unless you need a feature from a newer release.
  • Photon → Native vectorized engine written in C++; speeds up Spark SQL and DataFrame workloads, but is billed at a higher DBU rate.
  • Autoscaling → Define min/max workers; cluster grows/shrinks automatically.
  • Auto-Termination → Saves cost by shutting the cluster down after X minutes of inactivity.
  • VM Types → Choose compute-optimized or memory-optimized instances based on the workload.
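
Put together, these settings map onto a handful of fields in a cluster specification. The sketch below assumes the Clusters API field names `spark_version`, `runtime_engine`, `autoscale`, `autotermination_minutes`, and `node_type_id`. The helper `build_cluster_spec`, the cluster name, the runtime string, and the node type are placeholders to adapt to your cloud.

```python
def build_cluster_spec(name, min_workers, max_workers, idle_minutes, use_photon=True):
    """Assemble a cluster spec covering runtime, Photon, autoscaling, and idle shutdown."""
    if not 1 <= min_workers <= max_workers:
        raise ValueError("need 1 <= min_workers <= max_workers")
    if idle_minutes <= 0:
        raise ValueError("idle_minutes must be positive")
    spec = {
        "cluster_name": name,
        "spark_version": "15.4.x-scala2.12",  # placeholder LTS runtime string
        "node_type_id": "i3.xlarge",  # placeholder VM type
        # The cluster scales between these bounds as load changes.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        # Shut the cluster down after this many idle minutes.
        "autotermination_minutes": idle_minutes,
    }
    if use_photon:
        spec["runtime_engine"] = "PHOTON"
    return spec


# Hypothetical cluster name, for illustration only.
spec = build_cluster_spec("team-dev", min_workers=1, max_workers=4, idle_minutes=30)
print(spec)
```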

7. Monitoring & Debugging

Clusters provide:

  • Event Logs → track cluster lifecycle events such as start, termination, and autoscaling up/down.
  • Spark UI → debug jobs and see DAG execution.
  • Metrics tab → monitor CPU/memory usage.
  • Driver Logs → check stdout, stderr for errors.
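
If you export the event log (for example through the cluster events API), picking out the autoscaling activity is a simple filter. This sketch assumes events arrive as dictionaries with `timestamp` and `type` keys, and that `RESIZING` and `UPSIZE_COMPLETED` are the type names of interest. The sample events are invented for illustration and are not real cluster output.

```python
# Event types assumed to indicate autoscaling activity.
SCALING_EVENT_TYPES = {"RESIZING", "UPSIZE_COMPLETED"}


def scaling_events(events):
    """Return only the scaling-related events, oldest first."""
    picked = [event for event in events if event.get("type") in SCALING_EVENT_TYPES]
    return sorted(picked, key=lambda event: event["timestamp"])


# Invented sample data (epoch milliseconds), for illustration only.
sample_events = [
    {"timestamp": 1700000300000, "type": "UPSIZE_COMPLETED"},
    {"timestamp": 1700000000000, "type": "RUNNING"},
    {"timestamp": 1700000100000, "type": "RESIZING"},
    {"timestamp": 1700000900000, "type": "TERMINATING"},
]

for event in scaling_events(sample_events):
    print(event["timestamp"], event["type"])
```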

8. Key Differences: All Purpose vs Job Compute

Feature         | All Purpose Compute            | Job Compute
Usage           | Interactive (notebooks, SQL)   | Scheduled Jobs
Lifecycle       | Manual start/stop              | Auto-create, auto-kill
Cost Efficiency | Less efficient if left running | More efficient
Best for        | Dev & exploration              | Production workloads
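
To make the cost row concrete, here is some back-of-the-envelope arithmetic. Every number is hypothetical: the DBU rate per node and the price per DBU are placeholders, the same rate is used for both cases to isolate the effect of running time, and cloud VM charges are left out entirely.

```python
def dbu_cost(hours_running, nodes, dbu_per_node_hour, price_per_dbu):
    """DBU charge = hours x nodes x DBUs per node-hour x price per DBU."""
    if min(hours_running, nodes, dbu_per_node_hour, price_per_dbu) < 0:
        raise ValueError("inputs must be non-negative")
    return hours_running * nodes * dbu_per_node_hour * price_per_dbu


# Hypothetical scenario: a 3-node cluster doing 2 hours of real work per day.
always_on = dbu_cost(hours_running=24, nodes=3, dbu_per_node_hour=1.0, price_per_dbu=0.5)
job_only = dbu_cost(hours_running=2, nodes=3, dbu_per_node_hour=1.0, price_per_dbu=0.5)

print(f"left running all day: {always_on:.2f}")
print(f"job cluster only:     {job_only:.2f}")
print(f"difference:           {always_on - job_only:.2f}")
```

The gap comes entirely from the idle hours, which is why auto-termination matters so much on all-purpose clusters.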

Conclusion:

  • Use All Purpose Compute for dev/test.
  • Use Job Compute for scheduled production pipelines.
  • Always enable auto-termination and policies to save cost.
  • Prefer Unity Catalog enabled clusters (Single User / Shared) for governance.
