Databricks: Custom Cluster Policies & Instance Pools in Databricks


1. 🔹 Why Policies and Pools?

  • Policies → Standardize and enforce cluster configurations across your organization.
  • Instance Pools → Pre-create and reuse VMs to reduce startup time for clusters.

These features are critical in enterprise Databricks deployments to enforce compliance, control costs, and improve performance.


2. Custom Cluster Policies in Databricks

📌 What is a Cluster Policy?

  • A JSON template that defines allowed, fixed, or forbidden cluster settings.
  • Ensures users follow org standards (e.g., fixed runtime, mandatory auto-termination).

🛠 How to Create a Custom Policy

  1. Go to Admin Console → Cluster Policies.
  2. Instead of creating from scratch, clone an existing policy (e.g., Shared Compute).
  3. Edit the JSON to override settings. Example:
{
  "autotermination_minutes": {
    "type": "fixed",
    "value": 10
  },
  "num_workers": {
    "type": "fixed",
    "value": 1
  },
  "autoscale.min_workers": {
    "type": "forbidden"
  },
  "autoscale.max_workers": {
    "type": "forbidden"
  },
  "spark_version": {
    "type": "fixed",
    "value": "15.4.x-scala2.12"
  },
  "node_type_id": {
    "type": "allowlist",
    "values": ["Standard_DS3_v2", "Standard_DS4_v2"],
    "defaultValue": "Standard_DS4_v2"
  }
}

🔑 Explanation:

  • Auto termination → Always 10 minutes.
  • Fixed workers → Exactly 1 worker; no autoscaling allowed.
  • Fixed runtime → Databricks Runtime 15.4 only.
  • Restricted VM types → Only Standard_DS3_v2 or Standard_DS4_v2 allowed (DS4_v2 is the default).
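
To make the rule types concrete, here is a minimal local sketch of how a cluster configuration could be checked against the example policy. It is not Databricks' own validation logic. The helper names `flatten` and `violations` are hypothetical, the node type rule is written with the `allowlist` type, and only the three rule types used in the example are handled.

```python
from typing import Any

# Policy definition based on the example (types: fixed, forbidden, allowlist).
POLICY: dict[str, dict[str, Any]] = {
    "autotermination_minutes": {"type": "fixed", "value": 10},
    "num_workers": {"type": "fixed", "value": 1},
    "autoscale.min_workers": {"type": "forbidden"},
    "autoscale.max_workers": {"type": "forbidden"},
    "spark_version": {"type": "fixed", "value": "15.4.x-scala2.12"},
    "node_type_id": {
        "type": "allowlist",
        "values": ["Standard_DS3_v2", "Standard_DS4_v2"],
        "defaultValue": "Standard_DS4_v2",
    },
}


def flatten(config: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Turn nested keys into dotted paths, e.g. autoscale.min_workers."""
    flat: dict[str, Any] = {}
    for key, value in config.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


def violations(config: dict[str, Any], policy: dict[str, dict[str, Any]]) -> list[str]:
    """Return human-readable policy violations for a cluster config (hypothetical helper)."""
    flat = flatten(config)
    problems: list[str] = []
    for path, rule in policy.items():
        if path not in flat:
            continue  # this sketch only checks attributes the user actually set
        kind = rule["type"]
        if kind == "forbidden":
            problems.append(f"{path}: not allowed")
        elif kind == "fixed" and flat[path] != rule["value"]:
            problems.append(f"{path}: must be {rule['value']!r}")
        elif kind == "allowlist" and flat[path] not in rule["values"]:
            problems.append(f"{path}: must be one of {rule['values']}")
    return problems


if __name__ == "__main__":
    requested = {
        "spark_version": "14.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "autoscale": {"min_workers": 1, "max_workers": 4},
    }
    for problem in violations(requested, POLICY):
        print(problem)
```

A real workspace applies these rules server-side when the cluster is created or edited; a local check like this is only useful for reviewing a policy before publishing it.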

📌 How to Apply Policy to New Clusters

  • When creating a cluster, select Policy → Custom Policy Name.
  • The UI locks fixed fields (e.g., Spark version), disables forbidden ones (e.g., autoscaling), and limits node type to the allowed values.

📌 Enforcing Policy on Existing Clusters

  • If the policy changes (e.g., the runtime version is updated), existing clusters that use it show “Non-compliant”.
  • Click Fix All → Databricks auto-updates them to comply.
  • Example: Changing the runtime version from 14.3 → 15.4 updates all linked clusters.

✅ This brings every linked cluster back into compliance in one step.
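
The compliance check can be pictured with a small local model. This is a sketch under stated assumptions, not the Databricks API: the cluster records are plain dictionaries, and `non_compliant` and `fix_all` are hypothetical helpers that only compare the policy's fixed values.

```python
from typing import Any


def non_compliant(clusters: list[dict[str, Any]], fixed_values: dict[str, Any]) -> list[str]:
    """Names of clusters whose settings differ from the policy's fixed values."""
    return [
        cluster["cluster_name"]
        for cluster in clusters
        if any(cluster.get(key) != value for key, value in fixed_values.items())
    ]


def fix_all(clusters: list[dict[str, Any]], fixed_values: dict[str, Any]) -> list[dict[str, Any]]:
    """Return copies of the clusters with the fixed values overwritten."""
    return [{**cluster, **fixed_values} for cluster in clusters]


if __name__ == "__main__":
    # Hypothetical clusters created under the old policy (runtime 14.3).
    clusters = [
        {"cluster_name": "etl-nightly", "spark_version": "14.3.x-scala2.12"},
        {"cluster_name": "bi-adhoc", "spark_version": "15.4.x-scala2.12"},
    ]
    new_fixed = {"spark_version": "15.4.x-scala2.12"}

    print(non_compliant(clusters, new_fixed))
    print(non_compliant(fix_all(clusters, new_fixed), new_fixed))
```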


3. Instance Pools in Databricks

📌 What is an Instance Pool?

  • A predefined set of VMs ready to be attached to clusters.
  • Benefit → Reduce startup time (clusters don’t need to wait for VM provisioning).
  • Clusters draw workers from the pool instead of requesting fresh VMs.

🛠 How to Create an Instance Pool

  1. Go to Compute → Instance Pools → Create Pool.
  2. Configure:
    • Min Idle Instances → Always running. Keeps pool “warm.”
      • Example: 2 = always 2 ready VMs.
    • Max Capacity → Upper limit of VMs in the pool.
      • Example: 10 = pool can scale up to 10 nodes.
    • Idle Auto Termination → Time (mins) after which unused VMs shut down.
    • Node Type → VM family (e.g., DS4_v2).
    • Databricks Runtime (DBR) → Pre-load runtime for faster attach.

📌 Warm vs Cold Pools

  • Warm Pool → Min Idle > 0 (e.g., 2 VMs always running).
    • ✅ Fast startup (no wait for VM provisioning).
    • ❌ Higher cost (pay for idle VMs).
  • Cold Pool → Min Idle = 0.
    • ✅ Cost-efficient (no idle VMs).
    • ❌ Slower startup (still need to spin up VMs).
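
The cost side of this trade-off is simple arithmetic, sketched below. The hourly rate is a made-up placeholder, not a real price, and the function name `monthly_idle_cost` is hypothetical. The sketch counts only the cloud provider's VM charge for instances kept idle by the Min Idle setting.

```python
def monthly_idle_cost(
    min_idle_instances: int,
    hourly_vm_rate: float,
    hours_per_month: float = 730.0,
) -> float:
    """Approximate monthly cost of keeping `min_idle_instances` VMs warm."""
    if min_idle_instances < 0:
        raise ValueError("min_idle_instances must be >= 0")
    return min_idle_instances * hourly_vm_rate * hours_per_month


if __name__ == "__main__":
    HYPOTHETICAL_RATE = 0.50  # assumed VM price per hour, for illustration only

    warm = monthly_idle_cost(2, HYPOTHETICAL_RATE)  # warm pool: Min Idle = 2
    cold = monthly_idle_cost(0, HYPOTHETICAL_RATE)  # cold pool: Min Idle = 0
    print(f"warm pool idle cost per month: {warm:.2f}")
    print(f"cold pool idle cost per month: {cold:.2f}")
```

Comparing that figure with the value of the minutes saved per cluster start is one way to decide whether a warm pool is worth it for a given workload.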

Example: Warm Instance Pool

Name: demo-pool
Min Idle Instances: 1
Max Capacity: 10
Idle Auto Termination: 10 mins
Node Type: Standard_DS4_v2
Runtime: 15.4 LTS

  • At least 1 VM always running.
  • Jobs start quickly by borrowing the warm node.
  • Released nodes wait 10 mins before termination → reused if another job comes.
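
The same pool can be described as a request body for automation. The sketch below only builds and validates the payload and makes no network call. The field names follow the Instance Pools REST API as I understand it, so verify them against the API reference before use. `build_pool_request` is a hypothetical helper, and the runtime key `15.4.x-scala2.12` is assumed to correspond to 15.4 LTS.

```python
import json
from typing import Any


def build_pool_request(
    name: str,
    node_type_id: str,
    min_idle: int,
    max_capacity: int,
    idle_autotermination_minutes: int,
    runtime_key: str,
) -> dict[str, Any]:
    """Build a pool-creation payload after basic sanity checks."""
    if not 0 <= min_idle <= max_capacity:
        raise ValueError("min_idle must be between 0 and max_capacity")
    if idle_autotermination_minutes < 0:
        raise ValueError("idle_autotermination_minutes must be >= 0")
    return {
        "instance_pool_name": name,
        "node_type_id": node_type_id,
        "min_idle_instances": min_idle,
        "max_capacity": max_capacity,
        "idle_instance_autotermination_minutes": idle_autotermination_minutes,
        "preloaded_spark_versions": [runtime_key],
    }


if __name__ == "__main__":
    payload = build_pool_request(
        name="demo-pool",
        node_type_id="Standard_DS4_v2",
        min_idle=1,
        max_capacity=10,
        idle_autotermination_minutes=10,
        runtime_key="15.4.x-scala2.12",
    )
    print(json.dumps(payload, indent=2))
```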

4. Best Practices

✅ Use Custom Policies to:

  • Enforce auto-termination (prevent zombie clusters).
  • Fix runtime versions (e.g., always LTS).
  • Restrict node types to control cost.
  • Disable autoscaling if not needed.

✅ Use Warm Pools for:

  • Low-latency SLA jobs (e.g., real-time ETL, streaming, dashboards).

✅ Use Cold Pools for:

  • Batch jobs that can tolerate 2–5 min startup delay.

5. Key Differences: Policy vs Pool

Feature     | Cluster Policy 🚦                 | Instance Pool 🏊
Purpose     | Enforce rules                    | Reduce startup time
Controls    | Runtime, nodes, auto-termination | VM availability
Cost Impact | Avoids misuse                    | May add idle VM costs
Governance  | Compliance tool                  | Performance tool

✅ Conclusion:

  • Use Policies for governance and cost control.
  • Use Pools to optimize SLA and startup latency.
  • Combine both: Policy + Pool-backed clusters = controlled + fast compute.
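
One way to combine the two is a policy that pins clusters to a pool. The sketch below assumes that `instance_pool_id` can be fixed in a policy definition and that the policy-creation request takes the definition as a JSON-encoded string. Both reflect my reading of the Cluster Policies API and should be checked against the reference. `POOL_ID`, the policy name, and the helper names are hypothetical placeholders.

```python
import json
from typing import Any

POOL_ID = "pool-id-placeholder"  # hypothetical; use the ID of your own pool


def pool_backed_policy(pool_id: str) -> dict[str, Any]:
    """Policy definition that forces clusters onto one pool and one runtime."""
    return {
        "instance_pool_id": {"type": "fixed", "value": pool_id},
        "autotermination_minutes": {"type": "fixed", "value": 10},
        "spark_version": {"type": "fixed", "value": "15.4.x-scala2.12"},
    }


def policy_create_body(name: str, definition: dict[str, Any]) -> dict[str, str]:
    """Wrap the definition as a JSON string inside the request body."""
    return {"name": name, "definition": json.dumps(definition)}


if __name__ == "__main__":
    body = policy_create_body("pool-backed-policy", pool_backed_policy(POOL_ID))
    print(json.dumps(body, indent=2))
```

Because a pool-backed cluster takes its VM type from the pool, a policy like this would normally restrict the pool ID rather than `node_type_id`.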
