Databricks Cost Optimization: 12 Strategies to Cut Your Cloud Bill
Databricks is powerful — and like all powerful infrastructure, it can get expensive fast. Teams routinely discover surprise bills from idle clusters, over-provisioned nodes, and inefficient queries. In this guide, we cover 12 actionable strategies that data engineering teams use to reduce Databricks spending without sacrificing performance.
Understanding Databricks Pricing
Before optimizing, understand what drives costs:
| Cost Driver | Description |
|---|---|
| DBUs (Databricks Units) | Compute time × node tier (Jobs, All-Purpose, SQL) |
| Cloud VMs | Underlying AWS/Azure/GCP instance costs |
| Storage | Delta Lake files in cloud object storage |
| Network egress | Cross-region data transfer |
Jobs compute is typically 2–3× cheaper than All-Purpose compute for the same workload — this single insight drives several optimizations below.
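To see why the compute tier matters, here is a back-of-the-envelope comparison. The DBU rates and the per-node DBU figure are illustrative assumptions, not quoted prices:

```python
# Back-of-the-envelope tier comparison. Rates are assumptions -- check
# your own contract and region for actual pricing.
ALL_PURPOSE_RATE = 0.55  # $/DBU on All-Purpose compute (assumed)
JOBS_RATE = 0.15         # $/DBU on Jobs compute (assumed)

def job_cost(dbu_per_hour: float, hours: float, rate: float) -> float:
    """Cost in USD of a workload consuming dbu_per_hour for `hours`."""
    return dbu_per_hour * hours * rate

dbus = 4 * 0.75  # 4 workers at ~0.75 DBU/hour each (assumed node tier)
print(f"All-Purpose: ${job_cost(dbus, 2, ALL_PURPOSE_RATE):.2f}")
print(f"Jobs:        ${job_cost(dbus, 2, JOBS_RATE):.2f}")
```

The same two-hour workload costs several times less on Jobs compute purely because of the DBU rate difference.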
Strategy 1: Kill Idle All-Purpose Clusters
The #1 cost killer: clusters left running after work hours or during lunch. A forgotten 4-node cluster at $0.40/DBU (assuming one DBU per node-hour) × 4 nodes × 10 idle hours = $16/day, or about $5,840/year.
Set aggressive auto-termination:
```python
import requests

DATABRICKS_HOST = "https://adb-WORKSPACE_ID.azuredatabricks.net"
TOKEN = "dapi..."

# Note: clusters/edit expects the full cluster spec, not just the fields
# you want to change -- fetch the current config with clusters/get first.
response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/edit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "cluster_id": "YOUR_CLUSTER_ID",
        "spark_version": "14.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 4,
        "autotermination_minutes": 20,
    },
)
response.raise_for_status()
```
Default cluster policies (enforce via Cluster Policies):
```json
{
  "autotermination_minutes": {
    "type": "fixed",
    "value": 30
  }
}
```
Strategy 2: Move Batch Jobs to Jobs Compute
All-Purpose clusters are for interactive development. Production jobs should always use Jobs compute (job clusters) — they're cheaper and isolated:
```yaml
# In your Databricks Asset Bundle (databricks.yml)
resources:
  jobs:
    nightly_etl:
      tasks:
        - task_key: transform
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
            num_workers: 4
```
Jobs clusters spin up for the job, run, and terminate — you pay only for runtime.
Strategy 3: Use Spot/Preemptible Instances
Spot instances (AWS) or Spot VMs (Azure) are typically 60–80% cheaper than on-demand. For fault-tolerant batch workloads, use a hybrid: on-demand driver + spot workers.
```python
cluster_config = {
    "cluster_name": "spot-workers-cluster",
    "spark_version": "14.3.x-scala2.12",
    "driver_node_type_id": "Standard_DS3_v2",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 8,
    "azure_attributes": {
        "first_on_demand": 1,  # keep the driver on on-demand capacity
        "availability": "SPOT_WITH_FALLBACK_AZURE",
        "spot_bid_max_price": -1,  # -1 = cap at the current on-demand price
    },
}
```
When to use spot: Bronze ingestion, Silver batch transforms, ML training. When to avoid spot: Interactive notebooks, streaming with exactly-once semantics.
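The config above targets Azure. On AWS the equivalent knobs live under aws_attributes instead; a sketch, with node types as placeholders:

```python
# AWS counterpart of the Azure spot config; aws_attributes replaces
# azure_attributes. Node types here are placeholders.
aws_cluster_config = {
    "cluster_name": "spot-workers-cluster",
    "spark_version": "14.3.x-scala2.12",
    "driver_node_type_id": "i3.xlarge",
    "node_type_id": "i3.xlarge",
    "num_workers": 8,
    "aws_attributes": {
        "first_on_demand": 1,           # keep the driver on on-demand
        "availability": "SPOT_WITH_FALLBACK",
        "spot_bid_price_percent": 100,  # bid up to the on-demand price
    },
}
```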
Strategy 4: Right-Size Your Clusters
Over-provisioning is rampant. Most teams run clusters 2–3× larger than needed.
Profile before sizing:
```sql
-- Find queries that spill to disk (a sign of undersized memory).
-- Column names follow system.query.history and may vary slightly
-- across platform releases.
SELECT
  statement_id,
  SUM(spilled_local_bytes) / 1e9 AS gb_spilled_to_disk,
  SUM(read_bytes) / 1e9 AS gb_read
FROM system.query.history
WHERE start_time >= current_date() - INTERVAL 7 DAYS
GROUP BY 1
ORDER BY 2 DESC;
```
Sizing heuristics:
| Workload | Recommendation |
|---|---|
| SQL analytics | 2–4 nodes, memory-optimized |
| Spark ETL (small) | 2–8 nodes, general-purpose |
| Spark ETL (large) | 8–32 nodes, compute-optimized |
| ML training | GPU nodes only when needed |
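The heuristics in the table can be turned into a first-pass triage rule. The spill and CPU thresholds below are illustrative assumptions, not Databricks guidance:

```python
# First-pass right-sizing triage from a week of profiling metrics.
# Thresholds are illustrative assumptions.
def suggest_action(gb_spilled: float, avg_cpu_util: float) -> str:
    """Rough sizing hint from spill volume and average CPU utilization."""
    if gb_spilled > 1.0:
        # Spill means executors ran out of memory for shuffles/joins
        return "scale up memory (larger or memory-optimized nodes)"
    if avg_cpu_util < 0.40:
        # Mostly idle CPUs suggest the cluster is oversized
        return "scale down (fewer or smaller nodes)"
    return "current size looks reasonable"

print(suggest_action(gb_spilled=5.2, avg_cpu_util=0.90))
```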
Strategy 5: Enable Photon Engine
Photon is Databricks' vectorized query engine — it dramatically speeds up SQL and DataFrame operations, meaning jobs finish faster and cost less:
```python
cluster_config = {
    "runtime_engine": "PHOTON",  # or use a "-photon-" spark_version string
    "node_type_id": "Standard_DS3_v2",
    "spark_version": "14.3.x-scala2.12",
}
```
Photon typically delivers 2–8× speedups on SQL-heavy workloads. Note that Photon-enabled compute bills at a higher DBU rate, so the net saving depends on the speedup; for SQL- and DataFrame-heavy jobs, the shorter runtime usually more than offsets the premium.
Best Photon workloads: Delta Lake reads/writes, SQL aggregations, joins, sorts.
Strategy 6: Use SQL Warehouses for BI Queries
Never use All-Purpose clusters for BI tool connections (Tableau, Power BI, Looker). Use Databricks SQL Warehouses instead — they're designed for concurrent SQL and are significantly cheaper for that pattern:
```sql
SELECT
  warehouse_id,
  SUM(execution_duration_ms) / 3600000.0 AS total_hours,
  COUNT(*) AS query_count,
  AVG(execution_duration_ms / 1000.0) AS avg_duration_sec
FROM system.query.history
WHERE DATE(start_time) >= current_date() - 30
GROUP BY 1
ORDER BY 2 DESC;
```
Configure auto-stop and auto-scale:
```json
{
  "auto_stop_mins": 10,
  "min_num_clusters": 1,
  "max_num_clusters": 3,
  "cluster_size": "Small"
}
```
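Warehouse creation can also be scripted against the SQL Warehouses REST API (POST /api/2.0/sql/warehouses, same bearer-token header as the cluster calls earlier); a sketch of the payload:

```python
# Sketch: payload for creating a cost-conscious SQL Warehouse via
# POST /api/2.0/sql/warehouses. Values mirror the JSON config above.
def build_warehouse_payload(name: str) -> dict:
    """Autoscaling warehouse that stops itself after 10 idle minutes."""
    return {
        "name": name,
        "cluster_size": "Small",
        "auto_stop_mins": 10,    # shut down after 10 idle minutes
        "min_num_clusters": 1,
        "max_num_clusters": 3,   # scale out for BI concurrency bursts
    }

payload = build_warehouse_payload("bi-warehouse")
# requests.post(f"{DATABRICKS_HOST}/api/2.0/sql/warehouses",
#               headers={"Authorization": f"Bearer {TOKEN}"}, json=payload)
```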
Strategy 7: Partition and Z-Order Strategically
Poorly designed tables force full scans, wasting compute. Use partitioning and Z-ordering to minimize data scanned:
```sql
CREATE TABLE prod.silver.events
USING DELTA
PARTITIONED BY (event_date)
AS SELECT *, CAST(event_timestamp AS DATE) AS event_date
FROM prod.bronze.events_raw;

OPTIMIZE prod.silver.events
ZORDER BY (user_id, event_type);
```
Impact: Partition pruning can reduce scanned data by 90%+, directly cutting query cost.
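A toy model of what partition pruning does under the hood; the partition keys and file names are made up:

```python
# Toy model of partition pruning: with event_date as the partition column,
# a filter on it lets the engine skip whole directories of files.
from datetime import date

partitions = {
    date(2024, 6, 1): ["part-0001.parquet", "part-0002.parquet"],
    date(2024, 6, 2): ["part-0003.parquet"],
    date(2024, 6, 3): ["part-0004.parquet", "part-0005.parquet"],
}

def prune(parts: dict, keep) -> list:
    """Return only the files whose partition key satisfies the predicate."""
    return [f for key, files in parts.items() if keep(key) for f in files]

# WHERE event_date = '2024-06-02' touches 1 of 5 files:
print(prune(partitions, lambda d: d == date(2024, 6, 2)))  # -> ['part-0003.parquet']
```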
Strategy 8: Leverage Delta Cache
The Delta cache keeps decoded columnar data in SSD/memory for reuse. Enable it on frequently read tables:
```python
# Note: the cache size settings below generally must be set in the
# cluster's Spark config at startup; the enable flag can be toggled
# at runtime.
spark.conf.set("spark.databricks.io.cache.enabled", "true")
spark.conf.set("spark.databricks.io.cache.maxDiskUsage", "50g")
spark.conf.set("spark.databricks.io.cache.maxMetaDataCache", "1g")
```
Pre-warm the cache before scheduled jobs:
```python
spark.sql("CACHE SELECT * FROM prod.gold.customers")
spark.sql("CACHE SELECT * FROM prod.silver.products")
```
Strategy 9: Vacuum Delta Tables Regularly
Delta keeps old file versions for time travel. Without VACUUM, storage costs compound:
```sql
-- Keep 7 days (168 hours) of history for time travel, delete older files
VACUUM prod.silver.events RETAIN 168 HOURS;

-- Or set the retention window as a table property
ALTER TABLE prod.silver.events
SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 3 days');
```
```python
def vacuum_all_tables(schema: str, retention_hours: int = 168):
    """Vacuum every Delta table in a schema."""
    tables = spark.sql(f"SHOW TABLES IN {schema}").collect()
    for row in tables:
        if row["isTemporary"]:
            continue  # temp views can't be vacuumed
        table = f"{schema}.{row['tableName']}"
        print(f"Vacuuming {table}...")
        spark.sql(f"VACUUM {table} RETAIN {retention_hours} HOURS")

vacuum_all_tables("prod.silver", retention_hours=72)
```
Strategy 10: Implement Cluster Policies
Prevent engineers from spinning up oversized clusters by enforcing policies:
```python
import json
import requests

# Reuses DATABRICKS_HOST and TOKEN from Strategy 1. The API expects
# `definition` as a JSON string, hence json.dumps.
policy = {
    "name": "Data Engineer Standard",
    "definition": json.dumps({
        "dbus_per_hour": {
            "type": "range",
            "maxValue": 100
        },
        "autotermination_minutes": {
            "type": "fixed",
            "value": 30
        },
        "node_type_id": {
            "type": "allowlist",
            "values": ["Standard_DS3_v2", "Standard_DS4_v2"]
        },
        "num_workers": {
            "type": "range",
            "maxValue": 8
        }
    })
}

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/policies/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=policy,
)
response.raise_for_status()
```
Strategy 11: Use Structured Streaming Efficiently
Streaming clusters often run 24/7 — optimize them carefully:
```python
# Micro-batch every 5 minutes instead of continuous processing.
# Note: toTable() starts the stream and returns a StreamingQuery,
# so no separate .start() call is needed.
(df.writeStream
   .trigger(processingTime="5 minutes")
   .format("delta")
   .option("checkpointLocation", "/checkpoints/events")
   .toTable("prod.silver.events"))

# For workloads that can run on a schedule: process everything available,
# then shut down -- pair this with Jobs compute instead of a 24/7 cluster
query = (df.writeStream
           .trigger(availableNow=True)
           .format("delta")
           .option("checkpointLocation", "/checkpoints/events")
           .toTable("prod.silver.events"))
query.awaitTermination()
```
Strategy 12: Monitor Costs Continuously
You can't optimize what you don't measure. Use Databricks System Tables:
```sql
-- Top jobs by DBU consumption over the last 30 days.
-- Assumes a flat $0.40/DBU rate for the estimate; adjust to your contract.
SELECT
  usage_metadata.job_id AS job_id,
  SUM(usage_quantity) AS total_dbus,
  SUM(usage_quantity) * 0.40 AS estimated_cost_usd,
  COUNT(DISTINCT usage_metadata.job_run_id) AS run_count
FROM system.billing.usage
WHERE usage_date >= current_date() - 30
  AND sku_name LIKE '%JOBS%'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 20;
```
Connect your billing data to a monitoring tool like Harbinger Explorer for automated anomaly detection — get alerted when a job's DBU consumption spikes unexpectedly.
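If you want a home-grown starting point before adopting a tool, a minimal spike check might look like this (the 3× threshold is an arbitrary assumption):

```python
# Minimal spike check: flag any job whose DBU usage today is more than
# 3x its trailing 7-day average. Threshold is an arbitrary assumption.
from statistics import mean

def flag_anomalies(history: dict, today: dict, threshold: float = 3.0) -> list:
    """history: job_id -> last 7 daily DBU totals; today: job_id -> today's DBUs."""
    flagged = []
    for job_id, dbus_today in today.items():
        baseline = mean(history.get(job_id) or [0.0])
        if baseline > 0 and dbus_today > threshold * baseline:
            flagged.append(job_id)
    return flagged

history = {"job_1": [10, 12, 11, 9, 10, 11, 10]}
print(flag_anomalies(history, {"job_1": 95.0}))  # -> ['job_1']
```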
Cost Optimization Checklist
| Action | Estimated Savings |
|---|---|
| Auto-terminate idle clusters (30 min) | 20–40% |
| Migrate batch to Jobs compute | 15–30% |
| Enable spot workers | 40–60% |
| Right-size clusters | 20–35% |
| Enable Photon | 10–25% (via speed) |
| Optimize partitioning/Z-Order | 10–30% |
| Regular VACUUM | 5–15% (storage) |
| Cluster policies | 10–25% (governance) |
Implementing these strategies together can cut a typical Databricks bill by 50% or more, though the exact figure depends on how much the individual savings overlap.
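Individual percentage savings compound multiplicatively rather than adding up; a quick illustration with three hypothetical reductions:

```python
# Illustrative only: three hypothetical, independent reductions of
# 30%, 20%, and 25% applied to the same bill compound multiplicatively.
remaining = 1.0
for saving in (0.30, 0.20, 0.25):
    remaining *= (1 - saving)

print(f"total reduction: {1 - remaining:.0%}")  # ~58%, not 75%
```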
Try Harbinger Explorer free for 7 days — get automated cost anomaly detection, cluster utilization reports, and DBU budget alerts across your entire Databricks workspace. Start your free trial at harbingerexplorer.com