Monitoring and Alerting for Databricks Workloads: A Complete Guide
Running Databricks in production without proper monitoring is like flying blind. Jobs fail silently, clusters run out of memory at 3 AM, and pipelines stall for hours before anyone notices. Building a solid observability layer is not optional — it's the difference between a reliable data platform and a constant firefighting exercise.
This guide covers everything you need to instrument, monitor, and alert on Databricks workloads: from native job alerts to Spark metrics, cluster health, and integration with external platforms like Datadog, PagerDuty, and Slack.
The Observability Stack for Databricks
A production-grade observability setup for Databricks spans four layers:
| Layer | What to Monitor | Tools |
|---|---|---|
| Job/Workflow | Run status, duration, failure rate | Databricks Alerts, Workflows UI |
| Cluster | CPU, memory, disk, GC pressure | Ganglia, Spark UI, CloudWatch/Azure Monitor |
| Spark Metrics | Task failures, shuffle read/write, spill | Spark metrics sink, Datadog |
| Data Quality | Row counts, null rates, schema drift | Great Expectations, dbt tests, custom checks |
Most teams instrument only the first layer and wonder why they still get surprised by failures. True observability requires all four.
Layer 1: Job and Workflow Alerts
Native Databricks Job Alerts
Databricks Workflows has built-in alerting for job runs. You can configure alerts via the UI or the Jobs API:
# Create a job with email alerts via the Databricks CLI
databricks jobs create --json '{
  "name": "daily_etl",
  "email_notifications": {
    "on_failure": ["oncall@company.com"],
    "on_success": [],
    "on_start": [],
    "no_alert_for_skipped_runs": true
  },
  "webhook_notifications": {
    "on_failure": [{"id": "your-webhook-id"}]
  }
}'
Alert triggers available:
| Trigger | When to Use |
|---|---|
| on_failure | Always; this is the baseline |
| on_success | Useful for SLA confirmation |
| on_start | High-value long-running jobs |
| on_duration_warning_threshold_exceeded | Catch slow jobs before they fail |
The duration threshold alert is underused but extremely valuable. Set it to 20% above your p95 runtime:
# Set a duration health rule via the Python SDK.
# Note: jobs.update expects typed JobSettings, not a plain dict.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import (
    JobSettings, JobsHealthRules, JobsHealthRule,
    JobsHealthMetric, JobsHealthOperator,
)

w = WorkspaceClient()
w.jobs.update(
    job_id=12345,
    new_settings=JobSettings(
        health=JobsHealthRules(
            rules=[
                JobsHealthRule(
                    metric=JobsHealthMetric.RUN_DURATION_SECONDS,
                    op=JobsHealthOperator.GREATER_THAN,
                    value=3600,  # alert if the run exceeds 1 hour
                )
            ]
        )
    ),
)
Webhook Notifications for Slack and PagerDuty
For production systems, email is not enough. Wire job failures to Slack and PagerDuty:
# Create a webhook destination (Databricks Notification Destinations API)
import os

import requests

DATABRICKS_HOST = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
DATABRICKS_TOKEN = os.environ["DATABRICKS_TOKEN"]

headers = {
    "Authorization": f"Bearer {DATABRICKS_TOKEN}",
    "Content-Type": "application/json",
}

# Create a Slack webhook destination. Note the API nests the Slack
# settings under a "config" key.
payload = {
    "display_name": "Data Engineering Slack",
    "config": {
        "slack": {
            "url": "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
        }
    },
}

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/notification-destinations",
    headers=headers,
    json=payload,
)
destination_id = response.json()["id"]
print(f"Webhook destination created: {destination_id}")
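To actually route a job's failures through that destination, reference its id from the job's webhook_notifications. A sketch of the Jobs API 2.1 update payload (the job_id and destination id here are placeholders):

```python
# Build the Jobs API 2.1 "update" payload that points a job's failure
# notifications at a destination id (job_id is hypothetical).
def webhook_update_payload(job_id: int, destination_id: str) -> dict:
    return {
        "job_id": job_id,
        "new_settings": {
            "webhook_notifications": {
                "on_failure": [{"id": destination_id}],
            }
        },
    }

payload = webhook_update_payload(12345, "abc-123")
# POST this to {DATABRICKS_HOST}/api/2.1/jobs/update with the same headers
```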
Layer 2: Cluster Health Monitoring
Ganglia Metrics
Databricks clusters on older runtimes ship with Ganglia, accessible from the Cluster UI → Metrics tab (DBR 13+ replaces it with the built-in cluster metrics UI). Key metrics to watch:
| Metric | Normal Range | Action if Exceeded |
|---|---|---|
| CPU utilization | 60-85% | Scale out or upgrade instance type |
| Memory utilization | < 80% | Increase memory or tune Spark config |
| Disk I/O wait | < 20% | Reduce shuffle spill, add local SSD |
| Network bytes in/out | Baseline dependent | Check shuffle-heavy stages |
CloudWatch / Azure Monitor Integration
For persistent metric storage and alerting, ship cluster metrics to your cloud provider's monitoring service:
# init_script.sh — install and configure collectd to ship to CloudWatch
# Add this as a cluster init script in Databricks.
# Note: "write_cloudwatch" assumes a CloudWatch writer plugin (such as
# AWS's collectd-cloudwatch) is already installed; adjust plugin names
# to match your setup.
cat > /etc/collectd/collectd.conf << EOF
LoadPlugin ganglia
LoadPlugin write_cloudwatch
<Plugin ganglia>
  Host "localhost"
  Port "8649"
</Plugin>
<Plugin write_cloudwatch>
  Region "us-east-1"
  Namespace "Databricks/Clusters"
</Plugin>
EOF
service collectd restart
Auto-termination and Cost Alerts
Always set auto-termination on interactive clusters — they're the #1 source of surprise cloud bills:
# Set auto-termination via the CLI. Note: "clusters edit" replaces the
# whole cluster spec, so include your existing settings (spark_version,
# node_type_id, etc.) alongside autotermination_minutes.
databricks clusters edit --json '{
  "cluster_id": "0123-456789-abc123",
  "autotermination_minutes": 30
}'
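It's also worth auditing periodically for clusters where this was never set. A sketch of the filter logic; in practice the cluster list would come from `WorkspaceClient().clusters.list()`, and the sample dicts below are hypothetical:

```python
# Flag interactive clusters that never auto-terminate
# (autotermination_minutes of 0 means "never").
def never_terminating(clusters):
    return [
        c["cluster_name"]
        for c in clusters
        if c.get("autotermination_minutes", 0) == 0
        and c.get("cluster_source") == "UI"  # interactive, not job clusters
    ]

sample = [
    {"cluster_name": "dev-box", "autotermination_minutes": 0, "cluster_source": "UI"},
    {"cluster_name": "etl-job", "autotermination_minutes": 0, "cluster_source": "JOB"},
    {"cluster_name": "adhoc", "autotermination_minutes": 30, "cluster_source": "UI"},
]
print(never_terminating(sample))
```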
Layer 3: Spark Metrics and Application-Level Monitoring
Enabling the Spark Metrics Sink
Spark exposes a rich set of metrics via its metrics system. Enable the Prometheus sink to scrape them:
# spark_metrics_config.py — add to your cluster Spark config
spark_conf = {
    "spark.metrics.conf.*.sink.prometheusServlet.class":
        "org.apache.spark.metrics.sink.PrometheusServlet",
    "spark.metrics.conf.*.sink.prometheusServlet.path": "/metrics/prometheus",
    # Also exposes executor metrics at /metrics/executors/prometheus on the driver UI
    "spark.ui.prometheus.enabled": "true",
}
Key Spark metrics every data engineer should track:
| Metric | What It Signals |
|---|---|
| executor.taskFailures | Data quality issues or OOM errors |
| executor.shuffleReadBytes | Wide transformations, potential bottleneck |
| executor.memoryUsed | Approaching OOM threshold |
| executor.diskBytesSpilled | Memory pressure; increase executor memory |
| driver.BlockManager.memory.remainingMem_MB | Driver memory health |
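Once these metrics are scraped, you can alert on them directly. A sketch of a Prometheus alerting rule on disk spill; the exact metric name depends on your Spark version and sink configuration, so treat the name below as an illustrative placeholder:

```yaml
# Hypothetical Prometheus alerting rule; adjust the metric name to
# whatever your /metrics endpoints actually expose.
groups:
  - name: databricks-spark
    rules:
      - alert: ExecutorDiskSpill
        expr: increase(metrics_executor_diskBytesSpilled_bytes[10m]) > 1e9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Spark executors are spilling to disk (memory pressure)"
```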
Custom Application Metrics with Spark Listeners
For fine-grained job-level metrics, implement a custom Spark listener:
# PySpark has no public SparkListener base class, so a common pattern
# is to implement the Java listener interface through py4j and register
# it on the underlying JavaSparkContext.
class JobMetricsListener:
    def __init__(self):
        self.job_start_times = {}
        self.job_durations = {}

    def onJobStart(self, job_start):
        job_id = job_start.jobId()
        self.job_start_times[job_id] = job_start.time()
        print(f"Job {job_id} started at {job_start.time()}")

    def onJobEnd(self, job_end):
        job_id = job_end.jobId()
        duration_ms = job_end.time() - self.job_start_times.get(job_id, job_end.time())
        self.job_durations[job_id] = duration_ms
        status = "succeeded" if job_end.jobResult().toString() == "JobSucceeded" else "failed"
        print(f"Job {job_id} {status} in {duration_ms}ms")
        # Ship to your metrics backend here

    # py4j resolves every SparkListenerInterface method by name;
    # answer the ones we don't care about with a no-op.
    def __getattr__(self, name):
        return lambda *args: None

    class Java:
        implements = ["org.apache.spark.scheduler.SparkListenerInterface"]

listener = JobMetricsListener()
sc = spark.sparkContext
if not sc._gateway._callback_server:
    sc._gateway.start_callback_server()  # needed for Java -> Python callbacks
sc._jsc.sc().addSparkListener(listener)
Datadog Integration
Datadog is the most popular external observability platform for Databricks shops. The official integration ships cluster metrics, job metrics, and logs:
# datadog_init_script.sh — Databricks cluster init script
DD_AGENT_MAJOR_VERSION=7 DD_API_KEY=$DD_API_KEY DD_SITE="datadoghq.com" \
bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script_agent7.sh)"
# Configure the Spark integration check
mkdir -p /etc/datadog-agent/conf.d/spark.d
cat > /etc/datadog-agent/conf.d/spark.d/conf.yaml << EOF
init_config:

instances:
  - spark_url: http://localhost:4040
    spark_cluster_mode: spark_driver_mode  # the agent runs on the Databricks driver node
    cluster_name: databricks_cluster
    streaming_metrics: true
    histogram_metrics: true
EOF
service datadog-agent restart
Layer 4: Data Quality Monitoring
Infrastructure health is only half the picture. Your pipelines can be "green" while silently producing garbage data.
Row Count Monitoring
# data_quality_checks.py
from pyspark.sql import functions as F

def check_row_count(table_name: str, min_rows: int, date_col: str = "date",
                    date_val: str = None):
    df = spark.table(table_name)
    if date_val:
        df = df.filter(F.col(date_col) == date_val)
    count = df.count()
    if count < min_rows:
        raise ValueError(
            f"Data quality FAIL: {table_name} has {count} rows "
            f"(expected >= {min_rows}) for {date_val}"
        )
    print(f"OK {table_name}: {count} rows")

# Run as part of your pipeline
check_row_count("catalog.schema.events", min_rows=100_000, date_val="2024-01-15")
check_row_count("catalog.schema.users", min_rows=1_000, date_val="2024-01-15")
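Row counts catch a missing load; a null-rate check catches a partial one. A minimal sketch: the counts would come from Spark (e.g. `df.filter(F.col(col).isNull()).count()`), while the threshold logic itself is plain Python:

```python
# Sketch of a null-rate check. null_count and total would come from
# Spark aggregations; the 5% default threshold is an assumption to tune.
def check_null_rate(col: str, null_count: int, total: int, max_rate: float = 0.05):
    rate = null_count / total if total else 1.0  # empty table counts as all-null
    if rate > max_rate:
        raise ValueError(
            f"Data quality FAIL: {col} null rate {rate:.1%} exceeds {max_rate:.1%}"
        )
    print(f"OK {col}: null rate {rate:.1%}")

check_null_rate("email", null_count=120, total=100_000)  # well under 5%
```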
Schema Drift Detection
import json

def detect_schema_drift(table_name: str, expected_schema_path: str):
    current_schema = spark.table(table_name).schema.jsonValue()
    with open(expected_schema_path) as f:
        expected_schema = json.load(f)
    current_fields = {fld["name"]: fld["type"] for fld in current_schema["fields"]}
    expected_fields = {fld["name"]: fld["type"] for fld in expected_schema["fields"]}
    added = set(current_fields) - set(expected_fields)
    removed = set(expected_fields) - set(current_fields)
    if added or removed:
        raise ValueError(
            f"Schema drift detected in {table_name}:\n"
            f"  Added: {added}\n"
            f"  Removed: {removed}"
        )
    print(f"Schema OK: {table_name}")
Building an Alerting Runbook
Every alert should have a corresponding runbook. Here's a template for the most common Databricks alerts:
Alert: Job Failed
Severity: P2
Runbook:
1. Check Databricks Workflow run details (UI → Workflows → Run History)
2. Inspect driver logs for stack trace
3. Check if upstream tables were updated (DESCRIBE HISTORY)
4. Check cluster metrics at time of failure (Ganglia → Cluster Events)
5. Repair the run (Workflows UI → Repair run, or the jobs/runs/repair API) if data is intact
Alert: Job Duration > Threshold
Severity: P3
Runbook:
1. Open Spark UI → Stages → identify slowest stage
2. Check shuffle read/write bytes (excessive = data skew or missing partition pruning)
3. Check if input data volume grew (row count vs historical baseline)
4. Review recent code changes (git log)
5. Check if cluster was preempted (Cluster Events → Spot interruptions)
Putting It All Together: A Monitoring Checklist
Use this checklist when launching a new Databricks pipeline to production:
- Job failure alert wired to Slack/PagerDuty
- Duration threshold alert set (p95 runtime + 20%)
- Auto-termination set on interactive clusters
- Data quality checks (row count, nulls, schema) embedded in pipeline
- Spark metrics sink enabled and shipping to Datadog/CloudWatch
- DESCRIBE HISTORY accessible for last 30 days
- Runbook documented and linked in alert message
Final Thoughts
Monitoring Databricks workloads is a journey, not a destination. Start with the basics (job failure alerts + Slack), then progressively instrument Spark metrics, data quality checks, and cluster health as your platform matures.
Managing this complexity at scale — across dozens of jobs, multiple clusters, and terabytes of Delta tables — is exactly the challenge Harbinger Explorer was built to solve. It surfaces job health, cluster trends, and data quality signals across your entire workspace so your team can move fast without flying blind.
Try Harbinger Explorer free for 7 days and get your Databricks observability sorted in under an hour.