Monitoring and Alerting for Databricks Workloads: A Complete Guide
Running Databricks in production without proper monitoring is like flying blind. Jobs fail silently, clusters run out of memory at 3 AM, and pipelines stall for hours before anyone notices. Building a solid observability layer is not optional — it's the difference between a reliable data platform and a constant firefighting exercise.
This guide covers everything you need to instrument, monitor, and alert on Databricks workloads: from native job alerts to Spark metrics, cluster health, and integration with external platforms like Datadog, PagerDuty, and Slack.
The Observability Stack for Databricks
A production-grade observability setup for Databricks spans four layers:
| Layer | What to Monitor | Tools |
|---|---|---|
| Job/Workflow | Run status, duration, failure rate | Databricks Alerts, Workflows UI |
| Cluster | CPU, memory, disk, GC pressure | Ganglia, Spark UI, CloudWatch/Azure Monitor |
| Spark Metrics | Task failures, shuffle read/write, spill | Spark metrics sink, Datadog |
| Data Quality | Row counts, null rates, schema drift | Great Expectations, dbt tests, custom checks |
Most teams instrument only the first layer and wonder why they still get surprised by failures. True observability requires all four.
Layer 1: Job and Workflow Alerts
Native Databricks Job Alerts
Databricks Workflows has built-in alerting for job runs. You can configure alerts via the UI or the Jobs API:
# Create a job with email alerts via the Databricks CLI
databricks jobs create --json '{
  "name": "daily_etl",
  "email_notifications": {
    "on_failure": ["oncall@company.com"],
    "on_success": [],
    "on_start": [],
    "no_alert_for_skipped_runs": true
  },
  "webhook_notifications": {
    "on_failure": [{"id": "your-webhook-id"}]
  }
}'
Alert triggers available:
| Trigger | When to Use |
|---|---|
| on_failure | Always; this is the baseline |
| on_success | Useful for SLA confirmation |
| on_start | High-value long-running jobs |
| on_duration_warning_threshold_exceeded | Catch slow jobs before they fail |
The duration threshold alert is underused but extremely valuable. Set it to 20% above your p95 runtime:
# Set a duration health rule via the Python SDK.
# Note: jobs.update expects typed JobSettings, not a plain dict.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import (
    JobSettings, JobsHealthRules, JobsHealthRule,
    JobsHealthMetric, JobsHealthOperator,
)

w = WorkspaceClient()
w.jobs.update(
    job_id=12345,
    new_settings=JobSettings(
        health=JobsHealthRules(
            rules=[
                JobsHealthRule(
                    metric=JobsHealthMetric.RUN_DURATION_SECONDS,
                    op=JobsHealthOperator.GREATER_THAN,
                    value=3600,  # alert if the run exceeds 1 hour
                )
            ]
        )
    ),
)
Webhook Notifications for Slack and PagerDuty
For production systems, email is not enough. Wire job failures to Slack and PagerDuty:
# Create a webhook destination (Databricks Notification Destinations API)
import os

import requests

DATABRICKS_HOST = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
DATABRICKS_TOKEN = os.environ["DATABRICKS_TOKEN"]

headers = {
    "Authorization": f"Bearer {DATABRICKS_TOKEN}",
    "Content-Type": "application/json",
}

# Create a Slack webhook destination. Note the API nests the Slack
# settings under a "config" key.
payload = {
    "display_name": "Data Engineering Slack",
    "config": {
        "slack": {
            "url": "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
        }
    },
}

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/notification-destinations",
    headers=headers,
    json=payload,
)
destination_id = response.json()["id"]
print(f"Webhook destination created: {destination_id}")
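To actually route a job's failures through that destination, reference its id from the job's webhook_notifications. A sketch of the Jobs API 2.1 update payload (the job_id and destination id here are placeholders):

```python
# Build the Jobs API 2.1 "update" payload that points a job's failure
# notifications at a destination id (job_id is hypothetical).
def webhook_update_payload(job_id: int, destination_id: str) -> dict:
    return {
        "job_id": job_id,
        "new_settings": {
            "webhook_notifications": {
                "on_failure": [{"id": destination_id}],
            }
        },
    }

payload = webhook_update_payload(12345, "abc-123")
# POST this to {DATABRICKS_HOST}/api/2.1/jobs/update with the same headers
```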
Layer 2: Cluster Health Monitoring
Ganglia Metrics
Databricks clusters on older runtimes ship with Ganglia, accessible from the Cluster UI → Metrics tab (DBR 13+ replaces it with the built-in cluster metrics UI). Key metrics to watch:
| Metric | Normal Range | Action if Exceeded |
|---|---|---|
| CPU utilization | 60-85% | Scale out or upgrade instance type |
| Memory utilization | < 80% | Increase memory or tune Spark config |
| Disk I/O wait | < 20% | Reduce shuffle spill, add local SSD |
| Network bytes in/out | Baseline dependent | Check shuffle-heavy stages |
CloudWatch / Azure Monitor Integration
For persistent metric storage and alerting, ship cluster metrics to your cloud provider's monitoring service:
# init_script.sh — install and configure collectd to ship to CloudWatch
# Add this as a cluster init script in Databricks.
# Note: "write_cloudwatch" assumes a CloudWatch writer plugin (such as
# AWS's collectd-cloudwatch) is already installed; adjust plugin names
# to match your setup.
cat > /etc/collectd/collectd.conf << EOF
LoadPlugin ganglia
LoadPlugin write_cloudwatch
<Plugin ganglia>
  Host "localhost"
  Port "8649"
</Plugin>
<Plugin write_cloudwatch>
  Region "us-east-1"
  Namespace "Databricks/Clusters"
</Plugin>
EOF
service collectd restart
Auto-termination and Cost Alerts
Always set auto-termination on interactive clusters — they're the #1 source of surprise cloud bills:
# Set auto-termination via the CLI. Note: "clusters edit" replaces the
# whole cluster spec, so include your existing settings (spark_version,
# node_type_id, etc.) alongside autotermination_minutes.
databricks clusters edit --json '{
  "cluster_id": "0123-456789-abc123",
  "autotermination_minutes": 30
}'
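It's also worth auditing periodically for clusters where this was never set. A sketch of the filter logic; in practice the cluster list would come from `WorkspaceClient().clusters.list()`, and the sample dicts below are hypothetical:

```python
# Flag interactive clusters that never auto-terminate
# (autotermination_minutes of 0 means "never").
def never_terminating(clusters):
    return [
        c["cluster_name"]
        for c in clusters
        if c.get("autotermination_minutes", 0) == 0
        and c.get("cluster_source") == "UI"  # interactive, not job clusters
    ]

sample = [
    {"cluster_name": "dev-box", "autotermination_minutes": 0, "cluster_source": "UI"},
    {"cluster_name": "etl-job", "autotermination_minutes": 0, "cluster_source": "JOB"},
    {"cluster_name": "adhoc", "autotermination_minutes": 30, "cluster_source": "UI"},
]
print(never_terminating(sample))
```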
Layer 3: Spark Metrics and Application-Level Monitoring
Enabling the Spark Metrics Sink
Spark exposes a rich set of metrics via its metrics system. Enable the Prometheus sink to scrape them:
# spark_metrics_config.py — add to your cluster Spark config
spark_conf = {
    "spark.metrics.conf.*.sink.prometheusServlet.class":
        "org.apache.spark.metrics.sink.PrometheusServlet",
    "spark.metrics.conf.*.sink.prometheusServlet.path": "/metrics/prometheus",
    # Also exposes executor metrics at /metrics/executors/prometheus on the driver UI
    "spark.ui.prometheus.enabled": "true",
}
Key Spark metrics every data engineer should track:
| Metric | What It Signals |
|---|---|
| executor.taskFailures | Data quality issues or OOM errors |
| executor.shuffleReadBytes | Wide transformations, potential bottleneck |
| executor.memoryUsed | Approaching OOM threshold |
| executor.diskBytesSpilled | Memory pressure; increase executor memory |
| driver.BlockManager.memory.remainingMem_MB | Driver memory health |
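Once these metrics are scraped, you can alert on them directly. A sketch of a Prometheus alerting rule on disk spill; the exact metric name depends on your Spark version and sink configuration, so treat the name below as an illustrative placeholder:

```yaml
# Hypothetical Prometheus alerting rule; adjust the metric name to
# whatever your /metrics endpoints actually expose.
groups:
  - name: databricks-spark
    rules:
      - alert: ExecutorDiskSpill
        expr: increase(metrics_executor_diskBytesSpilled_bytes[10m]) > 1e9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Spark executors are spilling to disk (memory pressure)"
```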
Custom Application Metrics with Spark Listeners
For fine-grained job-level metrics, implement a custom Spark listener:
# PySpark has no public SparkListener base class, so a common pattern
# is to implement the Java listener interface through py4j and register
# it on the underlying JavaSparkContext.
class JobMetricsListener:
    def __init__(self):
        self.job_start_times = {}
        self.job_durations = {}

    def onJobStart(self, job_start):
        job_id = job_start.jobId()
        self.job_start_times[job_id] = job_start.time()
        print(f"Job {job_id} started at {job_start.time()}")

    def onJobEnd(self, job_end):
        job_id = job_end.jobId()
        duration_ms = job_end.time() - self.job_start_times.get(job_id, job_end.time())
        self.job_durations[job_id] = duration_ms
        status = "succeeded" if job_end.jobResult().toString() == "JobSucceeded" else "failed"
        print(f"Job {job_id} {status} in {duration_ms}ms")
        # Ship to your metrics backend here

    # py4j resolves every SparkListenerInterface method by name;
    # answer the ones we don't care about with a no-op.
    def __getattr__(self, name):
        return lambda *args: None

    class Java:
        implements = ["org.apache.spark.scheduler.SparkListenerInterface"]

listener = JobMetricsListener()
sc = spark.sparkContext
if not sc._gateway._callback_server:
    sc._gateway.start_callback_server()  # needed for Java -> Python callbacks
sc._jsc.sc().addSparkListener(listener)
Datadog Integration
Datadog is the most popular external observability platform for Databricks shops. The official integration ships cluster metrics, job metrics, and logs:
# datadog_init_script.sh — Databricks cluster init script
DD_AGENT_MAJOR_VERSION=7 DD_API_KEY=$DD_API_KEY DD_SITE="datadoghq.com" \
bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script_agent7.sh)"
# Configure the Spark integration check
mkdir -p /etc/datadog-agent/conf.d/spark.d
cat > /etc/datadog-agent/conf.d/spark.d/conf.yaml << EOF
init_config:

instances:
  - spark_url: http://localhost:4040
    spark_cluster_mode: spark_driver_mode  # the agent runs on the Databricks driver node
    cluster_name: databricks_cluster
    streaming_metrics: true
    histogram_metrics: true
EOF
service datadog-agent restart
Layer 4: Data Quality Monitoring
Infrastructure health is only half the picture. Your pipelines can be "green" while silently producing garbage data.
Row Count Monitoring
# data_quality_checks.py
from pyspark.sql import functions as F

def check_row_count(table_name: str, min_rows: int, date_col: str = "date",
                    date_val: str = None):
    df = spark.table(table_name)
    if date_val:
        df = df.filter(F.col(date_col) == date_val)
    count = df.count()
    if count < min_rows:
        raise ValueError(
            f"Data quality FAIL: {table_name} has {count} rows "
            f"(expected >= {min_rows}) for {date_val}"
        )
    print(f"OK {table_name}: {count} rows")

# Run as part of your pipeline
check_row_count("catalog.schema.events", min_rows=100_000, date_val="2024-01-15")
check_row_count("catalog.schema.users", min_rows=1_000, date_val="2024-01-15")
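Row counts catch a missing load; a null-rate check catches a partial one. A minimal sketch: the counts would come from Spark (e.g. `df.filter(F.col(col).isNull()).count()`), while the threshold logic itself is plain Python:

```python
# Sketch of a null-rate check. null_count and total would come from
# Spark aggregations; the 5% default threshold is an assumption to tune.
def check_null_rate(col: str, null_count: int, total: int, max_rate: float = 0.05):
    rate = null_count / total if total else 1.0  # empty table counts as all-null
    if rate > max_rate:
        raise ValueError(
            f"Data quality FAIL: {col} null rate {rate:.1%} exceeds {max_rate:.1%}"
        )
    print(f"OK {col}: null rate {rate:.1%}")

check_null_rate("email", null_count=120, total=100_000)  # well under 5%
```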
Schema Drift Detection
import json

def detect_schema_drift(table_name: str, expected_schema_path: str):
    current_schema = spark.table(table_name).schema.jsonValue()
    with open(expected_schema_path) as f:
        expected_schema = json.load(f)
    current_fields = {fld["name"]: fld["type"] for fld in current_schema["fields"]}
    expected_fields = {fld["name"]: fld["type"] for fld in expected_schema["fields"]}
    added = set(current_fields) - set(expected_fields)
    removed = set(expected_fields) - set(current_fields)
    if added or removed:
        raise ValueError(
            f"Schema drift detected in {table_name}:\n"
            f"  Added: {added}\n"
            f"  Removed: {removed}"
        )
    print(f"Schema OK: {table_name}")
Building an Alerting Runbook
Every alert should have a corresponding runbook. Here's a template for the most common Databricks alerts:
Alert: Job Failed
Severity: P2
Runbook:
1. Check Databricks Workflow run details (UI → Workflows → Run History)
2. Inspect driver logs for stack trace
3. Check if upstream tables were updated (DESCRIBE HISTORY)
4. Check cluster metrics at time of failure (Ganglia → Cluster Events)
5. Repair the run (Workflows UI → Repair run, or the jobs/runs/repair API) if data is intact
Alert: Job Duration > Threshold
Severity: P3
Runbook:
1. Open Spark UI → Stages → identify slowest stage
2. Check shuffle read/write bytes (excessive = data skew or missing partition pruning)
3. Check if input data volume grew (row count vs historical baseline)
4. Review recent code changes (git log)
5. Check if cluster was preempted (Cluster Events → Spot interruptions)
Putting It All Together: A Monitoring Checklist
Use this checklist when launching a new Databricks pipeline to production:
- Job failure alert wired to Slack/PagerDuty
- Duration threshold alert set (p95 runtime + 20%)
- Auto-termination set on interactive clusters
- Data quality checks (row count, nulls, schema) embedded in pipeline
- Spark metrics sink enabled and shipping to Datadog/CloudWatch
- DESCRIBE HISTORY accessible for last 30 days
- Runbook documented and linked in alert message
Final Thoughts
Monitoring Databricks workloads is a journey, not a destination. Start with the basics (job failure alerts + Slack), then progressively instrument Spark metrics, data quality checks, and cluster health as your platform matures.
Managing this complexity at scale — across dozens of jobs, multiple clusters, and terabytes of Delta tables — is exactly the challenge Harbinger Explorer was built to solve. It surfaces job health, cluster trends, and data quality signals across your entire workspace so your team can move fast without flying blind.
Try Harbinger Explorer free for 7 days and get your Databricks observability sorted in under an hour.