Monitoring and Alerting for Databricks Workloads: A Complete Guide

11 min read · Tags: databricks, monitoring, alerting, observability, spark, jobs

Running Databricks in production without proper monitoring is like flying blind. Jobs fail silently, clusters run out of memory at 3 AM, and pipelines stall for hours before anyone notices. Building a solid observability layer is not optional — it's the difference between a reliable data platform and a constant firefighting exercise.

This guide covers everything you need to instrument, monitor, and alert on Databricks workloads: from native job alerts to Spark metrics, cluster health, and integration with external platforms like Datadog, PagerDuty, and Slack.


The Observability Stack for Databricks

A production-grade observability setup for Databricks spans four layers:

| Layer | What to Monitor | Tools |
| --- | --- | --- |
| Job/Workflow | Run status, duration, failure rate | Databricks Alerts, Workflows UI |
| Cluster | CPU, memory, disk, GC pressure | Ganglia, Spark UI, CloudWatch/Azure Monitor |
| Spark Metrics | Task failures, shuffle read/write, spill | Spark metrics sink, Datadog |
| Data Quality | Row counts, null rates, schema drift | Great Expectations, dbt tests, custom checks |

Most teams instrument only the first layer and wonder why they still get surprised by failures. True observability requires all four.


Layer 1: Job and Workflow Alerts

Native Databricks Job Alerts

Databricks Workflows has built-in alerting for job runs. You can configure alerts via the UI or the Jobs API:

# Create a job with email alerts via CLI
databricks jobs create --json '{
  "name": "daily_etl",
  "email_notifications": {
    "on_failure": ["oncall@company.com"],
    "on_success": [],
    "on_start": [],
    "no_alert_for_skipped_runs": true
  },
  "webhook_notifications": {
    "on_failure": [{"id": "your-webhook-id"}]
  }
}'

Alert triggers available:

| Trigger | When to Use |
| --- | --- |
| on_failure | Always; this is the baseline |
| on_success | Useful for SLA confirmation |
| on_start | High-value long-running jobs |
| on_duration_warning_threshold_exceeded | Catch slow jobs before they fail |

The duration threshold alert is underused but extremely valuable. Set it to 20% above your p95 runtime:

# Set a duration health rule via the Databricks Python SDK
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# update() takes a typed JobSettings object, not a plain dict
w.jobs.update(
    job_id=12345,
    new_settings=jobs.JobSettings(
        health=jobs.JobsHealthRules(
            rules=[
                jobs.JobsHealthRule(
                    metric=jobs.JobsHealthMetric.RUN_DURATION_SECONDS,
                    op=jobs.JobsHealthOperator.GREATER_THAN,
                    value=3600,  # alert if the run exceeds 1 hour
                )
            ]
        )
    ),
)

Webhook Notifications for Slack and PagerDuty

For production systems, email is not enough. Wire job failures to Slack and PagerDuty:

# Create a webhook destination (Databricks Notification Destinations API)
import os

import requests

DATABRICKS_HOST = os.environ["DATABRICKS_HOST"]    # workspace URL
DATABRICKS_TOKEN = os.environ["DATABRICKS_TOKEN"]  # personal access token

headers = {
    "Authorization": f"Bearer {DATABRICKS_TOKEN}",
    "Content-Type": "application/json"
}

# Create Slack webhook destination
payload = {
    "display_name": "Data Engineering Slack",
    "slack": {
        "url": "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
    }
}

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/notification-destinations",
    headers=headers,
    json=payload
)
destination_id = response.json()["id"]
print(f"Webhook destination created: {destination_id}")
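The destination `id` in the response is what a job references. Here is a hedged sketch of wiring it into a job's failure notifications via the Jobs 2.1 update endpoint; `build_webhook_settings` and `attach_webhook` are illustrative names, not Databricks APIs:

```python
def build_webhook_settings(destination_id: str) -> dict:
    """Job-settings fragment that routes failure alerts to a notification destination."""
    return {"webhook_notifications": {"on_failure": [{"id": destination_id}]}}

def attach_webhook(host: str, token: str, job_id: int, destination_id: str) -> None:
    import requests  # deferred so the helper above stays dependency-free
    resp = requests.post(
        f"{host}/api/2.1/jobs/update",  # partial update: only new_settings fields change
        headers={"Authorization": f"Bearer {token}"},
        json={"job_id": job_id, "new_settings": build_webhook_settings(destination_id)},
    )
    resp.raise_for_status()
```

Run it once per job, or loop over the Jobs list endpoint to retrofit every job in a workspace.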

Layer 2: Cluster Health Monitoring

Ganglia Metrics

Every Databricks cluster ships with Ganglia, accessible from the Cluster UI → Metrics tab. Key metrics to watch:

| Metric | Normal Range | Action if Exceeded |
| --- | --- | --- |
| CPU utilization | 60-85% | Scale out or upgrade instance type |
| Memory utilization | < 80% | Increase memory or tune Spark config |
| Disk I/O wait | < 20% | Reduce shuffle spill, add local SSD |
| Network bytes in/out | Baseline dependent | Check shuffle-heavy stages |
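Cluster events (terminations, resizes, spot interruptions) often explain metric anomalies after the fact. A sketch that pulls recent events from the Clusters API and tallies them by type; the helper names are mine:

```python
from collections import Counter

def summarize_event_types(events: list) -> Counter:
    """Tally event types (e.g. TERMINATING, RESIZING) to spot recurring problems."""
    return Counter(e.get("type", "UNKNOWN") for e in events)

def fetch_cluster_events(host: str, token: str, cluster_id: str) -> list:
    import requests  # deferred so summarize_event_types stays dependency-free
    resp = requests.post(
        f"{host}/api/2.0/clusters/events",
        headers={"Authorization": f"Bearer {token}"},
        json={"cluster_id": cluster_id, "limit": 100},
    )
    resp.raise_for_status()
    return resp.json().get("events", [])
```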

CloudWatch / Azure Monitor Integration

For persistent metric storage and alerting, ship cluster metrics to your cloud provider's monitoring service:

# init_script.sh — install and configure collectd to ship to CloudWatch
# Add this as a cluster init script in Databricks.
# Note: the write_cloudwatch output assumes AWS's collectd CloudWatch plugin
# is installed on the node; plugin names and options vary by collectd build.

cat > /etc/collectd/collectd.conf << EOF
LoadPlugin ganglia
LoadPlugin write_cloudwatch

<Plugin ganglia>
  Host "localhost"
  Port "8649"
</Plugin>

<Plugin write_cloudwatch>
  Region "us-east-1"
  Namespace "Databricks/Clusters"
</Plugin>
EOF

service collectd restart

Auto-termination and Cost Alerts

Always set auto-termination on interactive clusters — they're the #1 source of surprise cloud bills:

# Set auto-termination via CLI
# Note: `clusters edit` replaces the full cluster spec, so include the
# required fields (spark_version, node_type_id, ...) alongside the change.
databricks clusters edit --json '{
  "cluster_id": "0123-456789-abc123",
  "autotermination_minutes": 30
}'
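To enforce this at scale, sweep the workspace for interactive clusters with auto-termination disabled (a value of 0 means never terminate). A sketch against the Clusters list endpoint; `needs_autotermination` and `find_risky_clusters` are names of mine:

```python
def needs_autotermination(cluster: dict) -> bool:
    """True for an interactive (UI-created) cluster whose auto-termination is off."""
    return (cluster.get("cluster_source") == "UI"
            and cluster.get("autotermination_minutes", 0) == 0)

def find_risky_clusters(host: str, token: str) -> list:
    import requests  # deferred so the predicate above stays dependency-free
    resp = requests.get(f"{host}/api/2.0/clusters/list",
                        headers={"Authorization": f"Bearer {token}"})
    resp.raise_for_status()
    return [c["cluster_id"] for c in resp.json().get("clusters", [])
            if needs_autotermination(c)]
```

Wire the result into a scheduled job that posts to Slack, and surprise bills stop surprising you.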

Layer 3: Spark Metrics and Application-Level Monitoring

Enabling the Spark Metrics Sink

Spark exposes a rich set of metrics via its metrics system. Enable the Prometheus sink to scrape them:

# spark_metrics_config.py — add to your cluster Spark config
spark_conf = {
    "spark.metrics.conf.*.sink.prometheussink.class":
        "org.apache.spark.metrics.sink.PrometheusServlet",
    "spark.metrics.conf.*.sink.prometheussink.path": "/metrics/prometheus",
    "spark.ui.prometheus.enabled": "true"
}
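With the servlet enabled, the driver serves plaintext metrics on port 4040. A minimal scraper sketch; the parser handles only the simple `name value` line shape and skips everything else, so treat it as a starting point rather than a full exposition-format parser:

```python
def parse_prometheus(text: str) -> dict:
    """Parse Prometheus exposition text into {metric_name: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip comments and HELP/TYPE lines
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            pass  # skip malformed lines rather than fail the whole scrape
    return metrics

def scrape_driver(driver_url: str = "http://localhost:4040") -> dict:
    import requests  # deferred so parse_prometheus stays dependency-free
    resp = requests.get(f"{driver_url}/metrics/prometheus", timeout=5)
    resp.raise_for_status()
    return parse_prometheus(resp.text)
```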

Key Spark metrics every Data Engineer should track:

| Metric | What It Signals |
| --- | --- |
| executor.taskFailures | Data quality issues or OOM errors |
| executor.shuffleReadBytes | Wide transformations, potential bottleneck |
| executor.memoryUsed | Approaching OOM threshold |
| executor.diskBytesSpilled | Memory pressure; increase executor memory |
| driver.BlockManager.memory.remainingMem_MB | Driver memory health |

Custom Application Metrics with Spark Listeners

For fine-grained job-level metrics, implement a custom Spark listener:

# PySpark exposes no public SparkListener base class; instead, register a
# Python object that implements the JVM SparkListenerInterface through the
# py4j callback server.
class JobMetricsListener:
    def __init__(self):
        self.job_start_times = {}
        self.job_durations = {}

    def onJobStart(self, job_start):
        job_id = job_start.jobId()
        self.job_start_times[job_id] = job_start.time()
        print(f"Job {job_id} started at {job_start.time()}")

    def onJobEnd(self, job_end):
        job_id = job_end.jobId()
        duration_ms = job_end.time() - self.job_start_times.get(job_id, job_end.time())
        self.job_durations[job_id] = duration_ms
        status = "succeeded" if job_end.jobResult().toString() == "JobSucceeded" else "failed"
        print(f"Job {job_id} {status} in {duration_ms}ms")
        # Ship to your metrics backend here

    # py4j invokes every SparkListenerInterface method; no-op the ones we skip
    def __getattr__(self, name):
        return lambda *args: None

    class Java:
        implements = ["org.apache.spark.scheduler.SparkListenerInterface"]

listener = JobMetricsListener()
sc = spark.sparkContext
# JVM -> Python callbacks need the callback server; depending on the runtime
# it may already be running.
sc._gateway.start_callback_server()
sc._jsc.sc().addSparkListener(listener)

Datadog Integration

Datadog is the most popular external observability platform for Databricks shops. The official integration ships cluster metrics, job metrics, and logs:

# datadog_init_script.sh — Databricks cluster init script
# (the agent install script URL below may change; check Datadog's install docs)
DD_AGENT_MAJOR_VERSION=7 DD_API_KEY=$DD_API_KEY DD_SITE="datadoghq.com" \
  bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script_agent7.sh)"

# Configure Spark integration
cat > /etc/datadog-agent/conf.d/spark.d/conf.yaml << EOF
init_config:

instances:
  - spark_url: http://localhost:4040
    spark_cluster_mode: spark_standalone_mode
    cluster_name: databricks_cluster
    streaming_metrics: true
    histogram_metrics: true
EOF

service datadog-agent restart

Layer 4: Data Quality Monitoring

Infrastructure health is only half the picture. Your pipelines can be "green" while silently producing garbage data.

Row Count Monitoring

# data_quality_checks.py
from typing import Optional

from pyspark.sql import functions as F

# `spark` below is the SparkSession Databricks provides in notebooks and jobs

def check_row_count(table_name: str, min_rows: int, date_col: str = "date",
                    date_val: Optional[str] = None):
    df = spark.table(table_name)
    if date_val:
        df = df.filter(F.col(date_col) == date_val)

    count = df.count()
    if count < min_rows:
        raise ValueError(
            f"Data quality FAIL: {table_name} has {count} rows "
            f"(expected >= {min_rows}) for {date_val}"
        )
    print(f"  OK {table_name}: {count} rows")

# Run as part of your pipeline
check_row_count("catalog.schema.events", min_rows=100_000, date_val="2024-01-15")
check_row_count("catalog.schema.users",  min_rows=1_000,   date_val="2024-01-15")
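Row counts catch missing loads; null rates catch partial ones. A companion check in the same style; `null_rate_exceeded` is a name of mine, split out so the threshold policy is unit-testable without a Spark session:

```python
def null_rate_exceeded(null_count: int, total: int, max_rate: float) -> bool:
    """Pure threshold policy: True when the null fraction is above the limit."""
    return total > 0 and (null_count / total) > max_rate

def check_null_rate(table_name: str, column: str, max_rate: float = 0.01):
    from pyspark.sql import functions as F  # deferred: only needed on a cluster
    df = spark.table(table_name)  # `spark` is provided by the Databricks runtime
    total = df.count()
    nulls = df.filter(F.col(column).isNull()).count()
    if null_rate_exceeded(nulls, total, max_rate):
        raise ValueError(
            f"Data quality FAIL: {table_name}.{column} null rate "
            f"{nulls / total:.2%} exceeds {max_rate:.2%}"
        )
    print(f"  OK {table_name}.{column}: null rate within {max_rate:.2%}")
```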

Schema Drift Detection

import json

def detect_schema_drift(table_name: str, expected_schema_path: str):
    current_schema = spark.table(table_name).schema.jsonValue()

    with open(expected_schema_path) as f:
        expected_schema = json.load(f)

    current_fields = {f["name"]: f["type"] for f in current_schema["fields"]}
    expected_fields = {f["name"]: f["type"] for f in expected_schema["fields"]}

    added = set(current_fields) - set(expected_fields)
    removed = set(expected_fields) - set(current_fields)

    if added or removed:
        raise ValueError(
            f"Schema drift detected in {table_name}:\n"
            f"  Added: {added}\n"
            f"  Removed: {removed}"
        )
    print(f"  Schema OK: {table_name}")
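Added and removed columns are only half of drift: a column silently changing type breaks downstream consumers just as badly. A complementary check that plugs into the same field maps; `changed_types` is a helper name of mine:

```python
def changed_types(current_fields: dict, expected_fields: dict) -> dict:
    """Columns present in both schemas whose type differs: {name: (expected, current)}."""
    return {
        name: (expected_fields[name], current_fields[name])
        for name in current_fields.keys() & expected_fields.keys()
        if current_fields[name] != expected_fields[name]
    }
```

Call it with the `current_fields` and `expected_fields` dicts already built in `detect_schema_drift`, and raise if it returns anything.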

Building an Alerting Runbook

Every alert should have a corresponding runbook. Here's a template for the most common Databricks alerts:

Alert: Job Failed

Severity: P2
Runbook:
1. Check Databricks Workflow run details (UI → Workflows → Run History)
2. Inspect driver logs for stack trace
3. Check if upstream tables were updated (DESCRIBE HISTORY)
4. Check cluster metrics at time of failure (Ganglia → Cluster Events)
5. Repair the run (UI → Repair run, or the Jobs API repair endpoint) if data is intact

Alert: Job Duration > Threshold

Severity: P3
Runbook:
1. Open Spark UI → Stages → identify slowest stage
2. Check shuffle read/write bytes (excessive = data skew or missing partition pruning)
3. Check if input data volume grew (row count vs historical baseline)
4. Review recent code changes (git log)
5. Check if cluster was preempted (Cluster Events → Spot interruptions)
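The skew diagnosis in step 2 can be automated: compare the largest per-key row count to the mean. A rough sketch; the 10x default threshold is an assumption to tune for your data:

```python
def skew_ratio(counts: list) -> float:
    """Max/mean ratio over per-key row counts; ~1.0 means evenly distributed."""
    if not counts:
        return 0.0
    mean = sum(counts) / len(counts)
    return max(counts) / mean if mean else 0.0

def is_skewed(df, key_col: str, threshold: float = 10.0) -> bool:
    # Per-key counts; on very large tables, sample the DataFrame first.
    counts = [r["count"] for r in df.groupBy(key_col).count().collect()]
    return skew_ratio(counts) > threshold
```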

Putting It All Together: A Monitoring Checklist

Use this checklist when launching a new Databricks pipeline to production:

  • Job failure alert wired to Slack/PagerDuty
  • Duration threshold alert set (p95 runtime + 20%)
  • Auto-termination set on interactive clusters
  • Data quality checks (row count, nulls, schema) embedded in pipeline
  • Spark metrics sink enabled and shipping to Datadog/CloudWatch
  • DESCRIBE HISTORY accessible for last 30 days
  • Runbook documented and linked in alert message

Final Thoughts

Monitoring Databricks workloads is a journey, not a destination. Start with the basics (job failure alerts + Slack), then progressively instrument Spark metrics, data quality checks, and cluster health as your platform matures.

Managing this complexity at scale — across dozens of jobs, multiple clusters, and terabytes of Delta tables — is exactly the challenge Harbinger Explorer was built to solve. It surfaces job health, cluster trends, and data quality signals across your entire workspace so your team can move fast without flying blind.

Try Harbinger Explorer free for 7 days and get your Databricks observability sorted in under an hour.

