Databricks Asset Bundles (DABs): The Complete Deployment Guide

13 min read · Tags: databricks, asset-bundles, dabs, cicd, infrastructure-as-code, deployment, data-engineering


Databricks Asset Bundles (DABs) are Databricks' answer to infrastructure-as-code for data pipelines. With DABs, you define your entire Databricks project — jobs, Delta Live Tables pipelines, notebooks, clusters, permissions — in YAML and Python, deploy it consistently across environments, and version everything in Git.

This guide covers everything from project initialization to production CI/CD pipelines.


What Are Databricks Asset Bundles?

A bundle is a collection of Databricks resources defined in YAML that can be deployed as a single unit. Think of it as Terraform for Databricks — but with first-class support for Databricks-specific concepts like jobs, pipelines, and notebooks.

What you can define in a bundle:

Resource Type   Description
-------------   -----------
jobs            Databricks Workflows (job clusters, task dependencies)
pipelines       Delta Live Tables (DLT) pipelines
experiments     MLflow experiments
models          MLflow model registrations
clusters        All-purpose cluster configurations
dashboards      Databricks SQL dashboards
permissions     Access control for any resource

Installation and Setup

# Install the Databricks CLI (bundle support requires v0.205+)
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

# Verify installation
databricks --version

# Authenticate
databricks configure
# Prompts for host and token, or use environment variables:
export DATABRICKS_HOST="https://adb-WORKSPACE_ID.azuredatabricks.net"
export DATABRICKS_TOKEN="dapi..."
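Scripts that drive the CLI can fail fast when these variables are missing. A minimal pre-flight sketch (the helper name is illustrative, not part of the CLI or SDK):

```python
import os


def missing_databricks_auth(env=None):
    """Return the names of required auth variables that are not set.

    Token-based auth needs DATABRICKS_HOST and DATABRICKS_TOKEN; an empty
    result means the environment looks ready for `databricks bundle` commands.
    """
    env = os.environ if env is None else env
    required = ("DATABRICKS_HOST", "DATABRICKS_TOKEN")
    return [name for name in required if not env.get(name)]
```

Calling it with a partial environment, e.g. `missing_databricks_auth({"DATABRICKS_HOST": "https://..."})`, returns `["DATABRICKS_TOKEN"]`, which a wrapper script can surface before any CLI call fails with a less obvious error.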

Project Initialization

# Initialize a new bundle project
databricks bundle init

# Choose from templates:
# 1. default-python  (recommended starting point)
# 2. dlt-python
# 3. mlops-stacks

cd my-databricks-project
ls -la
# databricks.yml        -- main bundle config
# src/                  -- Python source files
# resources/            -- YAML resource definitions
# tests/                -- unit tests
# .github/workflows/    -- CI/CD (if selected)

Bundle Configuration Deep Dive

databricks.yml — The Root Config

bundle:
  name: harbinger-data-platform

# Define environments (targets)
targets:
  dev:
    mode: development  # adds [dev username] prefix to resources
    default: true
    workspace:
      host: https://adb-WORKSPACE_ID.azuredatabricks.net

  staging:
    mode: production
    workspace:
      host: https://adb-WORKSPACE_ID.azuredatabricks.net
    variables:
      env: staging
      cluster_size: "2g"

  prod:
    mode: production
    workspace:
      host: https://adb-WORKSPACE_ID.azuredatabricks.net
    variables:
      env: prod
      cluster_size: "8g"
    permissions:
      - group_name: data-engineers
        level: CAN_RUN
      - group_name: data-analysts
        level: CAN_VIEW

# Variable definitions with defaults
variables:
  env:
    description: "Target environment"
    default: dev
  cluster_size:
    description: "Cluster size tier (e.g. 2g, 8g)"
    default: "2g"

# Include resource files
include:
  - resources/jobs/*.yml
  - resources/pipelines/*.yml
  - resources/clusters/*.yml
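A target's `variables` block overrides the top-level defaults when you deploy to that target (and `BUNDLE_VAR_<name>` environment variables override both, as covered later). The merge semantics can be sketched in plain Python; this is an illustration of the precedence, not CLI code:

```python
def resolve_variables(defaults, target_overrides):
    """Merge bundle variable defaults with a target's overrides.

    Mirrors how ${var.*} references resolve: a value set on the selected
    target wins over the top-level default.
    """
    resolved = {name: spec.get("default") for name, spec in defaults.items()}
    resolved.update(target_overrides)
    return resolved


defaults = {
    "env": {"default": "dev"},
    "cluster_size": {"default": "2g"},
}

# Deploying to prod applies that target's overrides over the defaults.
prod = resolve_variables(defaults, {"env": "prod", "cluster_size": "8g"})
```

Deploying to `dev`, which defines no overrides, would resolve to the defaults unchanged.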

Defining Jobs

# resources/jobs/medallion_pipeline.yml
resources:
  jobs:
    medallion_pipeline:
      name: "Medallion Pipeline [${var.env}]"

      email_notifications:
        on_failure:
          - data-alerts@yourcompany.com
        on_success:
          - data-reports@yourcompany.com

      health:
        rules:
          - metric: RUN_DURATION_SECONDS
            op: GREATER_THAN
            value: 7200

      trigger:
        pause_status: UNPAUSED
        periodic:
          interval: 1
          unit: DAYS

      tasks:
        - task_key: bronze_ingestion
          description: "Ingest raw events from landing zone"
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
            num_workers: 4
            spark_conf:
              spark.databricks.delta.optimizeWrite.enabled: "true"
          notebook_task:
            notebook_path: ../../src/notebooks/bronze_ingestion.py
            base_parameters:
              env: ${var.env}
          libraries:
            - pypi:
                package: delta-spark==3.0.0

        - task_key: silver_transform
          depends_on:
            - task_key: bronze_ingestion
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            node_type_id: "Standard_DS4_v2"
            num_workers: 8
            azure_attributes:
              availability: SPOT_WITH_FALLBACK_AZURE
          python_wheel_task:
            package_name: harbinger_transforms
            entry_point: run_silver
            parameters:
              - "--env"
              - ${var.env}
          libraries:
            - whl: ../../dist/harbinger_transforms-*.whl

        - task_key: gold_aggregation
          depends_on:
            - task_key: silver_transform
          sql_task:
            warehouse_id: ${var.sql_warehouse_id}
            query:
              query_id: ""  # ID of a saved Databricks SQL query (left blank here)

        - task_key: dq_validation
          depends_on:
            - task_key: gold_aggregation
          notebook_task:
            notebook_path: ../../src/notebooks/data_quality_check.py
          run_if: ALL_SUCCESS
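The `depends_on` entries above form a DAG that Workflows uses to schedule tasks. The resulting execution order can be sketched in pure Python; this is purely illustrative (the scheduler does this server-side), and it simplifies `depends_on` to plain strings rather than the `{task_key: ...}` mappings the YAML uses:

```python
def execution_order(tasks):
    """Topologically sort tasks by their depends_on edges (Kahn's algorithm)."""
    deps = {t["task_key"]: set(t.get("depends_on", [])) for t in tasks}
    order = []
    while deps:
        # Tasks whose dependencies have all completed are ready to run.
        ready = sorted(k for k, d in deps.items() if not d)
        if not ready:
            raise ValueError("dependency cycle detected")
        for key in ready:
            order.append(key)
            del deps[key]
        for d in deps.values():
            d.difference_update(ready)
    return order


tasks = [
    {"task_key": "silver_transform", "depends_on": ["bronze_ingestion"]},
    {"task_key": "bronze_ingestion"},
    {"task_key": "gold_aggregation", "depends_on": ["silver_transform"]},
    {"task_key": "dq_validation", "depends_on": ["gold_aggregation"]},
]
```

For the medallion job this yields bronze, then silver, then gold, then the DQ check, matching the intent of the YAML above.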

Defining DLT Pipelines

# resources/pipelines/streaming_events.yml
resources:
  pipelines:
    streaming_events:
      name: "Streaming Events Pipeline [${var.env}]"
      target: "${var.env}_silver"
      catalog: "${var.env}"

      # No explicit `development:` flag needed: `mode: development` on the dev
      # target enables pipeline development mode automatically (expressions
      # like ${bundle.target == 'dev'} are not valid bundle syntax).

      continuous: false

      clusters:
        - label: default
          num_workers: 4
          node_type_id: "Standard_DS3_v2"

      libraries:
        - notebook:
            path: ../../src/dlt/streaming_pipeline.py
        - notebook:
            path: ../../src/dlt/quality_expectations.py

      configuration:
        pipelines.applyChangesPreviewEnabled: "true"
        spark.databricks.delta.optimizeWrite.enabled: "true"
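The `quality_expectations` notebook referenced above would typically use DLT expectations such as `@dlt.expect_or_drop`. Their drop semantics can be sketched in plain Python, independent of the DLT runtime; the predicates here are hypothetical examples, not taken from the pipeline:

```python
def apply_expectations(rows, expectations):
    """Keep rows that satisfy every expectation, mirroring expect_or_drop.

    `expectations` maps an expectation name to a predicate over one row
    (a dict). Returns (kept_rows, drop_counts_per_expectation), roughly
    what DLT surfaces in its data quality metrics.
    """
    dropped = {name: 0 for name in expectations}
    kept = []
    for row in rows:
        failed = [name for name, pred in expectations.items() if not pred(row)]
        for name in failed:
            dropped[name] += 1
        if not failed:
            kept.append(row)
    return kept, dropped


# Hypothetical expectations for an events table.
expectations = {
    "valid_id": lambda r: r.get("event_id") is not None,
    "non_negative_amount": lambda r: r.get("amount", 0) >= 0,
}
```

In the real notebook, each predicate would instead be a SQL constraint string passed to the decorator, and DLT would track the drop counts for you.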

Python Source Structure

src/
├── notebooks/
│   ├── bronze_ingestion.py
│   ├── silver_transform.py
│   └── data_quality_check.py
├── dlt/
│   ├── streaming_pipeline.py
│   └── quality_expectations.py
├── transforms/
│   ├── __init__.py
│   ├── bronze.py
│   ├── silver.py
│   └── gold.py
└── utils/
    ├── __init__.py
    ├── spark_helpers.py
    └── schema_registry.py

Writing Bundle-Compatible Notebooks

# src/notebooks/bronze_ingestion.py
# Databricks notebook source

# COMMAND ----------
# Parameters (injected by Databricks Workflows)
dbutils.widgets.text("env", "dev")
dbutils.widgets.text("batch_date", "2024-01-01")

env = dbutils.widgets.get("env")
batch_date = dbutils.widgets.get("batch_date")

print(f"Running bronze ingestion for env={env}, date={batch_date}")

# COMMAND ----------
from pyspark.sql.functions import current_timestamp, input_file_name, lit

source_path = f"abfss://landing@{env}storage.dfs.core.windows.net/events/{batch_date}/"
target_table = f"{env}.bronze.events_raw"

(
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", f"/mnt/checkpoints/{env}/bronze/events/schema")
    .load(source_path)
    .withColumn("_ingested_at", current_timestamp())
    .withColumn("_source_file", input_file_name())
    .withColumn("_env", lit(env))
    .writeStream
    .format("delta")
    .option("checkpointLocation", f"/mnt/checkpoints/{env}/bronze/events/stream")
    .option("mergeSchema", "true")
    .outputMode("append")
    .trigger(availableNow=True)
    .table(target_table)
    .awaitTermination()
)

print(f"Bronze ingestion complete: {target_table}")
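Widget values always arrive as strings, so it is worth validating them before they are interpolated into storage paths. A small helper along these lines (hypothetical, not part of `dbutils`) keeps a malformed `batch_date` from silently producing an empty source path:

```python
from datetime import date, datetime


def parse_batch_date(raw):
    """Parse a YYYY-MM-DD widget value, raising a clear error otherwise."""
    try:
        return datetime.strptime(raw, "%Y-%m-%d").date()
    except ValueError as exc:
        raise ValueError(f"batch_date must be YYYY-MM-DD, got {raw!r}") from exc
```

Called at the top of the notebook, `parse_batch_date(batch_date)` fails the task immediately with an actionable message instead of a confusing downstream read error.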

CLI Workflow

# Validate bundle configuration (syntax check)
databricks bundle validate

# Deploy to dev (default target)
databricks bundle deploy

# Deploy to specific target
databricks bundle deploy --target staging
databricks bundle deploy --target prod

# Run a specific job after deployment
databricks bundle run medallion_pipeline

# Run with parameter overrides
databricks bundle run medallion_pipeline \
  --python-params '["--env", "staging", "--batch-date", "2024-01-15"]'

# Watch job status
databricks bundle run medallion_pipeline --watch

# Destroy all deployed resources
databricks bundle destroy --target dev --auto-approve
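Teams often wrap these commands in a small deploy script. A sketch using `subprocess` (the wrapper itself is illustrative; the flags shown are the same ones used above):

```python
import subprocess


def bundle(action, target=None, extra_args=(), runner=subprocess.run):
    """Build and run a `databricks bundle` command; returns the runner result.

    `runner` is injectable so command construction can be tested without
    a workspace; the default actually shells out to the CLI.
    """
    cmd = ["databricks", "bundle", action]
    if target:
        cmd += ["--target", target]
    cmd += list(extra_args)
    return runner(cmd, check=True)
```

Usage would look like `bundle("deploy", target="staging")` or `bundle("run", target="staging", extra_args=("medallion_pipeline", "--watch"))`, with `check=True` turning a non-zero CLI exit into an exception.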

CI/CD with GitHub Actions

# .github/workflows/databricks-deploy.yml
name: Deploy Databricks Bundle

on:
  push:
    branches: [main, staging]
  pull_request:
    branches: [main]

env:
  DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
  DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}

jobs:
  validate:
    name: Validate Bundle
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Databricks CLI
        uses: databricks/setup-cli@main

      - name: Validate Bundle
        run: databricks bundle validate

      - name: Run Unit Tests
        run: |
          pip install pytest pyspark delta-spark
          pytest tests/ -v --tb=short

  deploy-staging:
    name: Deploy to Staging
    needs: validate
    if: github.ref == 'refs/heads/staging'
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4

      - name: Setup Databricks CLI
        uses: databricks/setup-cli@main

      - name: Build Python Wheel
        run: |
          pip install build
          python -m build

      - name: Deploy to Staging
        run: databricks bundle deploy --target staging
        env:
          DATABRICKS_TOKEN: ${{ secrets.STAGING_DATABRICKS_TOKEN }}

      - name: Run Integration Test
        run: |
          databricks bundle run smoke_test_job --target staging --watch

  deploy-prod:
    name: Deploy to Production
    needs: validate
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4

      - name: Setup Databricks CLI
        uses: databricks/setup-cli@main

      - name: Build Python Wheel
        run: |
          pip install build
          python -m build

      - name: Deploy to Production
        run: databricks bundle deploy --target prod
        env:
          DATABRICKS_TOKEN: ${{ secrets.PROD_DATABRICKS_TOKEN }}

Variable Management and Secrets

# databricks.yml — reference variables
variables:
  sql_warehouse_id:
    description: "SQL Warehouse ID for Gold queries"
    # Set via environment variable: BUNDLE_VAR_sql_warehouse_id

  db_password:
    description: "Database password"
    # Use Databricks secret scope — never put secrets in YAML
# Inject variable at deploy time
BUNDLE_VAR_sql_warehouse_id=d3f8f59331f78ac5 databricks bundle deploy --target prod
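The `BUNDLE_VAR_<name>` convention is easy to script when several variables need injecting at once. A sketch (the helper name is illustrative):

```python
import os


def bundle_var_env(variables, base_env=None):
    """Return an environment dict with BUNDLE_VAR_* entries added.

    The prefix is uppercase, but the variable name keeps exactly the case
    declared in databricks.yml (e.g. BUNDLE_VAR_sql_warehouse_id).
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update({f"BUNDLE_VAR_{name}": value for name, value in variables.items()})
    return env
```

The result can be passed straight to `subprocess.run([...], env=bundle_var_env({...}))` when invoking `databricks bundle deploy` from a script.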

For secrets, always use Databricks Secret Scopes (never YAML):

# In notebook/Python code
db_password = dbutils.secrets.get(scope="harbinger", key="db-password")

Bundle Testing Strategies

Unit Tests (No Cluster Required)

# tests/test_silver_transform.py
import pytest
from pyspark.sql import SparkSession
from src.transforms.silver import deduplicate_events

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[2]").appName("test").getOrCreate()

def test_deduplication(spark):
    data = [
        ("evt_001", "2024-01-15T10:00:00", "PURCHASE", 100.0),
        ("evt_001", "2024-01-15T10:00:00", "PURCHASE", 100.0),  # duplicate
        ("evt_002", "2024-01-15T11:00:00", "CLICK", 0.0),
    ]
    df = spark.createDataFrame(data, ["event_id", "event_ts", "event_type", "amount"])
    result = deduplicate_events(df, key_col="event_id")
    assert result.count() == 2  # duplicate removed
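The test above assumes a `deduplicate_events` transform in `src/transforms/silver.py`. Its semantics, keep the first row seen per key, can be sketched in pure Python so the logic runs without Spark; the real implementation would be roughly `df.dropDuplicates([key_col])`:

```python
def deduplicate_events_rows(rows, key_col="event_id"):
    """Pure-Python sketch of the dedup semantics: keep first row per key.

    Mirrors what DataFrame.dropDuplicates([key_col]) does, on plain dicts,
    so the behavior is testable without a cluster.
    """
    seen = set()
    kept = []
    for row in rows:
        key = row[key_col]
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept
```

Note that `dropDuplicates` makes no ordering guarantee about which duplicate survives, so tests should assert on counts and keys rather than on a specific surviving row.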

Integration Tests (Require Cluster)

# Run a lightweight integration test job defined in the bundle
databricks bundle run integration_test --target staging --watch

# Check exit code
echo "Exit code: $?"

Monitoring Deployed Bundles

Track your deployed resources via system tables. Job run history lives in the `system.lakeflow` schema (job names, if you need to filter on them, are in `system.lakeflow.jobs`):

SELECT
  job_id,
  run_id,
  result_state,
  period_start_time,
  period_end_time,
  TIMESTAMPDIFF(SECOND, period_start_time, period_end_time) AS duration_sec
FROM system.lakeflow.job_run_timeline
WHERE DATE(period_start_time) >= CURRENT_DATE - INTERVAL 7 DAYS
ORDER BY period_start_time DESC;

Connect your bundle deployments to Harbinger Explorer for centralized monitoring — track job run history, detect failures, and correlate deployment events with pipeline anomalies across environments.


Common Pitfalls

- Bundle validates but deploy fails: check workspace permissions; make sure the deploying user or service principal can create resources in the target workspace.
- Notebook paths not found: relative paths resolve from the YAML file that declares them, not from the bundle root.
- Variable not injected: follow the BUNDLE_VAR_<name> convention (uppercase prefix, variable name exactly as declared).
- Dev-mode name conflicts: mode: development adds a [dev username] prefix, so don't hardcode fully qualified resource names.
- Wheel not found: build the wheel before deploying and make sure dist/*.whl exists.

Conclusion

Databricks Asset Bundles are the right way to manage Databricks infrastructure at scale. They bring software engineering discipline — version control, CI/CD, environment parity, code review — to data platform deployments.

The learning curve is worth it: once your team adopts DABs, deployments become reproducible, rollbacks become trivial, and cross-environment consistency is guaranteed. Start with a single job, then expand to your full platform incrementally.


Try Harbinger Explorer free for 7 days — get full visibility into your DABs-deployed resources, track deployments across environments, and monitor job health without custom tooling. Start your free trial at harbingerexplorer.com

