Databricks Asset Bundles (DABs): The Complete Deployment Guide

Table of contents

What are Databricks Asset Bundles?
Installation and setup
Project initialization
Bundle configuration in detail
databricks.yml: the root config
Defining jobs
Defining DLT pipelines
Python source structure
Writing bundle-compatible notebooks
CLI workflow
CI/CD with GitHub Actions
Variable management and secrets
Testing strategies for bundles
Unit tests (no cluster needed)
Integration tests (cluster needed)
Monitoring deployed bundles
Common pitfalls
FAQ
What is the difference between DABs and Terraform for Databricks?
Should I use mode: development or mode: production?
Can DABs deploy MLflow models?
How do you version wheels in a bundle?
Do DABs work with Unity Catalog?
Conclusion

TL;DR: Databricks Asset Bundles (DABs) are Terraform for Databricks: define jobs, DLT pipelines, notebooks, clusters, and permissions in YAML/Python, version them in Git, and deploy them across environments via CI/CD. The learning curve pays off with reproducible deployments, simple rollbacks, and consistent configuration across environments.

Databricks Asset Bundles (DABs) are Databricks' answer to infrastructure-as-code for data pipelines. With DABs you define your entire Databricks project (jobs, Delta Live Tables pipelines, notebooks, clusters, permissions) in YAML and Python, deploy it consistently across environments, and version everything in Git.

This guide covers everything from project initialization to production CI/CD pipelines.

What are Databricks Asset Bundles?

A bundle is a collection of Databricks resources, defined in YAML, that is deployed as a single unit. Think of it as Terraform for Databricks, but with first-class support for Databricks-specific concepts such as jobs, pipelines, and notebooks.

What you can define in a bundle:

Resource type	Description
`jobs`	Databricks Workflows (job clusters, task dependencies)
`pipelines`	Delta Live Tables (DLT) pipelines
`experiments`	MLflow experiments
`models`	MLflow model registrations
`clusters`	All-purpose cluster configurations
`dashboards`	Databricks SQL dashboards
`permissions`	Access control, set per resource or at the top level of the bundle

Installation and setup

# Install Databricks CLI (bundle commands need the new CLI, v0.205+; the legacy 0.18.x CLI lacks them)
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

# Verify installation
databricks --version

# Authenticate
databricks configure
# Prompts for host and token, or use environment variables:
export DATABRICKS_HOST="https://adb-WORKSPACE_ID.azuredatabricks.net"
export DATABRICKS_TOKEN="dapi..."

Project initialization

# Initialize a new bundle project
databricks bundle init

# Choose from templates:
# 1. Default Python  (recommended starting point)
# 2. DLT Python
# 3. MLOps Stacks

cd my-databricks-project
ls -la
# databricks.yml        -- main bundle config
# src/                  -- Python source files
# resources/            -- YAML resource definitions
# tests/                -- unit tests
# .github/workflows/    -- CI/CD (if selected)

Bundle configuration in detail

`databricks.yml`: the root config

bundle:
  name: harbinger-data-platform

# Define environments (targets)
targets:
  dev:
    mode: development  # adds [dev username] prefix to resources
    default: true
    workspace:
      host: https://adb-WORKSPACE_ID.azuredatabricks.net

  staging:
    mode: production
    workspace:
      host: https://adb-WORKSPACE_ID.azuredatabricks.net
    variables:
      env: staging
      cluster_size: "2"

  prod:
    mode: production
    workspace:
      host: https://adb-WORKSPACE_ID.azuredatabricks.net
    variables:
      env: prod
      cluster_size: "8"
    permissions:
      # Bundle-level permissions accept CAN_VIEW, CAN_RUN, or CAN_MANAGE
      - group_name: data-engineers
        level: CAN_RUN
      - group_name: data-analysts
        level: CAN_VIEW

# Variable definitions with defaults
variables:
  env:
    description: "Target environment"
    default: dev
  cluster_size:
    description: "Worker node count"
    default: "2"

# Include resource files
include:
  - resources/jobs/*.yml
  - resources/pipelines/*.yml
  - resources/clusters/*.yml
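
The interplay between top-level defaults and per-target overrides can be sketched in plain Python. This is a minimal model, not the CLI's implementation: `resolve_variables` and the dict layout are hypothetical, and the precedence shown (command-line values over target values over defaults) is a simplification that leaves out `BUNDLE_VAR_*` environment variables.

```python
from typing import Dict, Optional

# Top-level variable defaults; None marks a variable declared without a default.
DEFAULTS: Dict[str, Optional[str]] = {"env": "dev", "sql_warehouse_id": None}

# Per-target overrides, mirroring the `targets.<name>.variables` blocks.
TARGET_OVERRIDES: Dict[str, Dict[str, str]] = {
    "dev": {},
    "staging": {"env": "staging"},
    "prod": {"env": "prod"},
}


def resolve_variables(
    target: str, cli_vars: Optional[Dict[str, str]] = None
) -> Dict[str, str]:
    """Layer target overrides, then command-line values, over the defaults."""
    resolved: Dict[str, Optional[str]] = dict(DEFAULTS)
    resolved.update(TARGET_OVERRIDES[target])
    resolved.update(cli_vars or {})

    # A variable without any value should fail loudly, not deploy half-configured.
    missing = sorted(name for name, value in resolved.items() if value is None)
    if missing:
        raise ValueError(f"no value for variables: {', '.join(missing)}")
    return {name: str(value) for name, value in resolved.items()}


if __name__ == "__main__":
    # "abc123" is a placeholder warehouse ID.
    print(resolve_variables("prod", {"sql_warehouse_id": "abc123"}))
```

Failing on a missing value is a design choice in this sketch; it surfaces an unset `sql_warehouse_id` before anything is deployed.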

Defining jobs

# resources/jobs/medallion_pipeline.yml
resources:
  jobs:
    medallion_pipeline:
      name: "Medallion Pipeline [${var.env}]"

      email_notifications:
        on_failure:
          - data-alerts@yourcompany.com
        on_success:
          - data-reports@yourcompany.com

      health:
        rules:
          - metric: RUN_DURATION_SECONDS
            op: GREATER_THAN
            value: 7200

      trigger:
        pause_status: UNPAUSED
        periodic:
          interval: 1
          unit: DAYS

      tasks:
        - task_key: bronze_ingestion
          description: "Ingest raw events from landing zone"
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
            num_workers: 4
            spark_conf:
              spark.databricks.delta.optimizeWrite.enabled: "true"
          notebook_task:
            # Relative paths resolve from this YAML file, not from the bundle root
            notebook_path: ../../src/notebooks/bronze_ingestion.py
            base_parameters:
              env: ${var.env}
          libraries:
            - pypi:
                package: delta-spark==3.0.0

        - task_key: silver_transform
          depends_on:
            - task_key: bronze_ingestion
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            node_type_id: "Standard_DS4_v2"
            num_workers: 8
            azure_attributes:
              availability: SPOT_WITH_FALLBACK_AZURE
          python_wheel_task:
            package_name: harbinger_transforms
            entry_point: run_silver
            parameters:
              - "--env"
              - ${var.env}
          libraries:
            - whl: ../../dist/harbinger_transforms-*.whl

        - task_key: gold_aggregation
          depends_on:
            - task_key: silver_transform
          sql_task:
            warehouse_id: ${var.sql_warehouse_id}
            query:
              query_id: ""  # left empty in this example; set the ID of a saved query

        - task_key: dq_validation
          depends_on:
            - task_key: gold_aggregation
          # No cluster is declared here; add new_cluster or job_cluster_key unless serverless compute is intended
          notebook_task:
            notebook_path: ../../src/notebooks/data_quality_check.py
          run_if: ALL_SUCCESS
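
The `depends_on` entries above form a dependency graph that determines the run order. As a minimal sketch of that idea, the standard-library `graphlib` module can compute a valid order and reject cycles. The `TASK_DEPENDENCIES` dict is transcribed by hand from the YAML, and `execution_order` is a hypothetical helper, not something the Databricks CLI exposes.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List

# task_key -> list of task_keys it depends on (copied from the job definition).
TASK_DEPENDENCIES: Dict[str, List[str]] = {
    "bronze_ingestion": [],
    "silver_transform": ["bronze_ingestion"],
    "gold_aggregation": ["silver_transform"],
    "dq_validation": ["gold_aggregation"],
}


def execution_order(dependencies: Dict[str, List[str]]) -> List[str]:
    """Return one valid run order; raises graphlib.CycleError on circular deps."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    try:
        print(" -> ".join(execution_order(TASK_DEPENDENCIES)))
    except CycleError as exc:
        print(f"circular dependency: {exc.args[1]}")
```

A check like this could run in a unit test to catch a circular `depends_on` before `bundle validate` or the Jobs API reports it.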

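Defining DLT pipelines

# resources/pipelines/streaming_events.yml
resources:
  pipelines:
    streaming_events:
      name: "Streaming Events Pipeline [${var.env}]"
      # Derive the catalog from the environment instead of hard-coding prod
      catalog: ${var.env}
      target: silver

      # Bundle substitutions cannot evaluate expressions such as ${bundle.target == 'dev'}.
      # Targets with mode: development mark pipelines as development: true on deploy.
      development: false

      continuous: false

      clusters:
        - label: default
          num_workers: 4
          node_type_id: "Standard_DS3_v2"

      libraries:
        # Relative paths resolve from this YAML file, not from the bundle root
        - notebook:
            path: ../../src/dlt/streaming_pipeline.py
        - notebook:
            path: ../../src/dlt/quality_expectations.py

      configuration:
        pipelines.applyChangesPreviewEnabled: "true"
        spark.databricks.delta.optimizeWrite.enabled: "true"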

Python source structure

src/
├── notebooks/
│   ├── bronze_ingestion.py
│   ├── silver_transform.py
│   └── data_quality_check.py
├── dlt/
│   ├── streaming_pipeline.py
│   └── quality_expectations.py
├── transforms/
│   ├── __init__.py
│   ├── bronze.py
│   ├── silver.py
│   └── gold.py
└── utils/
    ├── __init__.py
    ├── spark_helpers.py
    └── schema_registry.py

Writing bundle-compatible notebooks

# Databricks notebook source
# File: src/notebooks/bronze_ingestion.py (the marker above must be the first line of the file)

# COMMAND ----------
# Parameters (injected by Databricks Workflows)
dbutils.widgets.text("env", "dev")
dbutils.widgets.text("batch_date", "2024-01-01")

env = dbutils.widgets.get("env")
batch_date = dbutils.widgets.get("batch_date")

print(f"Running bronze ingestion for env={env}, date={batch_date}")

# COMMAND ----------
from pyspark.sql.functions import col, current_timestamp, lit

source_path = f"abfss://landing@{env}storage.dfs.core.windows.net/events/{batch_date}/"
target_table = f"{env}.bronze.events_raw"

(
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", f"/mnt/checkpoints/{env}/bronze/events/schema")
    .load(source_path)
    .withColumn("_ingested_at", current_timestamp())
    # _metadata.file_path replaces input_file_name(), which is unsupported on Unity Catalog shared clusters
    .withColumn("_source_file", col("_metadata.file_path"))
    .withColumn("_env", lit(env))
    .writeStream
    .format("delta")
    .option("checkpointLocation", f"/mnt/checkpoints/{env}/bronze/events/stream")
    .option("mergeSchema", "true")
    .outputMode("append")
    .trigger(availableNow=True)
    .toTable(target_table)  # starts the stream and returns a StreamingQuery
    .awaitTermination()
)

print(f"Bronze ingestion complete: {target_table}")
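
One way to keep a notebook like this testable is to move the string building into a pure function that needs neither `spark` nor `dbutils`. The sketch below assumes the same naming scheme as the notebook; `BronzePaths`, `bronze_paths`, and the set of allowed environments are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from datetime import date

ALLOWED_ENVS = {"dev", "staging", "prod"}  # assumption: matches the bundle targets


@dataclass(frozen=True)
class BronzePaths:
    source_path: str
    target_table: str
    schema_location: str
    checkpoint_location: str


def bronze_paths(env: str, batch_date: str) -> BronzePaths:
    """Derive all locations for the bronze ingestion from env and batch date."""
    if env not in ALLOWED_ENVS:
        raise ValueError(f"unknown env: {env!r}")
    date.fromisoformat(batch_date)  # raises ValueError on a malformed date

    checkpoint_root = f"/mnt/checkpoints/{env}/bronze/events"
    return BronzePaths(
        source_path=(
            f"abfss://landing@{env}storage.dfs.core.windows.net/events/{batch_date}/"
        ),
        target_table=f"{env}.bronze.events_raw",
        schema_location=f"{checkpoint_root}/schema",
        checkpoint_location=f"{checkpoint_root}/stream",
    )


if __name__ == "__main__":
    print(bronze_paths("dev", "2024-01-15"))
```

The notebook would then read the widgets, call `bronze_paths(env, batch_date)`, and pass the fields to the reader and writer, which lets a plain pytest run cover the validation without a cluster.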

CLI workflow

# Validate bundle configuration (syntax check)
databricks bundle validate

# Deploy to dev (default target)
databricks bundle deploy

# Deploy to specific target
databricks bundle deploy --target staging
databricks bundle deploy --target prod

# Run a specific job after deployment
databricks bundle run medallion_pipeline

# Run with parameter overrides (the flag takes a comma-separated list, not a JSON array)
databricks bundle run medallion_pipeline \
  --python-params="--env,staging,--batch-date,2024-01-15"

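# bundle run already blocks and reports status until the run finishes;
# pass --no-wait to return right after the run is triggered
databricks bundle run medallion_pipeline --no-wait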

# Destroy all deployed resources
databricks bundle destroy --target dev --auto-approve
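
For scripted releases, the validate, deploy, run sequence above can be wrapped in a few lines of Python. This is a sketch under assumptions: `bundle_argv` and `release` are hypothetical helpers, the `databricks` binary is expected on `PATH`, and authentication comes from the environment variables shown earlier.

```python
import subprocess
from typing import Callable, List, Optional


def bundle_argv(
    action: str, target: Optional[str] = None, resource: Optional[str] = None
) -> List[str]:
    """Build the argument list for one `databricks bundle` subcommand."""
    argv = ["databricks", "bundle", action]
    if resource:
        argv.append(resource)
    if target:
        argv += ["--target", target]
    return argv


def release(
    target: str, job_key: str, run: Callable[..., object] = subprocess.run
) -> None:
    """Validate, deploy, then run one job; stop at the first failing step."""
    steps = [
        bundle_argv("validate", target),
        bundle_argv("deploy", target),
        bundle_argv("run", target, job_key),
    ]
    for argv in steps:
        # check=True raises CalledProcessError on a non-zero exit code.
        run(argv, check=True)


if __name__ == "__main__":
    # Print the commands instead of executing them.
    release("staging", "medallion_pipeline", run=lambda argv, check: print(" ".join(argv)))
```

Passing `run` as a parameter keeps the helper testable without a workspace; the default executes the real CLI.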

CI/CD with GitHub Actions

# .github/workflows/databricks-deploy.yml
name: Deploy Databricks Bundle

on:
  push:
    branches: [main, staging]
  pull_request:
    branches: [main]

env:
  DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
  DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}

jobs:
  validate:
    name: Validate Bundle
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Databricks CLI
        uses: databricks/setup-cli@main

      - name: Validate Bundle
        run: databricks bundle validate

      - name: Run Unit Tests
        run: |
          pip install pytest pyspark delta-spark
          pytest tests/ -v --tb=short

  deploy-staging:
    name: Deploy to Staging
    needs: validate
    if: github.ref == 'refs/heads/staging'
    runs-on: ubuntu-latest
    environment: staging
    env:
      # Job-level override so that both the deploy and the test run use the staging token
      DATABRICKS_TOKEN: ${{ secrets.STAGING_DATABRICKS_TOKEN }}
    steps:
      - uses: actions/checkout@v4

      - name: Setup Databricks CLI
        uses: databricks/setup-cli@main

      - name: Build Python Wheel
        run: |
          pip install build
          python -m build

      - name: Deploy to Staging
        run: databricks bundle deploy --target staging

      - name: Run Integration Test
        run: |
          # smoke_test_job must be defined under resources.jobs; bundle run waits for the run to finish
          databricks bundle run smoke_test_job --target staging

  deploy-prod:
    name: Deploy to Production
    needs: validate
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4

      - name: Setup Databricks CLI
        uses: databricks/setup-cli@main

      - name: Build Python Wheel
        run: |
          pip install build
          python -m build

      - name: Deploy to Production
        run: databricks bundle deploy --target prod
        env:
          DATABRICKS_TOKEN: ${{ secrets.PROD_DATABRICKS_TOKEN }}
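
The branch gating in this workflow is easy to misread, so here is a small Python model of it. It is only an illustration of the `if:` conditions above; `jobs_for_ref` is a hypothetical function, and the pull-request ref format (`refs/pull/<number>/merge`) is how GitHub populates `github.ref` for `pull_request` events.

```python
from typing import Dict, List

# github.ref value -> deploy job guarded by the matching `if:` condition.
DEPLOY_JOBS: Dict[str, str] = {
    "refs/heads/staging": "deploy-staging",
    "refs/heads/main": "deploy-prod",
}


def jobs_for_ref(ref: str) -> List[str]:
    """Return the workflow jobs that run for a given github.ref."""
    jobs = ["validate"]  # validate has no `if:` and always runs
    if ref in DEPLOY_JOBS:
        jobs.append(DEPLOY_JOBS[ref])
    return jobs


if __name__ == "__main__":
    for ref in ("refs/heads/main", "refs/heads/staging", "refs/pull/42/merge"):
        print(ref, "->", jobs_for_ref(ref))
```

Note what the model makes visible: a push to `main` deploys to production after `validate` alone, without requiring a prior staging deploy.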

Variable management and secrets

# databricks.yml — reference variables
variables:
  sql_warehouse_id:
    description: "SQL Warehouse ID for Gold queries"
    # Set via environment variable: BUNDLE_VAR_sql_warehouse_id

  db_password:
    description: "Database password"
    # Use Databricks secret scope — never put secrets in YAML

# Inject variable at deploy time
BUNDLE_VAR_sql_warehouse_id=d3f8f59331f78ac5 databricks bundle deploy --target prod

For secrets, always use Databricks secret scopes (never YAML):

# In notebook/Python code
db_password = dbutils.secrets.get(scope="harbinger", key="db-password")
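
When the deploy is launched from a script rather than a shell one-liner, the `BUNDLE_VAR_<name>` convention can be applied programmatically. A minimal sketch, assuming non-secret values only: `bundle_var_env` is a hypothetical helper and `abc123` is a placeholder warehouse ID.

```python
import os
from typing import Dict, Mapping, Optional


def bundle_var_env(
    variables: Mapping[str, str], base_env: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Return a process environment with BUNDLE_VAR_<name> set per variable."""
    env = dict(os.environ if base_env is None else base_env)
    for name, value in variables.items():
        if not name.isidentifier():
            raise ValueError(f"invalid bundle variable name: {name!r}")
        # The prefix is uppercase; the variable name keeps its declared case.
        env[f"BUNDLE_VAR_{name}"] = str(value)
    return env


if __name__ == "__main__":
    env = bundle_var_env({"sql_warehouse_id": "abc123"}, base_env={})
    print(env)
    # To deploy with these values (requires the Databricks CLI):
    # subprocess.run(["databricks", "bundle", "deploy", "--target", "prod"],
    #                env=bundle_var_env({"sql_warehouse_id": "abc123"}), check=True)
```

Secrets such as `db_password` should not travel this way; they stay in a secret scope and are read at run time as shown above.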

Testing strategies for bundles

Unit tests (no cluster needed)

# tests/test_silver_transform.py
import pytest
from pyspark.sql import SparkSession
from src.transforms.silver import deduplicate_events

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[2]").appName("test").getOrCreate()

def test_deduplication(spark):
    data = [
        ("evt_001", "2024-01-15T10:00:00", "PURCHASE", 100.0),
        ("evt_001", "2024-01-15T10:00:00", "PURCHASE", 100.0),  # duplicate
        ("evt_002", "2024-01-15T11:00:00", "CLICK", 0.0),
    ]
    df = spark.createDataFrame(data, ["event_id", "event_ts", "event_type", "amount"])
    result = deduplicate_events(df, key_col="event_id")
    assert result.count() == 2  # duplicate removed
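
The test imports `deduplicate_events` without showing it. One possible implementation, assuming exact duplicates per key can be collapsed arbitrarily, is a thin wrapper around `DataFrame.dropDuplicates`. The signature is inferred from the test call; the real module may look different.

```python
# src/transforms/silver.py (hypothetical contents)
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported for type checkers only, so this module loads without Spark installed.
    from pyspark.sql import DataFrame


def deduplicate_events(df: "DataFrame", key_col: str = "event_id") -> "DataFrame":
    """Keep one row per key_col value.

    Which of several rows with the same key survives is not defined, so this
    is only safe when such rows are identical or interchangeable.
    """
    return df.dropDuplicates([key_col])
```

If the latest record per key has to win, a window function ordered by `event_ts` would be the usual alternative to `dropDuplicates`.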

Integration tests (cluster needed)

# Run a lightweight integration test job defined in the bundle
# (bundle run waits for the run to finish by default)
databricks bundle run integration_test --target staging

# Check exit code
echo "Exit code: $?"

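Monitoring deployed bundles

Track deployed resources via system tables:

-- Job names live in system.lakeflow.jobs, run slices in system.lakeflow.job_run_timeline
-- (the lakeflow schema was previously named workflow); verify columns in your workspace
WITH latest_jobs AS (
  SELECT workspace_id, job_id, name
  FROM system.lakeflow.jobs
  QUALIFY ROW_NUMBER() OVER (PARTITION BY workspace_id, job_id ORDER BY change_time DESC) = 1
)
SELECT
  t.job_id,
  j.name AS job_name,
  t.run_id,
  MAX(t.result_state) AS result_state,
  MIN(t.period_start_time) AS start_time,
  MAX(t.period_end_time) AS end_time,
  TIMESTAMPDIFF(SECOND, MIN(t.period_start_time), MAX(t.period_end_time)) AS duration_sec
FROM system.lakeflow.job_run_timeline AS t
JOIN latest_jobs AS j
  ON t.workspace_id = j.workspace_id AND t.job_id = j.job_id
WHERE j.name LIKE '%Medallion Pipeline%'
  AND t.period_start_time >= CURRENT_DATE - INTERVAL 7 DAYS
GROUP BY t.job_id, j.name, t.run_id
ORDER BY start_time DESC;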

Connect bundle deployments to Harbinger Explorer for centralized monitoring: track job run history, detect failures, and correlate deployment events with pipeline anomalies across environments.

Common pitfalls

Problem	Solution
`bundle validate` passes, deploy fails	Check workspace permissions; the service principal needs `CAN_MANAGE`
Notebook paths not found	Use relative paths, resolved from the YAML file that declares them (for example `../../src/...` from `resources/jobs/`), not absolute workspace paths
Variable not injected	Check the `BUNDLE_VAR_<name>` env var convention (uppercase prefix, variable name in its declared case)
Dev-mode naming conflicts	`mode: development` adds a `[dev username]` prefix, so don't hard-code resource names
Wheel not found	Build the wheel before deploying; make sure `dist/*.whl` exists
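
The path-related pitfall in the table is the one that tends to cost the most time: a relative path in a resource file resolves from the YAML file that declares it. The sketch below reproduces that resolution with the standard library so a path can be checked before deploying. `resolve_from_config` is a hypothetical helper, not a CLI feature.

```python
import posixpath


def resolve_from_config(config_file: str, relative_path: str) -> str:
    """Resolve a path the way a resource file sees it: from its own directory."""
    base_dir = posixpath.dirname(config_file)
    return posixpath.normpath(posixpath.join(base_dir, relative_path))


if __name__ == "__main__":
    config = "resources/jobs/medallion_pipeline.yml"
    # Climbing two levels reaches the bundle root, then src/.
    print(resolve_from_config(config, "../../src/notebooks/bronze_ingestion.py"))
    # A "./src/..." path stays below resources/jobs/ and will not be found.
    print(resolve_from_config(config, "./src/notebooks/bronze_ingestion.py"))
```

The first call yields `src/notebooks/bronze_ingestion.py`; the second yields `resources/jobs/src/notebooks/bronze_ingestion.py`, which is the typical cause of a "notebook not found" error.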

FAQ

What is the difference between DABs and Terraform for Databricks?

Terraform has a Databricks provider that also manages workspaces, clusters, and jobs. DABs are built closer to Databricks and understand notebook paths, wheel builds, and dev-mode prefixes. For pure Databricks stacks, DABs are more idiomatic; for hybrid infrastructure (AWS accounts, networking, Databricks), Terraform makes more sense.

Should I use `mode: development` or `mode: production`?

Use `development` for personal dev bundles (resource name prefix, schedules and triggers paused) and `production` for staging/prod (no prefixes, triggers active).
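
As a rough illustration of the naming difference, the sketch below models how a development-mode deploy prefixes a resource name. `deployed_name` is a hypothetical helper, and the prefix format `[dev <short user name>]` follows the comment in `databricks.yml` above; check the names in your workspace before relying on it.

```python
def deployed_name(name: str, mode: str, short_user: str) -> str:
    """Model the resource name as it appears in the workspace after deploy."""
    if mode == "development":
        # Each developer gets a private copy, distinguished by the prefix.
        return f"[dev {short_user}] {name}"
    return name


if __name__ == "__main__":
    # "jane_doe" is a made-up user name.
    print(deployed_name("Medallion Pipeline [dev]", "development", "jane_doe"))
    print(deployed_name("Medallion Pipeline [prod]", "production", "jane_doe"))
```

This is why lookups by exact job name break in dev: match on a suffix or, better, on the job ID that the deploy reports.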

Can DABs deploy MLflow models?

Yes, `models` and `experiments` resources are supported. The MLOps Stacks template contains an end-to-end example.

How do you version wheels in a bundle?

Build the wheel on every CI run, write the build artifact to `dist/`, and reference it in the bundle as a `whl` library. If the wheel is stored on S3/ADLS, inject its path via a variable.
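
A glob such as `dist/harbinger_transforms-*.whl` becomes ambiguous once an older build is left behind in `dist/`. A pre-deploy check along these lines can catch that; it is a sketch, with `find_single_wheel` being a hypothetical helper and the wheel file names made up for the example.

```python
from pathlib import Path
from typing import Union


def find_single_wheel(dist_dir: Union[str, Path], package_name: str) -> Path:
    """Return the only wheel for package_name, or fail if none or several exist."""
    wheels = sorted(Path(dist_dir).glob(f"{package_name}-*.whl"))
    if len(wheels) != 1:
        found = ", ".join(w.name for w in wheels) or "none"
        raise RuntimeError(
            f"expected exactly one {package_name} wheel in {dist_dir}, found: {found}"
        )
    return wheels[0]


if __name__ == "__main__":
    try:
        print(find_single_wheel("dist", "harbinger_transforms"))
    except RuntimeError as exc:
        print(f"pre-deploy check failed: {exc}")
```

In CI, a clean checkout plus a single `python -m build` normally guarantees one wheel; the check mostly protects local deploys from stale artifacts.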

Do DABs work with Unity Catalog?

Yes. Set the catalog and schema where a resource exposes them (for example `catalog` and `target` on a pipeline) and use three-level table names in code. Manage permissions via UC grants outside the bundle, because bundles cannot yet fully declare UC grants.

Conclusion

Databricks Asset Bundles are the right way to manage Databricks infrastructure at scale. They bring software engineering discipline (version control, CI/CD, environment parity, code review) to data platform deployments.

The learning curve is worth it: once your team adopts DABs, deployments become reproducible, rollbacks become straightforward, and environments stay consistent. Start with a single job, then expand incrementally to the full platform.

As of May 14, 2026.

Written by

Harbinger Team

Cloud, data, and AI engineer in the DACH region. Has been writing about infrastructure-critical tech decisions since 2018: no marketing slides, just real trade-offs from production workloads.

More about Marc: hello@harbingerexplorer.com
