CI/CD Pipelines for Databricks: A Production-Ready Guide

TL;DR: Notebook-first development is not an engineering practice. CI/CD for Databricks runs on Databricks Asset Bundles plus GitHub Actions: pull requests validate the bundle and run unit tests, a merge to main deploys to staging and runs integration tests, and only then promotes to prod. Use trunk-based development with short-lived feature branches.

One of the most common pain points for data engineering teams adopting Databricks is missing CI/CD. Notebooks are edited directly in the UI, "deployment" means copying files between folders, and production incidents trace back to someone running ad hoc code in the wrong workspace.

This guide walks through a production-ready CI/CD setup for Databricks using Databricks Asset Bundles (DABs) and GitHub Actions. By the end you will have automated testing, environment promotion, and deployments that your software engineering colleagues will actually respect.

The problem with notebook-first development

Most Databricks teams start with notebooks. They are quick to iterate on, easy to share, and need no setup. But they scale poorly:

No diff visibility: Git shows notebook JSON with base64-encoded outputs, not readable Python
Manual promotion: "deploy to prod" means dragging a notebook between workspace folders
No automated testing: how do you test a notebook before it runs on production data?
Environment drift: prod and dev notebooks diverge silently

The fix: treat your Databricks project like a real software project, with code in .py files, infrastructure as code, automated tests, and deployment pipelines.

Project structure

harbinger-pipelines/
├── .github/
│   └── workflows/
│       ├── ci.yml
│       └── deploy.yml
├── databricks.yml
├── resources/
│   ├── jobs.yml
│   └── pipelines.yml
├── src/
│   ├── pipelines/
│   │   ├── bronze/
│   │   │   └── ingest_events.py
│   │   ├── silver/
│   │   │   └── clean_events.py
│   │   └── gold/
│   │       └── aggregate_signals.py
│   └── utils/
│       ├── schema_utils.py
│       └── quality_checks.py
├── tests/
│   ├── unit/
│   │   └── test_clean_events.py
│   └── integration/
│       └── test_pipeline_e2e.py
├── notebooks/
├── pyproject.toml
├── requirements.txt
└── requirements-dev.txt

Databricks Asset Bundles (DABs)

Asset Bundles are Databricks' native infrastructure-as-code framework. They replace the older dbx tool and are the currently recommended approach.

Root `databricks.yml`

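bundle:
  name: harbinger-pipelines

# Load the job and pipeline definitions from resources/
include:
  - resources/*.yml

# Build the project wheel on deploy so python_wheel_task can install it.
# The build command is an assumption; use whatever your project builds with.
artifacts:
  default:
    type: whl
    build: pip wheel --no-deps --wheel-dir dist .
    path: .

variables:
  environment:
    description: Deployment environment
    default: dev
  catalog:
    description: Unity Catalog name
    default: harbinger_dev

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-7405613637854743.3.azuredatabricks.net
    variables:
      environment: dev
      catalog: harbinger_dev

  staging:
    # Note: development mode prefixes resource names and pauses schedules
    mode: development
    workspace:
      host: https://adb-7405613637854743.3.azuredatabricks.net
    variables:
      environment: staging
      catalog: harbinger_staging

  prod:
    mode: production
    workspace:
      host: https://adb-7405613637854743.3.azuredatabricks.net
    variables:
      environment: prod
      catalog: harbinger_prod
    run_as:
      # This field expects the service principal's application ID;
      # the value below is a placeholder
      service_principal_name: harbinger-prod-sp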
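
To make the per-target variables concrete, here is a small sketch of how a `${var.<name>}` reference ends up with a different value in each target. It only illustrates the substitution idea and is not how the Databricks CLI is implemented. The `resolve` helper is hypothetical, and the values are copied from the configuration in this section.

```python
import re

# Variable defaults and per-target overrides, copied from databricks.yml.
DEFAULTS = {"environment": "dev", "catalog": "harbinger_dev"}
TARGET_OVERRIDES = {
    "dev": {"environment": "dev", "catalog": "harbinger_dev"},
    "staging": {"environment": "staging", "catalog": "harbinger_staging"},
    "prod": {"environment": "prod", "catalog": "harbinger_prod"},
}

# Matches references such as ${var.environment}
VAR_PATTERN = re.compile(r"\$\{var\.([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve(template: str, target: str) -> str:
    """Substitute ${var.<name>} using the target's overrides, then the defaults."""
    values = {**DEFAULTS, **TARGET_OVERRIDES.get(target, {})}
    return VAR_PATTERN.sub(lambda m: values[m.group(1)], template)
```

Calling `resolve("harbinger-events-ingestion-${var.environment}", "staging")` yields a staging-specific job name, which is why one job definition can serve all three targets.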

Job definition (`resources/jobs.yml`)

resources:
  jobs:
    events_ingestion_job:
      name: "harbinger-events-ingestion-${var.environment}"
      schedule:
        # Quartz syntax: fires at second 0, minute 0 of every hour
        quartz_cron_expression: "0 0 * * * ?"
        timezone_id: "UTC"
      tasks:
        - task_key: ingest_bronze
          python_wheel_task:
            package_name: harbinger_pipelines
            entry_point: ingest_events
          libraries:
            # Wheel built from this repo; the path is relative to this file
            - whl: ../dist/*.whl
          job_cluster_key: default

        - task_key: clean_silver
          depends_on:
            - task_key: ingest_bronze
          python_wheel_task:
            package_name: harbinger_pipelines
            entry_point: clean_events
          libraries:
            - whl: ../dist/*.whl
          job_cluster_key: default

      job_clusters:
        - job_cluster_key: default
          new_cluster:
            spark_version: "15.4.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
            autoscale:
              min_workers: 2
              max_workers: 8
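
Each `python_wheel_task` calls an entry point that the wheel declares in its package metadata. A minimal sketch of what `ingest_events` could look like follows. The `--catalog` parameter, the table name, and the function names are assumptions for illustration; the job definition in this guide passes no parameters, so you would also need to add a `parameters` list to the task for the flag to arrive.

```python
import argparse
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse task parameters; --catalog is a hypothetical parameter."""
    parser = argparse.ArgumentParser(description="Ingest raw events into bronze")
    parser.add_argument("--catalog", default="harbinger_dev", help="Unity Catalog name")
    return parser.parse_args(argv)


def target_table(catalog: str) -> str:
    """Build the fully qualified bronze table name (schema and table are assumed)."""
    return f"{catalog}.bronze.events"


def main(argv: Optional[Sequence[str]] = None) -> None:
    """Entry point the wheel would expose under the name `ingest_events`."""
    args = parse_args(argv)
    # Import Spark lazily so argument handling stays testable without a cluster.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Placeholder for the real ingestion logic.
    print(f"Would ingest into {target_table(args.catalog)} using Spark {spark.version}")
```

Keeping argument parsing separate from the Spark call means the unit tests can cover `parse_args` and `target_table` without starting a JVM.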

Setting up GitHub Actions

CI workflow (`.github/workflows/ci.yml`)

Runs on every pull request. It validates the bundle configuration, runs linting and unit tests, and deploys to the dev target as a smoke test.

name: CI

on:
  pull_request:
    branches: [main]

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      # PySpark unit tests need a JVM; pin it instead of relying on the runner image
      - name: Set up Java
        uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: "17"

      - name: Install dependencies
        run: |
          pip install -r requirements-dev.txt
          pip install -e .

      - name: Run linting
        run: |
          ruff check src/ tests/
          mypy src/

      - name: Run unit tests
        # python -m adds the repo root to sys.path so the src.* imports resolve
        run: python -m pytest tests/unit/ -v --tb=short

      - name: Install Databricks CLI
        run: |
          curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

      - name: Validate bundle
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST_DEV }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN_DEV }}
        run: databricks bundle validate --target dev

      - name: Deploy to dev
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST_DEV }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN_DEV }}
        run: databricks bundle deploy --target dev

Deploy workflow (`.github/workflows/deploy.yml`)

Runs on merge to main. It deploys to staging, runs integration tests, and then promotes to prod.

name: Deploy

on:
  push:
    branches: [main]

jobs:
  deploy-staging:
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      # The integration tests need pytest; --timeout comes from the
      # pytest-timeout plugin (both assumed to be in requirements-dev.txt)
      - name: Install dependencies
        run: |
          pip install -r requirements-dev.txt
          pip install -e .

      - name: Install Databricks CLI
        run: curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

      - name: Deploy to staging
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN_STAGING }}
        run: databricks bundle deploy --target staging

      - name: Run integration tests
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN_STAGING }}
        run: pytest tests/integration/ -v --timeout=300

  deploy-prod:
    runs-on: ubuntu-latest
    # Only runs if the staging job, including its integration tests, succeeded
    needs: deploy-staging
    environment: production
    steps:
      - uses: actions/checkout@v4

      - name: Install Databricks CLI
        run: curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

      - name: Deploy to production
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN_PROD }}
        run: databricks bundle deploy --target prod

Writing testable code

The key to testable Databricks code: separate business logic from Spark I/O.

# src/pipelines/silver/clean_events.py

from pyspark.sql import DataFrame
from pyspark.sql.functions import col, trim, upper, when


def clean_event_type(df: DataFrame) -> DataFrame:
    """Map raw event_type aliases onto canonical categories."""
    # Normalize once: strip whitespace and uppercase before matching
    normalized = upper(trim(col("event_type")))
    return df.withColumn(
        "event_type",
        when(normalized.isin(["CONFLICT", "WAR", "BATTLE"]), "CONFLICT")
        .when(normalized.isin(["NATURAL_DISASTER", "EARTHQUAKE", "FLOOD"]), "DISASTER")
        .otherwise("UNKNOWN"),
    )


def filter_low_quality_events(df: DataFrame, min_severity: float = 0.5) -> DataFrame:
    """Keep rows with severity >= min_severity; rows with null severity are dropped."""
    return df.filter(col("severity") >= min_severity)

# tests/unit/test_clean_events.py

import pytest
from pyspark.sql import SparkSession
from src.pipelines.silver.clean_events import clean_event_type, filter_low_quality_events

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("test").getOrCreate()

def test_clean_event_type_normalizes_aliases(spark):
    data = [("war", 5.0), ("FLOOD", 7.0), ("political", 3.0)]
    df = spark.createDataFrame(data, ["event_type", "severity"])
    result = clean_event_type(df)
    types = {row.event_type for row in result.collect()}
    assert "CONFLICT" in types
    assert "DISASTER" in types
    assert "UNKNOWN" in types

def test_filter_removes_low_severity(spark):
    data = [("CONFLICT", 0.3), ("DISASTER", 0.8), ("CONFLICT", 0.5)]
    df = spark.createDataFrame(data, ["event_type", "severity"])
    result = filter_low_quality_events(df, min_severity=0.5)
    assert result.count() == 2
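
With the transformations kept pure, the Spark I/O can live in a thin wrapper that is easy to review. The following sketch shows one way to wire this up. `run_silver`, the table names, and the idea of passing transforms as a list are assumptions, not part of the original pipeline.

```python
from typing import Callable, Iterable


def run_silver(spark, source_table: str, target_table: str,
               transforms: Iterable[Callable]) -> None:
    """Read a table, apply pure DataFrame transforms in order, write the result."""
    df = spark.read.table(source_table)
    for transform in transforms:
        df = transform(df)
    # Overwrite keeps the sketch simple; a real job may prefer MERGE or append.
    df.write.mode("overwrite").saveAsTable(target_table)
```

In a job entry point this could be called as `run_silver(spark, "harbinger_dev.bronze.events", "harbinger_dev.silver.events", [clean_event_type, filter_low_quality_events])`. Only the wrapper touches tables, so the unit tests never need a catalog.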

Branching strategy

We recommend a trunk-based development model with short-lived feature branches:

main  (production)
  |
  +-- feature/add-gdelt-source     PR -> main
  +-- feature/optimize-silver      PR -> main
  +-- hotfix/fix-null-handling     PR -> main (fast-track)

main always reflects the production state
Feature branches are short-lived (1–3 days at most)
No long-lived dev or staging branches: use environment targets in DABs instead
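
Put differently, the Git event decides which bundle targets receive a deploy, and no branch maps to an environment. The mapping implied by the two workflows in this guide can be sketched as follows; `targets_for` is a hypothetical helper for illustration, not something GitHub Actions or the Databricks CLI provides.

```python
from typing import List


def targets_for(event: str, branch: str) -> List[str]:
    """Return the bundle targets a Git event deploys to, in order.

    For pull requests, `branch` is the base branch the PR points at.
    """
    if event == "pull_request" and branch == "main":
        return ["dev"]  # PR checks deploy to dev only
    if event == "push" and branch == "main":
        return ["staging", "prod"]  # prod only runs after staging succeeds
    return []  # any other branch or event deploys nothing
```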

Secrets management in CI/CD

Store Databricks tokens as GitHub Actions secrets, never in code:

gh secret set DATABRICKS_TOKEN_PROD --body "dapi..."
gh secret set DATABRICKS_TOKEN_STAGING --body "dapi..."
gh secret set DATABRICKS_TOKEN_DEV --body "dapi..."
gh secret set DATABRICKS_HOST --body "https://adb-xxx.azuredatabricks.net"
gh secret set DATABRICKS_HOST_DEV --body "https://adb-xxx.azuredatabricks.net"

Inside Databricks, use secret scopes instead of hardcoding credentials in job configs.
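
In job code, reading from a secret scope goes through `dbutils.secrets.get`. The sketch below wraps that call so the same function also works in local tests. The environment-variable fallback, the `get_secret` helper, and the scope and key names in the usage note are assumptions for illustration.

```python
import os
from typing import Any, Optional


def get_secret(scope: str, key: str, dbutils: Optional[Any] = None) -> str:
    """Fetch a secret from a Databricks secret scope, or from the environment locally."""
    if dbutils is not None:
        # dbutils.secrets.get is the Databricks utility for reading secret scopes.
        return dbutils.secrets.get(scope=scope, key=key)
    # Local/CI fallback (an assumption of this sketch): SCOPE_KEY as an env var.
    env_name = f"{scope}_{key}".upper().replace("-", "_")
    value = os.environ.get(env_name)
    if value is None:
        raise KeyError(f"Secret {scope}/{key} not found; set {env_name} or pass dbutils")
    return value
```

On a cluster you would call `get_secret("harbinger", "api-key", dbutils)`. In a local test, setting `HARBINGER_API_KEY` is enough.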

Monitoring deployments

After a deploy, verify the bundle state:

# Check what is deployed in each target
databricks bundle summary --target prod

# Run a job manually to verify
databricks bundle run events_ingestion_job --target prod

Common problems and fixes

Problem	Cause	Fix
`bundle validate` fails in CI	Missing `DATABRICKS_HOST` env var	Set the secrets in the GitHub repo settings
Unit tests fail with Spark errors	No JVM available on the CI runner	Add an `actions/setup-java` step (e.g. `java-version: '17'`; Spark 3.5 also runs on 11)
Deploy succeeds, job fails	Wheel not uploaded before the job run	Reference the wheel in the task's `libraries` so `bundle deploy` uploads it
Integration tests time out	Cluster cold start	Run tests on an instance pool or an already-running cluster, or raise the timeout
Prod deploy skipped	Staging tests failed	Fix the staging tests; never bypass the gate

FAQ

Should I deploy notebooks or Python wheels?

Python wheels for production code (testable, versionable, with pinned dependencies). Notebooks for the exploration and reporting layer, where interactive iteration matters.

How do I handle database migrations in CI/CD?

Apply schema changes through versioned SQL migrations (e.g. Alembic, Liquibase), and manage Unity Catalog grants as infrastructure as code. Never make schema changes ad hoc inside production pipelines.
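
The core idea behind versioned migrations is small: number the files, record which versions have been applied, and run only the rest in order. Tools such as Alembic or Liquibase handle this for you. The sketch below only illustrates the selection step, and the `V<number>__<description>.sql` naming scheme and file names are assumptions.

```python
import re
from typing import Iterable, List, Set

# Hypothetical naming scheme: V<number>__<description>.sql
MIGRATION_NAME = re.compile(r"^V(\d+)__.+\.sql$")


def pending_migrations(files: Iterable[str], applied: Set[int]) -> List[str]:
    """Return migration files not yet applied, ordered by version number."""
    versioned = []
    for name in files:
        match = MIGRATION_NAME.match(name)
        if match is None:
            raise ValueError(f"Unexpected migration file name: {name}")
        versioned.append((int(match.group(1)), name))
    # Sort numerically so V10 runs after V2, then drop what is already applied.
    return [name for version, name in sorted(versioned) if version not in applied]
```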

What if staging tests are non-deterministic?

Fix them first; do not skip them. Flaky integration tests are among the biggest risks in CI/CD: they train the team to ignore failures, and the one time a test fails for real, it gets clicked away.
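
Before fixing a flaky test you have to confirm that it is flaky. One simple approach is to run it repeatedly under identical conditions and check whether the outcome changes. The helpers below are a hypothetical sketch of that check, not a replacement for a proper rerun plugin.

```python
from typing import Callable, List


def outcomes(test_fn: Callable[[], None], runs: int = 5) -> List[bool]:
    """Run a test function several times and record pass (True) or fail (False)."""
    results = []
    for _ in range(runs):
        try:
            test_fn()
            results.append(True)
        except AssertionError:
            results.append(False)
    return results


def is_flaky(test_fn: Callable[[], None], runs: int = 5) -> bool:
    """A test is flaky if identical runs produce both passes and failures."""
    return len(set(outcomes(test_fn, runs))) > 1
```

A test that always fails is not flaky by this definition. It is simply broken, which is the easier problem to have.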

How do DACH teams host Databricks in a GDPR-compliant way?

Run Azure Databricks in West Europe (Netherlands), North Europe (Ireland), or Germany West Central (Frankfurt) for EU data residency. Use Unity Catalog for access control, service principals for automated (non-human) access, and ship audit logs to your own Log Analytics workspace.

Do I need `databricks bundle deploy` or Terraform?

DABs are built more closely around Databricks and are more idiomatic for pure Databricks stacks. Use Terraform for cross-cloud networking, identity providers, and multi-region setups.

Wrap-up

Proper CI/CD for Databricks is not optional. It is what separates teams that ship with confidence from teams that dread Friday deploys. Databricks Asset Bundles combined with GitHub Actions give you a clean, versioned, environment-aware deployment pipeline.

Start small: just adding `databricks bundle validate` to your PR checks catches configuration errors before they reach production.

As of 14 May 2026.

Written by

Harbinger Team

Cloud, data, and AI engineer in the DACH region. Writing about infrastructure-critical technology decisions since 2018: no marketing slides, just real trade-offs from production workloads.

More about Marc: hello@harbingerexplorer.com
