
Containerized Data Pipelines: Docker and Kubernetes for Platform Engineers

Tags: kubernetes, docker, airflow, spark, data-pipelines, devops

Container-native data pipelines have moved from experimental to standard practice. The combination of Docker for reproducible environments and Kubernetes for elastic scheduling has transformed how platform teams build, deploy, and operate data workloads. This guide goes deep on patterns that actually work in production.


Why Containers for Data Pipelines?

The case for containerisation has never been stronger:

| Problem (Pre-Container) | Solution (Container-Native) |
| --- | --- |
| "Works on my machine" dependency hell | Immutable images with pinned dependencies |
| Shared-cluster resource contention | Namespace isolation with resource quotas |
| Scaling Spark requires a cluster resize | Spark Operator auto-provisions workers per job |
| Deploying a new Python version breaks everything | Each pipeline carries its own runtime |
| No audit trail for pipeline environments | Image digest = reproducible environment |
| Multi-team monolithic Airflow deployment | Isolated DAG environments per team |

The Container-Native Data Platform Stack

[Diagram: the container-native data platform stack]

Dockerfile Best Practices for Data Pipelines

Multi-Stage Build for PySpark

# Stage 1: Build dependencies
FROM python:3.11-slim AS builder

WORKDIR /build

# Install build tools
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    libpq-dev \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

# Stage 2: Runtime image — must match the builder's Python version,
# otherwise compiled wheels from the builder stage will not load
FROM python:3.11-slim AS runtime

# PySpark needs a JVM at runtime; libpq5 is the runtime half of libpq-dev
RUN apt-get update && apt-get install -y --no-install-recommends \
    default-jre-headless \
    libpq5 \
    && rm -rf /var/lib/apt/lists/*

# Non-root user for security
RUN useradd -m -u 1000 spark

# Copy installed packages from builder into the spark user's home
# (/root/.local would be unreadable once we drop to the non-root user)
COPY --from=builder --chown=spark:spark /root/.local /home/spark/.local
ENV PATH=/home/spark/.local/bin:${PATH}

# Install Spark
ENV SPARK_VERSION=3.5.1
RUN pip install --no-cache-dir pyspark==${SPARK_VERSION}

# Copy pipeline code
WORKDIR /app
COPY src/ ./src/
COPY entrypoint.sh .

RUN chmod +x entrypoint.sh

USER spark

ENTRYPOINT ["./entrypoint.sh"]

Key principles:

  • Multi-stage builds: keep runtime images small
  • Non-root user: reduces attack surface
  • Pin all versions: requirements.txt with exact hashes
  • No secrets in image: use K8s Secrets or external secret stores
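The pinning rule is easy to enforce mechanically. The sketch below is a hypothetical CI guard (not part of pip) that rejects requirement lines without an exact version pin; a stricter variant could also demand `--hash` entries on every line:

```python
def unpinned_requirements(text: str) -> list[str]:
    """Return requirement lines that are not pinned to an exact version."""
    bad = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and pip options such as --hash continuation lines
        if not line or line.startswith(("#", "--")):
            continue
        if "==" not in line:
            bad.append(line)
    return bad
```

Running `unpinned_requirements("pyspark==3.5.1\nboto3")` flags `boto3`, so the build can fail before an unpinned image is ever produced.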

Image Tagging Strategy

Never use :latest in production. Use immutable tags:

# Tag with git SHA
IMAGE_TAG=$(git rev-parse --short HEAD)
docker build -t ecr.amazonaws.com/harbinger/pipeline:${IMAGE_TAG} .
docker push ecr.amazonaws.com/harbinger/pipeline:${IMAGE_TAG}

# Also tag with semantic version for release images
docker tag ecr.amazonaws.com/harbinger/pipeline:${IMAGE_TAG} \
           ecr.amazonaws.com/harbinger/pipeline:v1.4.2
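The no-`:latest` policy can also be checked in CI. This is a hypothetical helper whose regexes assume the git-SHA and `vX.Y.Z` tag conventions used above:

```python
import re

GIT_SHA_TAG = re.compile(r"^[0-9a-f]{7,40}$")
SEMVER_TAG = re.compile(r"^v\d+\.\d+\.\d+$")

def is_immutable_ref(image_ref: str) -> bool:
    """Accept digest references and git-SHA or semver tags; reject mutable tags."""
    if "@sha256:" in image_ref:
        return True  # digests are immutable by definition
    _, sep, tag = image_ref.rpartition(":")
    if not sep or "/" in tag:
        return False  # no tag at all defaults to :latest — mutable
    return bool(GIT_SHA_TAG.match(tag) or SEMVER_TAG.match(tag))
```

A deploy script can then refuse any manifest whose image reference fails this check.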

Airflow on Kubernetes: KubernetesExecutor

The KubernetesExecutor spawns a fresh pod for every task. This is the production-grade approach for teams that want:

  • Task isolation: one failing task cannot affect another
  • Resource customisation: CPU/memory per task type
  • Horizontal scalability: no fixed worker pool
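Conceptually, the executor turns each task instance into a single-use pod. The sketch below is illustrative only — it is not Airflow's actual pod builder — but shows the shape of what gets submitted to the API server:

```python
def task_pod_manifest(dag_id: str, task_id: str, image: str, resources: dict) -> dict:
    """Build a one-shot worker pod for a single Airflow task instance."""
    name = f"{dag_id}-{task_id}".replace("_", "-").lower()
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "labels": {"dag_id": dag_id, "task_id": task_id},
        },
        "spec": {
            "restartPolicy": "Never",  # retries are Airflow's job, not Kubernetes'
            "containers": [{
                "name": "base",
                "image": image,
                "args": ["airflow", "tasks", "run", dag_id, task_id],
                "resources": resources,
            }],
        },
    }
```

Because the pod is `restartPolicy: Never` and deleted after completion, a crashed task leaves no lingering worker state behind.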

Helm Deployment

helm repo add apache-airflow https://airflow.apache.org
helm repo update

helm install airflow apache-airflow/airflow \
  --namespace airflow \
  --create-namespace \
  --set executor=KubernetesExecutor \
  --set images.airflow.repository=ecr.amazonaws.com/harbinger/airflow \
  --set images.airflow.tag=2.9.1 \
  --set config.core.dags_folder=/opt/airflow/dags \
  --values airflow-values.yaml

airflow-values.yaml:

executor: KubernetesExecutor

workers:
  resources:
    requests:
      memory: "1Gi"
      cpu: "500m"
    limits:
      memory: "4Gi"
      cpu: "2000m"

scheduler:
  resources:
    requests:
      memory: "512Mi"
      cpu: "250m"

webserver:
  replicas: 2

logs:
  persistence:
    enabled: false  # Use remote logging instead

config:
  logging:
    remote_logging: "True"
    remote_base_log_folder: s3://harbinger-airflow-logs/
    remote_log_conn_id: aws_default
  kubernetes_executor:
    namespace: airflow
    worker_container_repository: ecr.amazonaws.com/harbinger/airflow
    worker_container_tag: "2.9.1"
    delete_worker_pods: "True"
    delete_worker_pods_on_failure: "False"  # Keep failed pods for debugging

extraEnvFrom:
  - secretRef:
      name: airflow-connections

Task-Level Resource Overrides

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s
from datetime import datetime

with DAG(
    "harbinger_geopolitical_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:

    # Heavy Spark job — request more resources
    spark_transform = KubernetesPodOperator(
        task_id="spark_transform",
        name="spark-transform",
        namespace="airflow",
        image="ecr.amazonaws.com/harbinger/spark-pipeline:v1.4.2",
        image_pull_policy="Always",
        env_vars={
            "INPUT_PATH": "s3://harbinger-raw/events/",
            "OUTPUT_PATH": "s3://harbinger-processed/events/",
        },
        container_resources=k8s.V1ResourceRequirements(
            requests={"memory": "4Gi", "cpu": "2"},
            limits={"memory": "8Gi", "cpu": "4"},
        ),
        node_selector={"node-type": "compute-optimised"},
        tolerations=[
            k8s.V1Toleration(key="dedicated", value="spark", effect="NoSchedule")
        ],
        is_delete_operator_pod=True,
        get_logs=True,
    )

    # Light API call — minimal resources
    api_ingest = KubernetesPodOperator(
        task_id="api_ingest",
        name="api-ingest",
        namespace="airflow",
        image="ecr.amazonaws.com/harbinger/ingest:v1.4.2",
        container_resources=k8s.V1ResourceRequirements(
            requests={"memory": "256Mi", "cpu": "100m"},
            limits={"memory": "512Mi", "cpu": "500m"},
        ),
        is_delete_operator_pod=True,
    )

    api_ingest >> spark_transform

Spark Operator

The Kubernetes Operator for Apache Spark manages the full lifecycle of Spark applications as Kubernetes-native custom resources.

Install via Helm

helm repo add spark-operator https://kubeflow.github.io/spark-operator
helm install spark-operator spark-operator/spark-operator \
  --namespace spark-operator \
  --create-namespace \
  --set webhook.enable=true \
  --set metrics.enable=true \
  --set metrics.port=10254

SparkApplication CRD

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: harbinger-event-enrichment
  namespace: spark-jobs
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: ecr.amazonaws.com/harbinger/spark-pipeline:v1.4.2
  imagePullPolicy: Always
  mainApplicationFile: local:///app/src/enrich_events.py

  sparkVersion: "3.5.1"

  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 30

  driver:
    cores: 2
    memory: "2g"
    serviceAccount: spark
    labels:
      app: harbinger-spark
    annotations:
      iam.amazonaws.com/role: arn:aws:iam::123456789:role/SparkPipelineRole
    volumeMounts:
      - name: spark-local-dir
        mountPath: /tmp/spark-local

  executor:
    cores: 4
    instances: 10
    memory: "8g"
    memoryOverhead: "2g"
    labels:
      app: harbinger-spark
    tolerations:
      - key: dedicated
        value: spark
        effect: NoSchedule

  sparkConf:
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension"
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog"
    "spark.hadoop.fs.s3a.aws.credentials.provider": "com.amazonaws.auth.WebIdentityTokenCredentialsProvider"
    "spark.dynamicAllocation.enabled": "true"
    "spark.dynamicAllocation.minExecutors": "2"
    "spark.dynamicAllocation.maxExecutors": "50"
    "spark.dynamicAllocation.shuffleTracking.enabled": "true"

  volumes:
    - name: spark-local-dir
      emptyDir: {}
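Dynamic allocation sizes the executor pool to the task backlog, clamped between `minExecutors` and `maxExecutors`. A simplified sketch of the heuristic (ignoring Spark's exponential ramp-up schedule and idle-timeout decay):

```python
import math

def target_executors(backlog_tasks: int, cores_per_executor: int,
                     min_executors: int, max_executors: int,
                     cpus_per_task: int = 1) -> int:
    """Executors needed to run the current backlog, clamped to the configured bounds."""
    slots_per_executor = cores_per_executor // cpus_per_task
    wanted = math.ceil(backlog_tasks / slots_per_executor)
    return max(min_executors, min(wanted, max_executors))
```

With the config above (4 cores per executor, bounds 2–50), a 1,000-task stage requests the full 50 executors, while an idle application decays back to 2.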

Resource Management and Autoscaling

Namespace Resource Quotas

Prevent any single team from monopolising cluster resources:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: spark-jobs-quota
  namespace: spark-jobs
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 400Gi
    limits.cpu: "200"
    limits.memory: 800Gi
    count/pods: "500"
    count/sparkapplications.sparkoperator.k8s.io: "20"
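When capacity-planning against a quota like this, the Kubernetes quantity strings (`500m`, `4Gi`) have to be normalised first. A stdlib-only sketch — hypothetical helpers, and a real tool should use a full quantity parser that also handles decimal suffixes:

```python
def parse_cpu(q: str) -> float:
    """'500m' -> 0.5 cores, '2' -> 2.0 cores."""
    return float(q[:-1]) / 1000.0 if q.endswith("m") else float(q)

def parse_memory(q: str) -> int:
    """'512Mi' or '4Gi' -> bytes (binary suffixes only)."""
    units = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
    for suffix, factor in units.items():
        if q.endswith(suffix):
            return int(float(q[:-len(suffix)]) * factor)
    return int(q)

def fits_quota(pod_requests: list[dict], quota: dict) -> bool:
    """True if the summed pod requests stay within the namespace quota."""
    cpu = sum(parse_cpu(r["cpu"]) for r in pod_requests)
    mem = sum(parse_memory(r["memory"]) for r in pod_requests)
    return (cpu <= parse_cpu(quota["requests.cpu"])
            and mem <= parse_memory(quota["requests.memory"]))
```

For example, ten pods each requesting 2 CPUs and 8Gi fit comfortably inside the 100-CPU / 400Gi quota above.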

Cluster Autoscaler

For cloud-managed clusters (EKS, GKE, AKS), give the Cluster Autoscaler node groups it can scale — here from zero up to the Spark ceiling:

# EKS managed node group for Spark workers
resource "aws_eks_node_group" "spark" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "spark-workers"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = var.private_subnet_ids

  scaling_config {
    desired_size = 2
    min_size     = 0
    max_size     = 50
  }

  instance_types = ["r5.4xlarge"]  # Memory-optimised for Spark

  taint {
    key    = "dedicated"
    value  = "spark"
    effect = "NO_SCHEDULE"
  }

  labels = {
    "node-type" = "compute-optimised"
  }
}
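A back-of-the-envelope check that `max_size` can actually host the Spark executor ceiling. An r5.4xlarge offers 16 vCPU and 128 GiB; the sketch deliberately ignores daemonset and system overhead, so treat the result as a lower bound:

```python
import math

def nodes_needed(executors: int, exec_cores: int, exec_mem_gib: int,
                 node_cores: int = 16, node_mem_gib: int = 128) -> int:
    """Nodes required to bin-pack the executor fleet, taking the tighter of
    the CPU and memory constraints per node."""
    per_node = min(node_cores // exec_cores, node_mem_gib // exec_mem_gib)
    return math.ceil(executors / per_node)
```

Fifty executors at 4 cores / 10 GiB (8g heap + 2g overhead) pack four per node, so the fleet needs 13 nodes — well inside the node group's `max_size` of 50.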

CI/CD Pipeline for Data Pipelines

# .github/workflows/pipeline-deploy.yml
name: Build and Deploy Pipeline

on:
  push:
    branches: [main]
    paths:
      - 'pipelines/**'
      - 'Dockerfile'

jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      image-tag: ${{ github.sha }}

    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789:role/GitHubActionsRole
          aws-region: eu-west-1

      - name: Login to ECR
        id: ecr
        uses: aws-actions/amazon-ecr-login@v2

      # Immutable git-SHA tag only — no mutable :latest, per the tagging strategy above
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          push: true
          tags: ${{ steps.ecr.outputs.registry }}/harbinger/pipeline:${{ github.sha }}

      - name: Run security scan
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ steps.ecr.outputs.registry }}/harbinger/pipeline:${{ github.sha }}
          severity: CRITICAL,HIGH
          exit-code: 1

  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      # kubectl set image does not work on the SparkApplication CRD;
      # patch spec.image directly instead
      - name: Update SparkApplication image
        run: |
          kubectl patch sparkapplication harbinger-event-enrichment \
            -n spark-jobs --type merge \
            -p '{"spec": {"image": "ecr.amazonaws.com/harbinger/pipeline:${{ needs.build.outputs.image-tag }}"}}'

Observability

Key Metrics for Data Pipeline Pods

# ServiceMonitor for Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: spark-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: harbinger-spark
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
  namespaceSelector:
    matchNames:
      - spark-jobs

Grafana dashboard panels to build:

  • Pipeline throughput: records processed per minute
  • Pod restart count: detect crash loops
  • CPU throttling ratio: signals under-provisioned limits
  • Pending pod duration: K8s scheduling latency
  • Spark executor utilisation: detect over/under-scaling
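The throttling panel maps to the cAdvisor counters `container_cpu_cfs_throttled_periods_total` and `container_cpu_cfs_periods_total`; the alerting logic reduces to a ratio and a threshold (the 25% threshold below is illustrative, not a standard):

```python
def cpu_throttling_ratio(throttled_periods: int, total_periods: int) -> float:
    """Fraction of CFS scheduling periods in which the container was throttled."""
    return throttled_periods / total_periods if total_periods else 0.0

def limits_look_too_low(samples: list[tuple[int, int]], threshold: float = 0.25) -> bool:
    """Flag a container whose average throttling ratio exceeds the threshold,
    given (throttled, total) period-counter deltas per scrape interval."""
    ratios = [cpu_throttling_ratio(t, n) for t, n in samples]
    return sum(ratios) / len(ratios) > threshold if ratios else False
```

A sustained high ratio means the CPU limit, not demand, is the bottleneck — raise the limit rather than adding replicas.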

Conclusion

Container-native data pipelines with Kubernetes eliminate entire classes of operational problems — dependency conflicts, unpredictable resource usage, and environment drift. The initial investment in Dockerfiles, Helm charts, and Spark Operator configuration pays back quickly through faster deployments, better resource utilisation, and dramatically simpler debugging.

Platforms like Harbinger Explorer run their entire data ingestion and enrichment pipeline on Kubernetes, enabling them to scale from zero to hundreds of parallel Spark executors in response to geopolitical event surges — and scale back to near zero overnight.


Try Harbinger Explorer free for 7 days — powered by a Kubernetes-native data platform that scales with the world's events.

