Running Data Workloads on Kubernetes: Patterns and Pitfalls

14 min read · Tags: kubernetes, spark, kafka, data-pipelines, platform-engineering, k8s

Kubernetes was designed for stateless, containerized applications. Data workloads are often stateful, resource-hungry, and sensitive to scheduling jitter. Yet teams are increasingly consolidating data infrastructure onto K8s — driven by the promise of unified scheduling, better resource utilization, and CI/CD-native workflows.

This guide covers what actually works in production: running Spark on Kubernetes, operating Kafka via operators, scheduling data pipelines with Argo Workflows, and avoiding the most common failure patterns.


Why Run Data Workloads on K8s?

The case for K8s-native data infrastructure isn't about hype — it's about operational leverage:

| Traditional Approach | K8s-Native Approach |
| --- | --- |
| Per-cluster Spark on YARN | Spark-on-K8s with namespace-level isolation |
| Kafka VMs with manual broker scaling | Strimzi operator with GitOps-managed config |
| Airflow on dedicated VMs | Airflow with the Kubernetes Executor, or Argo Workflows |
| Separate infra per team | Multi-tenant namespaces with ResourceQuotas |
| Manual node provisioning | Karpenter / Cluster Autoscaler with node pools |

The tradeoff: stateful workloads need persistent storage, controlled eviction, and careful network configuration that stateless apps don't require. Get these right and K8s becomes a genuine force multiplier.
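As a concrete instance of the multi-tenant namespace pattern from the table above, per-team isolation is usually enforced with a ResourceQuota. A minimal sketch — the names and limits below are illustrative, not from this setup:

```yaml
# Caps aggregate CPU/memory and PVC count for one team's namespace.
# All values are illustrative placeholders.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: data-platform-quota
  namespace: data-platform
spec:
  hard:
    requests.cpu: "200"
    requests.memory: 800Gi
    limits.cpu: "400"
    limits.memory: 1600Gi
    persistentvolumeclaims: "50"
```

A quota like this keeps one runaway Spark job from starving other tenants on shared node pools.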


Spark on Kubernetes

Architecture

[Architecture diagram: spark-submit → Kubernetes API server → driver pod → executor pods]

Submitting Jobs

# spark-submit against a K8s cluster
spark-submit \
  --master k8s://https://k8s-api.internal:6443 \
  --deploy-mode cluster \
  --name etl-orders-daily \
  --conf spark.kubernetes.namespace=data-platform \
  --conf spark.kubernetes.container.image=company-registry/spark:3.5.1-python3.11 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-executor \
  --conf spark.executor.instances=10 \
  --conf spark.executor.cores=4 \
  --conf spark.executor.memory=8g \
  --conf spark.driver.memory=4g \
  --conf spark.kubernetes.driver.request.cores=2 \
  --conf spark.kubernetes.executor.request.cores=3 \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=s3a://data-platform-logs/spark-events/ \
  --conf spark.kubernetes.executor.annotation.cluster-autoscaler.kubernetes.io/safe-to-evict=false \
  local:///opt/spark/work-dir/jobs/orders_daily.py

The safe-to-evict=false annotation is critical — without it, Cluster Autoscaler will evict executor pods mid-job when scaling down, causing cascading task failures.

Node Pools for Data Workloads

Don't mix Spark executors with API servers on the same node pool. Memory-intensive Spark jobs cause noisy-neighbor problems that create latency spikes in unrelated services.

# Terraform: dedicated node pool for Spark executors (GKE example)
resource "google_container_node_pool" "spark_executor_pool" {
  name       = "spark-executor-pool"
  cluster    = google_container_cluster.main.name
  location   = var.region

  autoscaling {
    min_node_count  = 0
    max_node_count  = 50
    location_policy = "BALANCED"
  }

  node_config {
    machine_type = "n2-highmem-16"  # 16 vCPU, 128 GB RAM
    disk_size_gb = 200
    disk_type    = "pd-ssd"

    taint {
      key    = "workload"
      value  = "spark-executor"
      effect = "NO_SCHEDULE"
    }

    labels = {
      workload = "spark-executor"
      team     = "data-platform"
    }

    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform",
    ]
  }
}

Match with a node selector and an executor pod template in your Spark config:

--conf spark.kubernetes.executor.node.selector.workload=spark-executor
--conf spark.kubernetes.executor.podTemplateFile=executor-pod-template.yaml
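Spark has no --conf key for tolerations; they are supplied through the executor pod template referenced by spark.kubernetes.executor.podTemplateFile. A minimal sketch (the filename is illustrative):

```yaml
# executor-pod-template.yaml — Spark merges this spec into
# every executor pod it creates for the job.
apiVersion: v1
kind: Pod
spec:
  tolerations:
    - key: workload
      operator: Equal
      value: spark-executor
      effect: NoSchedule
```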

Kafka on Kubernetes with Strimzi

The Strimzi operator is the production-grade way to run Kafka on K8s. It handles broker lifecycle, rolling upgrades, TLS certificate rotation, and Cruise Control integration for partition rebalancing.

Cluster Definition

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: data-platform-kafka
  namespace: kafka
spec:
  kafka:
    version: 3.7.0
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
        authentication:
          type: tls
      - name: external
        port: 9094
        type: loadbalancer
        tls: true
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
      inter.broker.protocol.version: "3.7"
      log.retention.hours: 168
      log.segment.bytes: 1073741824
      log.retention.check.interval.ms: 300000
    storage:
      type: jbod
      volumes:
        - id: 0
          type: persistent-claim
          size: 500Gi
          class: premium-ssd
          deleteClaim: false
    resources:
      requests:
        memory: 16Gi
        cpu: "4"
      limits:
        memory: 16Gi
        cpu: "8"
    rack:
      topologyKey: topology.kubernetes.io/zone
    template:
      pod:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                    - key: strimzi.io/name
                      operator: In
                      values:
                        - data-platform-kafka-kafka
                topologyKey: kubernetes.io/hostname
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 50Gi
      class: premium-ssd
      deleteClaim: false
  entityOperator:
    topicOperator: {}
    userOperator: {}
  cruiseControl: {}

The rack configuration maps to AZ topology — Strimzi will distribute replicas across zones automatically, which is essential for HA.

Topic Management via GitOps

# KafkaTopic CR — manage via Git, not kafka-topics.sh
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: orders-events-v2
  namespace: kafka
  labels:
    strimzi.io/cluster: data-platform-kafka
spec:
  partitions: 24
  replicas: 3
  config:
    retention.ms: "604800000"    # 7 days
    cleanup.policy: delete
    compression.type: lz4
    min.insync.replicas: "2"
    message.timestamp.type: LogAppendTime

Pipeline Orchestration with Argo Workflows

For data pipelines that need DAG semantics without a separate Airflow cluster, Argo Workflows is a compelling K8s-native option.

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: daily-etl-pipeline
  namespace: data-platform
spec:
  entrypoint: etl-dag
  serviceAccountName: argo-workflow-sa
  parallelism: 5
  arguments:
    parameters:
      - name: date
  
  templates:
    - name: etl-dag
      dag:
        tasks:
          - name: extract-orders
            template: spark-job
            arguments:
              parameters:
                - name: job-class
                  value: "com.company.etl.ExtractOrders"
                - name: date
                  value: "{{workflow.parameters.date}}"

          - name: extract-customers
            template: spark-job
            arguments:
              parameters:
                - name: job-class
                  value: "com.company.etl.ExtractCustomers"
                - name: date
                  value: "{{workflow.parameters.date}}"

          - name: transform-and-load
            dependencies: [extract-orders, extract-customers]
            template: spark-job
            arguments:
              parameters:
                - name: job-class
                  value: "com.company.etl.TransformAndLoad"
                - name: date
                  value: "{{workflow.parameters.date}}"

    - name: spark-job
      inputs:
        parameters:
          - name: job-class
          - name: date
      resource:
        action: create
        successCondition: status.applicationState.state == COMPLETED
        failureCondition: status.applicationState.state == FAILED
        manifest: |
          apiVersion: sparkoperator.k8s.io/v1beta2
          kind: SparkApplication
          metadata:
            generateName: etl-job-
            namespace: data-platform
          spec:
            type: Scala
            mode: cluster
            image: company-registry/spark-etl:latest
            mainClass: "{{inputs.parameters.job-class}}"
            arguments:
              - "--date={{inputs.parameters.date}}"
            driver:
              cores: 2
              memory: "4g"
              serviceAccount: spark-executor
            executor:
              cores: 4
              instances: 8
              memory: "8g"

Common Failure Patterns and Fixes

| Failure Mode | Root Cause | Fix |
| --- | --- | --- |
| Executor OOM kills | Memory limits too low or unbounded broadcast joins | Set spark.sql.autoBroadcastJoinThreshold=-1 to disable broadcast joins; tune executor memory |
| Driver pod evicted | Driver on a node with resource pressure | Assign a high-priority PriorityClass to driver pods |
| Shuffle data lost | Executor evicted mid-job | Use a remote shuffle service (e.g., Magnet, Uniffle) |
| Kafka lag accumulates | Consumer pod CPU throttled | Move consumers to a node pool without CPU limits |
| PVC provisioning delay | StorageClass binding mode | Use WaitForFirstConsumer volume binding mode |
| Slow pod startup | Large container images | Keep images under ~2 GB; pre-pull images onto nodes |
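The WaitForFirstConsumer fix from the table is a StorageClass setting. A sketch using GKE's persistent disk CSI driver — the provisioner and parameters will differ on other clouds:

```yaml
# Delays PV binding until a pod is scheduled, so the volume is
# provisioned in the same zone as the pod that will mount it.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: premium-ssd
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-ssd
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```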

Observability for K8s Data Workloads

Instrument at three layers:

  1. Infrastructure layer: node CPU/memory/disk saturation, pod scheduling latency
  2. Workload layer: Spark job duration, Kafka consumer lag, executor failure rate
  3. Data layer: table freshness, row count deltas, schema drift
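At the workload layer, consumer lag is the highest-signal Kafka metric. Assuming Kafka Exporter (which Strimzi can deploy alongside the cluster) and the Prometheus Operator are installed, an alert rule might look like this — the metric name, threshold, and namespace are assumptions:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-consumer-lag
  namespace: monitoring
spec:
  groups:
    - name: kafka.rules
      rules:
        - alert: KafkaConsumerLagHigh
          # kafka_consumergroup_lag is exposed by Kafka Exporter
          expr: sum by (consumergroup, topic) (kafka_consumergroup_lag) > 10000
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Consumer group {{ $labels.consumergroup }} is lagging on {{ $labels.topic }}"
```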

For platform teams managing dozens of data workloads across K8s namespaces, a unified data observability tool like Harbinger Explorer bridges the gap between K8s metrics (which tell you the pod OOM'd) and data metrics (which tell you which tables are stale as a result).


Summary

Running data workloads on Kubernetes is achievable in production — but it requires treating your data pods as first-class citizens in your scheduling, storage, and observability strategy.

Key patterns that work:

  • Dedicated node pools with taints for Spark executors
  • Strimzi operator with GitOps-managed KafkaTopic CRs
  • safe-to-evict=false on all executor pods
  • Argo Workflows for K8s-native DAG orchestration
  • Multi-layer observability: infra + workload + data

Try Harbinger Explorer free for 7 days — unified observability for your K8s data platform, from executor health to table freshness, all in one place.

