Serverless Data Processing: When It Works and When It Doesn't
Serverless is one of the most over-applied concepts in data engineering. The promise — infinite scale, zero ops, pay-per-invocation — attracts teams to use it for workloads it was never designed for. The result is systems that are expensive, hard to debug, and slower than what they replaced.
At the same time, serverless genuinely excels for specific data processing patterns. The problem is most teams don't have a clear decision framework for when to use it.
This guide gives you that framework, with honest benchmarks.
What "Serverless" Actually Means for Data Processing
The term covers several distinct execution models, each with different tradeoffs:
| Model | Examples | Unit of billing | Cold start |
|---|---|---|---|
| Function-as-a-Service (FaaS) | AWS Lambda, GCP Cloud Functions, Azure Functions | Per invocation + duration | 100ms–3s |
| Container-on-demand | Cloud Run, Lambda Container, Azure Container Apps | Per request + CPU-seconds | 1–10s |
| Serverless SQL | Athena, BigQuery, Synapse Serverless | Per TB scanned | N/A (query) |
| Serverless Spark | Databricks Serverless, EMR Serverless, Dataproc Serverless | DBU or vCPU-hours | 30s–2min |
| Serverless streaming | Kinesis, Pub/Sub, EventBridge Pipes | Per message/unit | N/A |
These are fundamentally different products. "Should I use serverless?" is the wrong question — "which serverless model fits this workload?" is the right one.
Where Serverless Works Well
1. Event-Driven Micro-Ingestion
Small, frequent, unpredictable events are the canonical serverless use case. An IoT sensor sends readings when it has something to report — not on a schedule. A webhook fires when a payment completes.
This works because:
- Events are small (< 1 MB each)
- Processing is stateless (each event is independent)
- Volume is unpredictable (Lambda handles 0→10,000 events/min without pre-provisioning)
- Cold starts are acceptable (background processing, not user-facing)
Lambda for webhook ingestion:
# lambda_function.py
import json
import os
from datetime import datetime, timezone

import boto3

s3 = boto3.client('s3')
BUCKET = os.environ['BRONZE_BUCKET']

def lambda_handler(event, context):
    # Parse webhook payload
    payload = json.loads(event['body'])

    # Enrich with metadata; capture the timestamp once so the partition
    # date always matches _ingested_at, even across midnight
    now = datetime.now(timezone.utc)
    record = {
        **payload,
        "_ingested_at": now.isoformat(),
        "_source": event['headers'].get('X-Webhook-Source', 'unknown'),
        "_partition_date": now.strftime('%Y/%m/%d')
    }

    # Write to S3 with date-based partitioning; the request ID makes the key unique
    key = f"webhooks/{record['_partition_date']}/{context.aws_request_id}.json"
    s3.put_object(
        Bucket=BUCKET,
        Key=key,
        Body=json.dumps(record),
        ContentType='application/json'
    )
    return {'statusCode': 200, 'body': 'OK'}
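One way to keep a handler like this testable is to pass the client and bucket in as parameters instead of reading module globals, so it can be smoke-tested locally with a mocked client and a fake context. A minimal sketch; the `build_record` and `handle` names and the test values are illustrative, not part of the handler above:

```python
import json
from datetime import datetime, timezone
from types import SimpleNamespace
from unittest.mock import MagicMock

def build_record(payload, headers):
    """Pure helper: enrich a webhook payload with ingestion metadata."""
    now = datetime.now(timezone.utc)
    return {
        **payload,
        "_ingested_at": now.isoformat(),
        "_source": headers.get("X-Webhook-Source", "unknown"),
        "_partition_date": now.strftime("%Y/%m/%d"),
    }

def handle(event, context, s3, bucket):
    """Same logic as the handler above, with the S3 client injected."""
    record = build_record(json.loads(event["body"]), event["headers"])
    key = f"webhooks/{record['_partition_date']}/{context.aws_request_id}.json"
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(record),
                  ContentType="application/json")
    return {"statusCode": 200, "body": "OK"}

# Local smoke test: mocked S3 client, fake context, no AWS account needed
fake_s3 = MagicMock()
ctx = SimpleNamespace(aws_request_id="test-req-1")
event = {"body": json.dumps({"amount": 42}),
         "headers": {"X-Webhook-Source": "stripe"}}
resp = handle(event, ctx, s3=fake_s3, bucket="bronze-test")
assert resp["statusCode"] == 200
assert fake_s3.put_object.call_args.kwargs["Key"].endswith("test-req-1.json")
```

In the real function you'd keep the module-level client for warm-start reuse and have the thin `lambda_handler` delegate to `handle`.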
# Terraform — Lambda with SQS trigger and DLQ
resource "aws_lambda_function" "webhook_ingestion" {
  filename      = "webhook_ingestion.zip"
  function_name = "webhook-ingestion"
  role          = aws_iam_role.lambda_ingestion.arn
  handler       = "lambda_function.lambda_handler"
  runtime       = "python3.12"
  timeout       = 30
  memory_size   = 256

  environment {
    variables = {
      BRONZE_BUCKET = aws_s3_bucket.bronze.id
    }
  }

  dead_letter_config {
    target_arn = aws_sqs_queue.ingestion_dlq.arn
  }
}

resource "aws_lambda_event_source_mapping" "sqs_trigger" {
  event_source_arn = aws_sqs_queue.webhook_queue.arn
  function_name    = aws_lambda_function.webhook_ingestion.arn
  batch_size       = 100

  filter_criteria {
    filter {
      pattern = jsonencode({
        body = {
          event_type = ["payment.completed", "payment.failed"]
        }
      })
    }
  }
}
2. Serverless SQL for Ad-Hoc Analytics
Athena and BigQuery are the clearest serverless wins in the data space. Zero infrastructure, SQL interface, pay per TB scanned.
When it's the right call:
- Queries run roughly 0–20 times per day (on-demand is cheaper than reserved capacity)
- Data is already in S3/GCS (no movement cost)
- Queries are exploratory, not production SLA-bound
-- Athena query with partition pruning (fast + cheap)
SELECT
    event_type,
    COUNT(*) AS event_count,
    SUM(revenue_usd) AS total_revenue
FROM events
WHERE year = '2024'
  AND month = '03'
  AND day BETWEEN '01' AND '31'
  AND event_type IN ('purchase', 'subscription')
GROUP BY 1
ORDER BY 3 DESC;
-- Scans ~2 GB (partitioned) vs 800 GB (unpartitioned) — 400x cost difference
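The arithmetic behind that comment is easy to sanity-check. A quick sketch, assuming Athena's common $5-per-TB on-demand rate and its 10 MB per-query billing minimum; verify both against current pricing for your region:

```python
ATHENA_USD_PER_TB = 5.00  # common on-demand rate; check your region

def athena_query_cost(bytes_scanned: int) -> float:
    """Cost of one Athena query, given bytes scanned (10 MB minimum billed)."""
    billed = max(bytes_scanned, 10 * 1024**2)
    return billed / 1024**4 * ATHENA_USD_PER_TB

GB = 1024**3
pruned = athena_query_cost(2 * GB)    # partition-pruned scan
full = athena_query_cost(800 * GB)    # full-table scan
# ratio comes out to 400x, matching the comment above
print(f"pruned: ${pruned:.4f}  full: ${full:.2f}  ratio: {full / pruned:.0f}x")
```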
3. Orchestration Glue and Data Quality Checks
Lightweight, infrequent jobs that check data quality, trigger downstream pipelines, or fan out work are ideal for serverless.
# AWS Step Functions — serverless orchestration (CloudFormation)
StateMachine:
  Type: AWS::StepFunctions::StateMachine
  Properties:
    Definition:
      StartAt: ValidateSchema
      States:
        ValidateSchema:
          Type: Task
          Resource: !GetAtt SchemaValidationLambda.Arn
          Next: BranchByResult
        BranchByResult:
          Type: Choice
          Choices:
            - Variable: $.validation_passed
              BooleanEquals: true
              Next: TriggerTransformation
          Default: AlertAndFail
        TriggerTransformation:
          Type: Task
          Resource: arn:aws:states:::glue:startJobRun.sync
          Parameters:
            JobName: silver-transformation
          End: true
        AlertAndFail:
          Type: Task
          Resource: !GetAtt AlertLambda.Arn
          Next: Fail
        Fail:
          Type: Fail
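The SchemaValidationLambda the state machine references can itself be a tiny function that emits the `validation_passed` flag the Choice state branches on. A hedged sketch; the `REQUIRED_FIELDS` contract and the `records` event shape are assumptions for illustration:

```python
# schema_validation.py — emits {"validation_passed": bool} for the Choice state
REQUIRED_FIELDS = {"event_id": str, "event_type": str, "occurred_at": str}

def lambda_handler(event, context):
    """Check each record for required fields and types; never raise on bad
    data — return a flag so the state machine decides what happens next."""
    errors = []
    for i, rec in enumerate(event.get("records", [])):
        for field, ftype in REQUIRED_FIELDS.items():
            if field not in rec:
                errors.append(f"record {i}: missing {field}")
            elif not isinstance(rec[field], ftype):
                errors.append(f"record {i}: {field} is not {ftype.__name__}")
    # Cap the error list so the state payload stays under Step Functions limits
    return {"validation_passed": not errors, "errors": errors[:20]}
```

Returning a flag rather than raising keeps the failure path (AlertAndFail) explicit in the state machine instead of buried in retry behavior.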
Where Serverless Fails
1. Long-Running, Memory-Intensive Batch Jobs
Lambda has a 15-minute timeout and 10 GB memory limit. Cloud Run has 60-minute timeout and 32 GB. Neither is appropriate for a 4-hour Spark job processing 10 TB of data.
The failure pattern: a team tries to replace a Spark cluster with Lambda for nightly ETL.
- The job runs for 2 hours, so Lambda times out at 15 minutes
- The team splits the job into 1,000 smaller Lambdas
- Cold starts add 30 minutes of overhead
- Coordination logic becomes more complex than the original job
- Cost: $340/night vs $12/night with Spot EMR
The irony: the operational simplicity of serverless disappears when you're orchestrating thousands of functions to simulate what Spark does natively.
2. High-Throughput Streaming with Stateful Processing
Lambda + Kinesis can handle ~1 MB/s per shard. For a 10 MB/s stream with stateful windowing (session analysis, fraud detection), you hit limits fast.
Benchmarks — 100 events/sec sustained for 8h:
| Approach | Cost | Latency P99 | Max throughput |
|---|---|---|---|
| Lambda (Kinesis trigger) | $18/day | 800ms | ~5k events/s |
| Flink on EKS | $22/day | 45ms | 500k+ events/s |
| Flink on EMR Serverless | $28/day | 55ms | 200k events/s |
Lambda loses on latency and throughput ceiling. Flink wins on both, and the cost delta is small at scale.
3. ML Inference at High Volume
A model inference Lambda handling 1,000 requests/second with a 100ms p50 latency looks cheap. Until you calculate:
1,000 req/s × 100ms × 1 GB memory = 100 GB-seconds/s
100 GB-seconds/s × 86,400 s/day = 8,640,000 GB-seconds/day
Cost: 8,640,000 × $0.0000166667 = $144/day = $4,320/month
Same workload on 3× ml.c5.2xlarge (8 vCPU, 16 GB):
$0.464/hr × 3 × 720hr = $1,002/month
Serverless is 4× more expensive for this workload, and you get worse tail latency due to cold starts.
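The calculation above is worth encoding so you can rerun it with your own numbers. A small sketch using the figures from this section; the $0.0000166667 per GB-second Lambda duration rate and the instance price are the ones quoted above, so substitute current pricing for your region:

```python
def lambda_monthly_cost(req_per_s, duration_ms, memory_gb,
                        usd_per_gb_second=0.0000166667, days=30):
    """Monthly Lambda compute cost (duration charges only, no request fees)."""
    gb_seconds_per_day = req_per_s * (duration_ms / 1000) * memory_gb * 86_400
    return gb_seconds_per_day * usd_per_gb_second * days

def ec2_monthly_cost(hourly_usd, instances, hours=720):
    """Monthly cost of a fixed fleet of always-on instances."""
    return hourly_usd * instances * hours

serverless = lambda_monthly_cost(1_000, 100, 1.0)  # ~$4,320/month
fleet = ec2_monthly_cost(0.464, 3)                 # ~$1,002/month
print(f"serverless is {serverless / fleet:.1f}x the fleet cost")
```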
4. The Cold Start Tax for User-Facing APIs
If your data API needs < 200ms p99 latency, serverless functions are usually the wrong choice without aggressive provisioned concurrency (which dramatically reduces the cost benefit).
Lambda cold start breakdown (Python 3.12, 512MB):
- Container initialization: 80-200ms
- Runtime initialization: 50-150ms
- Handler initialization (imports, connections): 100-500ms
Total: 230ms - 850ms added to first request
Provisioned concurrency eliminates cold starts but costs $0.0000646/function-second — roughly equivalent to keeping EC2 instances running.
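When you do stay on FaaS, the biggest lever on that handler-initialization number is moving one-time setup out of the per-request path. A dependency-free sketch of the pattern; the cached `object()` stands in for whatever is expensive to build, such as an SDK client or a loaded model:

```python
import functools

# Module scope runs once per execution environment (the "cold start");
# the handler body runs on every invocation. Cache anything expensive so
# warm invocations in the same container skip the setup cost entirely.

@functools.lru_cache(maxsize=1)
def get_client():
    # Stand-in for expensive setup, e.g. boto3.client("s3") or loading a
    # model from /opt. The first (cold) call pays the cost; later calls
    # in the same container get the cached object back.
    return object()

def lambda_handler(event, context):
    client = get_client()  # cold: constructs; warm: cache hit
    return {"statusCode": 200, "client_id": id(client)}

first = lambda_handler({}, None)
second = lambda_handler({}, None)
assert first["client_id"] == second["client_id"]  # same container, same client
```

The same reasoning argues for trimming imports: a handler that only writes to S3 should not import pandas at module scope.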
The Decision Framework
Work through four questions in order. How long does the job run? Anything past the 15-minute FaaS ceiling rules out Lambda outright. How much memory does it need? How often does it run? Infrequent and bursty favors serverless; daily or continuous favors provisioned capacity. Finally, compare actual costs at your real volume and latency requirements rather than assuming either model is cheaper.
Serverless Spark: The Middle Ground
AWS EMR Serverless and Databricks Serverless Compute solve the main pain points of traditional serverless for data workloads: no cold start lock-in, no timeout limits, genuine Spark-scale processing.
# EMR Serverless — submit Spark job
aws emr-serverless start-job-run \
  --application-id app-1234567890abcdef \
  --execution-role-arn arn:aws:iam::123456789:role/emr-serverless-execution \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://my-bucket/scripts/transform.py",
      "sparkSubmitParameters": "--conf spark.executor.cores=4 --conf spark.executor.memory=8g"
    }
  }' \
  --configuration-overrides '{
    "monitoringConfiguration": {
      "s3MonitoringConfiguration": {
        "logUri": "s3://my-bucket/logs/"
      }
    }
  }'
EMR Serverless vs EMR on EC2 (1 TB job, 2× monthly):
| | EMR Serverless | EMR on EC2 (Spot) |
|---|---|---|
| Setup time | 0 min | 8 min |
| Cost per run | ~$4.20 | ~$2.80 |
| Monthly (2 runs) | ~$8.40 | ~$5.60 + cluster fixed cost |
| Idle cost | $0 | $0 (if terminated) |
| Operational effort | Very low | Low-medium |
EMR Serverless wins clearly for infrequent jobs. For daily+ jobs, managed clusters with spot instances win on cost.
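The break-even point falls straight out of the per-run numbers above. A sketch; the fixed monthly cluster cost is a hypothetical placeholder you should replace with your own figure for AMI upkeep, bootstrap tooling, or whatever "cluster fixed cost" means in your org:

```python
SERVERLESS_PER_RUN = 4.20      # per-run cost from the table above
EC2_SPOT_PER_RUN = 2.80        # per-run cost from the table above
CLUSTER_FIXED_MONTHLY = 15.0   # hypothetical fixed overhead — tune for your org

def monthly_serverless(runs: int) -> float:
    return runs * SERVERLESS_PER_RUN

def monthly_ec2(runs: int) -> float:
    return runs * EC2_SPOT_PER_RUN + CLUSTER_FIXED_MONTHLY

# Serverless wins below this run count; managed clusters win above it
break_even = CLUSTER_FIXED_MONTHLY / (SERVERLESS_PER_RUN - EC2_SPOT_PER_RUN)
print(f"break-even ~ {break_even:.0f} runs/month")
for runs in (2, 10, 30):
    print(runs, f"${monthly_serverless(runs):.2f}", f"${monthly_ec2(runs):.2f}")
```

With these inputs the crossover sits near 11 runs a month, which is consistent with the "infrequent jobs favor serverless" conclusion above.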
Observability for Serverless Data Pipelines
The hardest part of serverless debugging: distributed execution, ephemeral logs, and no SSH.
# Structured logging for Lambda — mandatory for production
import json
import logging
import time
from functools import wraps

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def log_invocation(func):
    @wraps(func)
    def wrapper(event, context):
        start = time.time()
        request_id = context.aws_request_id
        logger.info(json.dumps({
            "event": "lambda_start",
            "request_id": request_id,
            "function": context.function_name,
            "memory_mb": context.memory_limit_in_mb,
            "records_count": len(event.get('Records', [event]))
        }))
        try:
            result = func(event, context)
            duration_ms = (time.time() - start) * 1000
            logger.info(json.dumps({
                "event": "lambda_success",
                "request_id": request_id,
                "duration_ms": round(duration_ms, 2),
                "remaining_ms": context.get_remaining_time_in_millis()
            }))
            return result
        except Exception as e:
            logger.error(json.dumps({
                "event": "lambda_error",
                "request_id": request_id,
                "error": str(e),
                "error_type": type(e).__name__
            }))
            raise
    return wrapper

@log_invocation
def lambda_handler(event, context):
    # Your actual logic here
    pass
Summary: The Honest Assessment
Serverless data processing is genuinely excellent for event-driven ingestion, ad-hoc SQL analytics, and infrequent batch jobs. It's a poor fit for long-running jobs, high-throughput streaming, ML inference at scale, and latency-sensitive user-facing APIs.
The industry is moving toward serverless Spark (EMR Serverless, Databricks Serverless) as a compelling middle ground — you get managed infrastructure and automatic scaling without the hard limits of FaaS.
Use the decision framework: duration, memory, frequency, and cost comparison. The right answer is workload-specific, not a blanket "serverless is modern, therefore correct."
Try Harbinger Explorer free for 7 days — test your serverless data API endpoints, validate response schemas under load, and identify cold start latency issues before your users do. harbingerexplorer.com