API Gateway Architecture Patterns for Data Platforms
Data platforms have traditionally served data through direct warehouse connections, JDBC endpoints, and blob storage presigned URLs. As organizations mature, these ad-hoc access patterns create governance nightmares: who has access to what, how much are they querying, and is the API contract stable enough for downstream teams to depend on?
API gateways solve this by inserting a managed control plane between data consumers and data systems. This guide covers the patterns that work in production for data platform teams.
Why Data Platforms Need API Gateways
The case for gateways isn't just about security theatre. It's about operational capability:
| Problem Without Gateway | Gateway Solution |
|---|---|
| Direct warehouse connections hit quota limits | Rate limiting per consumer |
| No visibility into who's querying what | Centralized access logging |
| Consumers tightly coupled to warehouse internals | Schema abstraction layer |
| Auth handled ad-hoc per service | Centralized OAuth/API key auth |
| No way to deprecate old schemas safely | API versioning + sunset headers |
| Cross-team data access negotiated manually | Self-service data product APIs |
Gateway Architecture Patterns
Pattern 1: Passthrough Gateway (Simple)
The simplest pattern: gateway handles auth and rate limiting, passes requests directly to the warehouse or data service.
[Diagram: consumers → passthrough gateway (auth, rate limiting) → warehouse]
Good for: small teams, internal APIs, early-stage data products.
Limitation: tightly couples API schema to warehouse schema — any warehouse refactor breaks consumers.
Pattern 2: Transformation Gateway (Recommended)
The gateway applies a transformation layer: incoming REST requests are translated into warehouse queries, responses are shaped before returning.
[Diagram: consumers → gateway → transformation layer (API schema ↔ warehouse schema) → warehouse]
This pattern enables stable API contracts that survive warehouse refactors. The transformation layer owns the mapping between API schema (what consumers see) and warehouse schema (what actually stores the data).
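The transformation layer can be as small as a mapping table plus a query builder. A minimal sketch, assuming a hypothetical `orders` product backed by a warehouse table named `analytics.fct_orders_v3` (all names here are illustrative, not part of any real schema):

```python
# Sketch of a transformation layer: API field names map to warehouse columns,
# so a warehouse rename only touches this mapping, never the API contract.
API_TO_WAREHOUSE = {
    "orders": {
        "table": "analytics.fct_orders_v3",  # internal name, hidden from consumers
        "fields": {"order_id": "ord_pk", "status": "order_status_cd"},
    }
}

def build_query(product: str, filters: dict) -> tuple[str, list]:
    """Translate API-level filters into a parameterized warehouse query."""
    mapping = API_TO_WAREHOUSE[product]
    clauses, params = [], []
    for api_field, value in filters.items():
        column = mapping["fields"][api_field]  # unknown API fields fail fast
        clauses.append(f"{column} = %s")
        params.append(value)
    where = " AND ".join(clauses) or "TRUE"
    return f"SELECT * FROM {mapping['table']} WHERE {where}", params
```

Because the query is parameterized, consumer-supplied filter values never reach the SQL string, and renaming `order_status_cd` in the warehouse is a one-line change to the mapping.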
Pattern 3: Data Mesh Gateway (Advanced)
In a data mesh architecture, the gateway is the entry point to domain-owned data products. Each domain exposes its data through a standardized API contract; the gateway provides discovery, federation, and cross-domain lineage.
[Diagram: central gateway federating domain-owned data product APIs (discovery, auth, lineage)]
The central gateway in a data mesh context handles:
- API discovery (which data products exist and what they expose)
- Cross-domain auth (consumers auth once, gateway negotiates domain permissions)
- Lineage tracking (which consumers depend on which data products)
Harbinger Explorer is well-suited to the discovery and lineage layer in this pattern — it maintains the cross-domain dependency graph that makes data mesh governance tractable.
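The bookkeeping behind discovery and lineage can be sketched as a small in-memory registry; the product, domain, and consumer names below are made up for illustration:

```python
from collections import defaultdict

class ProductRegistry:
    """Tracks which data products exist and who consumes them."""

    def __init__(self):
        self.products = {}                 # product name -> owning domain
        self.consumers = defaultdict(set)  # product name -> consumer ids

    def register(self, name: str, domain: str) -> None:
        self.products[name] = domain

    def record_access(self, name: str, consumer_id: str) -> None:
        # Each access creates a lineage edge: consumer depends on product.
        self.consumers[name].add(consumer_id)

    def dependents(self, name: str) -> set:
        """Who breaks if this data product changes its contract?"""
        return set(self.consumers[name])

registry = ProductRegistry()
registry.register("orders", domain="sales")
registry.record_access("orders", "marketing-batch-job")
```

In production this state lives in a catalog, not in memory, but the shape of the questions it answers (what exists, who owns it, who depends on it) is the same.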
Implementation: AWS API Gateway + Lambda Authorizer
Terraform Configuration
# API Gateway for Data Platform
resource "aws_api_gateway_rest_api" "data_platform" {
name = "data-platform-api"
description = "Central API gateway for data platform products"
endpoint_configuration {
types = ["REGIONAL"]
}
policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = "*"
Action = "execute-api:Invoke"
Resource = "arn:aws:execute-api:*:*:*"
Condition = {
IpAddress = {
"aws:SourceIp" = var.allowed_cidr_ranges
}
}
}]
})
}
resource "aws_api_gateway_deployment" "data_platform" {
rest_api_id = aws_api_gateway_rest_api.data_platform.id
triggers = {
redeployment = sha1(jsonencode(aws_api_gateway_rest_api.data_platform.body))
}
lifecycle {
create_before_destroy = true
}
}
resource "aws_api_gateway_stage" "production" {
deployment_id = aws_api_gateway_deployment.data_platform.id
rest_api_id = aws_api_gateway_rest_api.data_platform.id
stage_name = "v1"
access_log_settings {
destination_arn = aws_cloudwatch_log_group.api_access_log.arn
format = jsonencode({
requestId = "$context.requestId"
sourceIp = "$context.identity.sourceIp"
requestTime = "$context.requestTime"
protocol = "$context.protocol"
httpMethod = "$context.httpMethod"
resourcePath = "$context.resourcePath"
status = "$context.status"
responseLength = "$context.responseLength"
integrationLatency = "$context.integrationLatency"
userAgent = "$context.identity.userAgent"
# Custom: data platform tracking
consumerId = "$context.authorizer.consumerId"
dataProduct = "$context.authorizer.dataProduct"
})
}
default_route_settings {
throttling_burst_limit = 100
throttling_rate_limit = 50
}
}
# Usage plans for rate limiting per consumer tier
resource "aws_api_gateway_usage_plan" "free_tier" {
name = "data-platform-free"
api_stages {
api_id = aws_api_gateway_rest_api.data_platform.id
stage = aws_api_gateway_stage.production.stage_name
}
throttle_settings {
burst_limit = 10
rate_limit = 5
}
quota_settings {
limit = 10000
period = "MONTH"
}
}
resource "aws_api_gateway_usage_plan" "professional" {
name = "data-platform-professional"
api_stages {
api_id = aws_api_gateway_rest_api.data_platform.id
stage = aws_api_gateway_stage.production.stage_name
}
throttle_settings {
burst_limit = 500
rate_limit = 100
}
quota_settings {
limit = 1000000
period = "MONTH"
}
}
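The burst_limit / rate_limit pair follows token-bucket semantics: burst is the bucket size, rate is the refill speed in requests per second. A rough sketch of that model (an approximation for reasoning about limits, not AWS's actual implementation):

```python
class TokenBucket:
    """Approximate model of burst_limit (bucket size) and rate_limit (refill)."""

    def __init__(self, rate: float, burst: int):
        self.rate, self.burst = rate, burst
        self.tokens = float(burst)  # bucket starts full
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill based on elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With the free tier's settings (rate 5, burst 10), a consumer can fire 10 requests instantly, then gets throttled until the bucket refills at 5 tokens per second.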
Lambda Authorizer for JWT Validation
# lambda_authorizer.py
import os
from typing import Optional

import jwt
from jwt import PyJWKClient

JWKS_URI = os.environ["JWKS_URI"]  # e.g. https://auth.company.com/.well-known/jwks.json
AUDIENCE = os.environ["TOKEN_AUDIENCE"]
jwks_client = PyJWKClient(JWKS_URI, cache_keys=True)
def handler(event: dict, context) -> dict:
'''Lambda authorizer: validate JWT and return IAM policy.'''
token = extract_token(event)
if not token:
raise Exception("Unauthorized")
try:
signing_key = jwks_client.get_signing_key_from_jwt(token)
payload = jwt.decode(
token,
signing_key.key,
algorithms=["RS256"],
audience=AUDIENCE
)
    except jwt.ExpiredSignatureError:
        # API Gateway only maps the exact message "Unauthorized" to a 401;
        # any other exception text becomes a 500, so log details separately.
        print("Authorization failed: token expired")
        raise Exception("Unauthorized")
    except jwt.InvalidTokenError as e:
        print(f"Authorization failed: {e}")
        raise Exception("Unauthorized")
consumer_id = payload.get("sub")
scopes = payload.get("scope", "").split()
# Map scopes to API Gateway resource permissions
policy = build_policy(consumer_id, scopes, event["methodArn"])
policy["context"] = {
"consumerId": consumer_id,
"scopes": " ".join(scopes),
"dataProduct": extract_data_product(event["methodArn"])
}
return policy
def extract_token(event: dict) -> Optional[str]:
    # TOKEN authorizers receive the raw Authorization header value in
    # "authorizationToken"; query parameters are only present for
    # REQUEST-type authorizers, hence the defensive fallback.
    auth_header = event.get("authorizationToken", "")
    if auth_header.startswith("Bearer "):
        return auth_header[7:]
    return (event.get("queryStringParameters") or {}).get("token")
def build_policy(principal: str, scopes: list, method_arn: str) -> dict:
# Parse ARN to determine which resources to allow
arn_parts = method_arn.split(":")
region = arn_parts[3]
account = arn_parts[4]
api_id = arn_parts[5].split("/")[0]
stage = arn_parts[5].split("/")[1]
allowed_resources = []
# Map scopes to allowed resource paths
scope_resource_map = {
"data:orders:read": f"arn:aws:execute-api:{region}:{account}:{api_id}/{stage}/GET/v1/orders*",
"data:customers:read": f"arn:aws:execute-api:{region}:{account}:{api_id}/{stage}/GET/v1/customers*",
"data:inventory:read": f"arn:aws:execute-api:{region}:{account}:{api_id}/{stage}/GET/v1/inventory*",
"data:admin": f"arn:aws:execute-api:{region}:{account}:{api_id}/{stage}/*/*",
}
for scope in scopes:
if scope in scope_resource_map:
allowed_resources.append(scope_resource_map[scope])
return {
"principalId": principal,
"policyDocument": {
"Version": "2012-10-17",
"Statement": [{
"Action": "execute-api:Invoke",
"Effect": "Allow" if allowed_resources else "Deny",
"Resource": allowed_resources or [method_arn]
}]
}
}
def extract_data_product(method_arn: str) -> str:
    # The method ARN path looks like <api-id>/<stage>/<METHOD>/v1/orders,
    # so the data product name is the fifth "/"-separated segment.
    parts = method_arn.split("/")
    if len(parts) >= 5:
        return parts[4]  # e.g. "orders" from /v1/orders
    return "unknown"
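The scope-to-resource mapping in build_policy is worth unit testing locally before deploying. A condensed re-sketch of that mapping, using made-up region, account, and API ids:

```python
def allowed_resources(scopes, region, account, api_id, stage):
    """Condensed version of the scope -> resource mapping in build_policy."""
    scope_map = {
        "data:orders:read": f"arn:aws:execute-api:{region}:{account}:{api_id}/{stage}/GET/v1/orders*",
        "data:admin": f"arn:aws:execute-api:{region}:{account}:{api_id}/{stage}/*/*",
    }
    return [scope_map[s] for s in scopes if s in scope_map]
```

A consumer holding no recognized scope gets an empty list, which build_policy turns into an explicit Deny rather than silently allowing anything.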
API Versioning Strategy
Data API versioning requires more careful thought than typical REST APIs because downstream consumers often run batch jobs that can't be updated instantaneously.
URL-Based Versioning (Recommended for Data APIs)
/v1/orders → stable, supported
/v2/orders → new schema, active development
/v1/orders [Sunset: 2025-09-01] → deprecated, add Sunset header
Always include Sunset and Deprecation headers for deprecated versions:
# FastAPI: data product endpoint with versioning headers
from typing import Optional

from fastapi import FastAPI, Response
from datetime import datetime, timezone
app = FastAPI()
@app.get("/v1/orders")
async def get_orders_v1(response: Response):
# V1 is deprecated — add sunset headers
response.headers["Deprecation"] = "true"
response.headers["Sunset"] = "Sat, 01 Sep 2025 00:00:00 GMT"
response.headers["Link"] = '</v2/orders>; rel="successor-version"'
# Return V1 schema (legacy format)
return {"orders": [], "total": 0, "page": 1}
@app.get("/v2/orders")
async def get_orders_v2(
response: Response,
    date_from: Optional[str] = None,
    date_to: Optional[str] = None,
    status: Optional[str] = None,
    limit: int = 100,
    cursor: Optional[str] = None
):
# V2: cursor-based pagination, ISO dates, richer filtering
response.headers["X-Data-Version"] = "2.0"
return {
"data": [],
"pagination": {
"cursor": None,
"has_more": False,
"limit": limit
},
"meta": {
            "generated_at": datetime.now(timezone.utc).isoformat(),
"data_freshness_seconds": 30
}
}
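On the consumer side, batch jobs should surface these headers instead of discovering a removal the hard way at sunset time. A sketch of a client-side check (the header values shown in the test are illustrative):

```python
import warnings
from email.utils import parsedate_to_datetime

def check_deprecation(headers: dict, endpoint: str) -> None:
    """Warn with the sunset date when a response marks the API deprecated."""
    if headers.get("Deprecation"):
        sunset = headers.get("Sunset")
        when = parsedate_to_datetime(sunset).date() if sunset else "unspecified"
        warnings.warn(f"{endpoint} is deprecated; sunset {when}")
```

Routing these warnings into job logs or alerting gives teams the lead time the Sunset header was designed to provide.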
Rate Limiting Patterns
Kong Gateway Configuration
For teams using Kong as their gateway layer:
# Kong declarative config (deck format)
services:
- name: orders-data-product
url: http://orders-service.data-platform.svc.cluster.local:8080
routes:
- name: orders-api
paths:
- /v1/orders
- /v2/orders
methods:
- GET
plugins:
- name: rate-limiting
config:
minute: 60
hour: 1000
policy: redis
redis_host: redis.infra.svc.cluster.local
redis_port: 6379
redis_database: 1
limit_by: consumer
- name: jwt
config:
secret_is_base64: false
claims_to_verify:
- exp
- nbf
  - name: request-transformer
    config:
      add:
        headers:
          # Kong's auth plugins already append X-Consumer-ID upstream for
          # authenticated consumers; only a gateway marker is added here.
          - "X-Data-Platform-Gateway:true"
- name: response-transformer
config:
add:
headers:
- "X-Rate-Limit-Info:see X-RateLimit-* headers"
- name: http-log
config:
http_endpoint: http://audit-log.data-platform.svc.cluster.local/api/access
method: POST
content_type: application/json
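Whichever gateway enforces the limits, clients should treat a 429 as a signal to back off, preferring the server's Retry-After header over blind exponential backoff. A stdlib-only sketch (the retry loop and URL handling are illustrative, not a hardened client):

```python
import time
import urllib.error
import urllib.request
from typing import Optional

def backoff_seconds(attempt: int, retry_after: Optional[str], cap: float = 60.0) -> float:
    """Honor Retry-After when present, else exponential backoff, capped."""
    if retry_after is not None:
        return min(float(retry_after), cap)
    return min(2.0 ** attempt, cap)

def get_with_retries(url: str, max_attempts: int = 5):
    for attempt in range(max_attempts):
        try:
            return urllib.request.urlopen(url)
        except urllib.error.HTTPError as e:
            if e.code != 429:
                raise  # only quota errors are retryable here
            time.sleep(backoff_seconds(attempt, e.headers.get("Retry-After")))
    raise RuntimeError("still rate limited after retries")
```

Keeping the wait calculation a pure function makes the retry policy trivial to unit test without any network calls.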
Observability for Data APIs
A data API gateway should emit signals that answer: who is consuming what data, how fast, and with what freshness?
Key metrics to track:
- Request rate per consumer — identify heavy users before they hit quotas
- Latency p95/p99 per endpoint — data queries have long tails; median is misleading
- Cache hit rate — poor hit rates mean expensive warehouse queries on every request
- Error rate by type — 429s (quota) vs 503s (upstream unavailable) need different responses
- Data freshness of served responses — critical for consumers who need near-real-time data
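The latency metrics fall out of the access logs directly. A sketch computing nearest-rank p95/p99 per endpoint from records shaped like the access log format in the Terraform configuration above:

```python
import math
from collections import defaultdict

def latency_percentiles(records, pcts=(95, 99)):
    """Nearest-rank percentiles of integrationLatency, grouped by resourcePath."""
    by_endpoint = defaultdict(list)
    for r in records:
        by_endpoint[r["resourcePath"]].append(int(r["integrationLatency"]))
    result = {}
    for path, samples in by_endpoint.items():
        samples.sort()
        result[path] = {
            f"p{p}": samples[math.ceil(p / 100 * len(samples)) - 1] for p in pcts
        }
    return result
```

In practice this query runs in CloudWatch Logs Insights or the warehouse rather than in Python, but the grouping and percentile logic is the same.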
Combining gateway metrics with data platform observability (table freshness, pipeline health) in a unified view — as Harbinger Explorer provides — gives teams the full picture from raw ingestion through to API consumption.
Summary
API gateways for data platforms aren't just a security checkbox — they're the foundation of a governed, scalable data serving layer. The patterns that work in production:
- Transformation gateways decouple API contracts from warehouse internals — always worth the investment
- Scope-based authorization with JWT is more flexible than API keys for complex permission models
- URL versioning with Sunset headers gives downstream consumers a reliable deprecation signal
- Data mesh gateways federating domain APIs work best when backed by a data catalog for discovery
- Kong or AWS API Gateway for rate limiting — don't build this yourself
Try Harbinger Explorer free for 7 days — track API consumption patterns across your data products, get visibility into consumer dependencies, and manage data API governance at scale.