API Gateway Architecture Patterns for Data Platforms

Tags: api-gateway, data-platform, data-mesh, rest-api, rate-limiting, platform-engineering


Data platforms have traditionally served data through direct warehouse connections, JDBC endpoints, and blob storage presigned URLs. As organizations mature, these ad-hoc access patterns create governance nightmares: who has access to what, how much are they querying, and is the API contract stable enough for downstream teams to depend on?

API gateways solve this by inserting a managed control plane between data consumers and data systems. This guide covers the patterns that work in production for data platform teams.


Why Data Platforms Need API Gateways

The case for gateways isn't just about security theatre. It's about operational capability:

| Problem Without Gateway | Gateway Solution |
| --- | --- |
| Direct warehouse connections hit quota limits | Rate limiting per consumer |
| No visibility into who's querying what | Centralized access logging |
| Consumers tightly coupled to warehouse internals | Schema abstraction layer |
| Auth handled ad-hoc per service | Centralized OAuth/API key auth |
| No way to deprecate old schemas safely | API versioning + sunset headers |
| Cross-team data access negotiated manually | Self-service data product APIs |

Gateway Architecture Patterns

Pattern 1: Passthrough Gateway (Simple)

The simplest pattern: gateway handles auth and rate limiting, passes requests directly to the warehouse or data service.


Good for: small teams, internal APIs, early-stage data products.

Limitation: tightly couples API schema to warehouse schema — any warehouse refactor breaks consumers.

Pattern 2: Transformation Gateway (Recommended)

The gateway applies a transformation layer: incoming REST requests are translated into warehouse queries, responses are shaped before returning.


This pattern enables stable API contracts that survive warehouse refactors. The transformation layer owns the mapping between API schema (what consumers see) and warehouse schema (what actually stores the data).
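As an illustration, the mapping the transformation layer owns can be as small as a dictionary plus two functions: one translating validated API filters into a parameterized warehouse query, and one renaming columns back on the way out. This is a minimal sketch; the `orders_fact` table and its column names are hypothetical.

```python
# Minimal sketch of a transformation layer. The API schema (what consumers
# see) maps to the warehouse schema (what stores the data); a warehouse
# refactor only touches this mapping. All names here are hypothetical.

# API field -> warehouse column: the contract the transformation layer owns
API_TO_WAREHOUSE = {
    "order_id": "order_key",
    "placed_at": "order_ts",
    "status": "status_code",
}

def build_query(filters: dict, limit: int = 100) -> tuple:
    """Translate validated API filters into a parameterized warehouse query."""
    clauses, params = [], []
    for api_field, value in filters.items():
        clauses.append(f"{API_TO_WAREHOUSE[api_field]} = %s")
        params.append(value)
    where = f" WHERE {' AND '.join(clauses)}" if clauses else ""
    params.append(limit)
    return f"SELECT * FROM orders_fact{where} LIMIT %s", params

def shape_response(rows: list) -> list:
    """Rename warehouse columns back to the stable API schema on the way out."""
    reverse = {col: field for field, col in API_TO_WAREHOUSE.items()}
    return [{reverse.get(k, k): v for k, v in row.items()} for row in rows]
```

Because consumers only ever see the left-hand names, renaming `status_code` in the warehouse is a one-line change here rather than a breaking change for every downstream team.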

Pattern 3: Data Mesh Gateway (Advanced)

In a data mesh architecture, the gateway is the entry point to domain-owned data products. Each domain exposes its data through a standardized API contract; the gateway provides discovery, federation, and cross-domain lineage.


The central gateway in a data mesh context handles:

  • API discovery (which data products exist and what they expose)
  • Cross-domain auth (consumers auth once, gateway negotiates domain permissions)
  • Lineage tracking (which consumers depend on which data products)
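To make the discovery and lineage responsibilities concrete, here is a minimal in-memory sketch of the bookkeeping such a gateway maintains. A real deployment would persist this in a data catalog; the class and method names are illustrative.

```python
# Sketch of the gateway's discovery and lineage bookkeeping (in-memory for
# illustration; production would back this with a catalog).
from collections import defaultdict
from typing import Optional

class DataProductRegistry:
    def __init__(self):
        self.products = {}                 # product name -> metadata
        self.consumers = defaultdict(set)  # product name -> consumer ids

    def register(self, name: str, domain: str, base_path: str) -> None:
        """Domain teams register their data product API with the gateway."""
        self.products[name] = {"domain": domain, "base_path": base_path}

    def discover(self, domain: Optional[str] = None) -> list:
        """API discovery: which data products exist, optionally by domain."""
        return sorted(
            name for name, meta in self.products.items()
            if domain is None or meta["domain"] == domain
        )

    def record_access(self, product: str, consumer_id: str) -> None:
        """Called per request: grows the consumer -> product lineage graph."""
        self.consumers[product].add(consumer_id)

    def dependents(self, product: str) -> set:
        """Lineage query: which consumers break if this product changes?"""
        return set(self.consumers[product])
```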

Harbinger Explorer is well-suited to the discovery and lineage layer in this pattern — it maintains the cross-domain dependency graph that makes data mesh governance tractable.


Implementation: AWS API Gateway + Lambda Authorizer

Terraform Configuration

# API Gateway for Data Platform
resource "aws_api_gateway_rest_api" "data_platform" {
  name        = "data-platform-api"
  description = "Central API gateway for data platform products"

  endpoint_configuration {
    types = ["REGIONAL"]
  }

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = "*"
      Action    = "execute-api:Invoke"
      Resource  = "arn:aws:execute-api:*:*:*"
      Condition = {
        IpAddress = {
          "aws:SourceIp" = var.allowed_cidr_ranges
        }
      }
    }]
  })
}

resource "aws_api_gateway_deployment" "data_platform" {
  rest_api_id = aws_api_gateway_rest_api.data_platform.id

  triggers = {
    redeployment = sha1(jsonencode(aws_api_gateway_rest_api.data_platform.body))
  }

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_api_gateway_stage" "production" {
  deployment_id = aws_api_gateway_deployment.data_platform.id
  rest_api_id   = aws_api_gateway_rest_api.data_platform.id
  stage_name    = "v1"

  access_log_settings {
    destination_arn = aws_cloudwatch_log_group.api_access_log.arn
    format = jsonencode({
      requestId      = "$context.requestId"
      sourceIp       = "$context.identity.sourceIp"
      requestTime    = "$context.requestTime"
      protocol       = "$context.protocol"
      httpMethod     = "$context.httpMethod"
      resourcePath   = "$context.resourcePath"
      routeKey       = "$context.routeKey"
      status         = "$context.status"
      responseLength = "$context.responseLength"
      integrationLatency = "$context.integrationLatency"
      userAgent      = "$context.identity.userAgent"
      # Custom: data platform tracking
      consumerId     = "$context.authorizer.consumerId"
      dataProduct    = "$context.authorizer.dataProduct"
    })
  }

}

# Stage-wide throttling for REST APIs is configured via method settings
# (default_route_settings belongs to HTTP APIs / apigatewayv2 and is not
# valid on aws_api_gateway_stage)
resource "aws_api_gateway_method_settings" "all" {
  rest_api_id = aws_api_gateway_rest_api.data_platform.id
  stage_name  = aws_api_gateway_stage.production.stage_name
  method_path = "*/*"

  settings {
    throttling_burst_limit = 100
    throttling_rate_limit  = 50
  }
}

# Usage plans for rate limiting per consumer tier
resource "aws_api_gateway_usage_plan" "free_tier" {
  name = "data-platform-free"

  api_stages {
    api_id = aws_api_gateway_rest_api.data_platform.id
    stage  = aws_api_gateway_stage.production.stage_name
  }

  throttle_settings {
    burst_limit = 10
    rate_limit  = 5
  }

  quota_settings {
    limit  = 10000
    period = "MONTH"
  }
}

resource "aws_api_gateway_usage_plan" "professional" {
  name = "data-platform-professional"

  api_stages {
    api_id = aws_api_gateway_rest_api.data_platform.id
    stage  = aws_api_gateway_stage.production.stage_name
  }

  throttle_settings {
    burst_limit = 500
    rate_limit  = 100
  }

  quota_settings {
    limit  = 1000000
    period = "MONTH"
  }
}
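A usage plan only throttles consumers that are associated with it. A minimal sketch of attaching an API key to the professional plan (the key name is illustrative):

```hcl
# Hypothetical consumer: associate an API key with the professional plan
resource "aws_api_gateway_api_key" "analytics_team" {
  name = "analytics-team"
}

resource "aws_api_gateway_usage_plan_key" "analytics_team" {
  key_id        = aws_api_gateway_api_key.analytics_team.id
  key_type      = "API_KEY"
  usage_plan_id = aws_api_gateway_usage_plan.professional.id
}
```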

Lambda Authorizer for JWT Validation

# lambda_authorizer.py
import os
from typing import Optional

import jwt
from jwt import PyJWKClient

JWKS_URI = os.environ["JWKS_URI"]  # e.g. https://auth.company.com/.well-known/jwks.json
AUDIENCE = os.environ["TOKEN_AUDIENCE"]

jwks_client = PyJWKClient(JWKS_URI, cache_keys=True)

def handler(event: dict, context) -> dict:
    '''Lambda authorizer: validate JWT and return IAM policy.'''
    token = extract_token(event)
    
    if not token:
        raise Exception("Unauthorized")
    
    try:
        signing_key = jwks_client.get_signing_key_from_jwt(token)
        payload = jwt.decode(
            token,
            signing_key.key,
            algorithms=["RS256"],
            audience=AUDIENCE
        )
    except jwt.ExpiredSignatureError:
        # API Gateway maps the literal message "Unauthorized" to a 401;
        # any other exception message from a TOKEN authorizer becomes a 500
        raise Exception("Unauthorized")
    except jwt.InvalidTokenError:
        raise Exception("Unauthorized")
    
    consumer_id = payload.get("sub")
    scopes = payload.get("scope", "").split()
    
    # Map scopes to API Gateway resource permissions
    policy = build_policy(consumer_id, scopes, event["methodArn"])
    policy["context"] = {
        "consumerId": consumer_id,
        "scopes": " ".join(scopes),
        "dataProduct": extract_data_product(event["methodArn"])
    }
    
    return policy


def extract_token(event: dict) -> Optional[str]:
    auth_header = event.get("authorizationToken", "")
    if auth_header.startswith("Bearer "):
        return auth_header[7:]
    # Fallback for token-in-query-string; the key can be present but null,
    # so guard with `or {}` rather than a .get() default
    return (event.get("queryStringParameters") or {}).get("token")


def build_policy(principal: str, scopes: list, method_arn: str) -> dict:
    # Parse ARN to determine which resources to allow
    arn_parts = method_arn.split(":")
    region = arn_parts[3]
    account = arn_parts[4]
    api_id = arn_parts[5].split("/")[0]
    stage = arn_parts[5].split("/")[1]
    
    allowed_resources = []
    
    # Map scopes to allowed resource paths
    scope_resource_map = {
        "data:orders:read": f"arn:aws:execute-api:{region}:{account}:{api_id}/{stage}/GET/v1/orders*",
        "data:customers:read": f"arn:aws:execute-api:{region}:{account}:{api_id}/{stage}/GET/v1/customers*",
        "data:inventory:read": f"arn:aws:execute-api:{region}:{account}:{api_id}/{stage}/GET/v1/inventory*",
        "data:admin": f"arn:aws:execute-api:{region}:{account}:{api_id}/{stage}/*/*",
    }
    
    for scope in scopes:
        if scope in scope_resource_map:
            allowed_resources.append(scope_resource_map[scope])
    
    return {
        "principalId": principal,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "execute-api:Invoke",
                "Effect": "Allow" if allowed_resources else "Deny",
                "Resource": allowed_resources or [method_arn]
            }]
        }
    }


def extract_data_product(method_arn: str) -> str:
    # The ARN path after the stage looks like METHOD/v1/orders, so the
    # data product name is the segment after the version prefix
    parts = method_arn.split("/")
    if len(parts) >= 5:
        return parts[4]  # e.g. "orders" from .../GET/v1/orders
    return "unknown"

API Versioning Strategy

Data API versioning requires more careful thought than typical REST APIs because downstream consumers often run batch jobs that can't be updated instantaneously.

URL-Based Versioning (Recommended for Data APIs)

/v1/orders          → stable, supported
/v2/orders          → new schema, active development
/v1/orders [Sunset: 2025-09-01] → deprecated, add Sunset header

Always include Sunset and Deprecation headers for deprecated versions:

# FastAPI: data product endpoint with versioning headers
from fastapi import FastAPI, Response
from datetime import datetime

app = FastAPI()

@app.get("/v1/orders")
async def get_orders_v1(response: Response):
    # V1 is deprecated — add sunset headers
    response.headers["Deprecation"] = "true"
    response.headers["Sunset"] = "Mon, 01 Sep 2025 00:00:00 GMT"
    response.headers["Link"] = '</v2/orders>; rel="successor-version"'
    
    # Return V1 schema (legacy format)
    return {"orders": [], "total": 0, "page": 1}

@app.get("/v2/orders")
async def get_orders_v2(
    response: Response,
    date_from: str | None = None,
    date_to: str | None = None,
    status: str | None = None,
    limit: int = 100,
    cursor: str | None = None
):
    # V2: cursor-based pagination, ISO dates, richer filtering
    response.headers["X-Data-Version"] = "2.0"
    
    return {
        "data": [],
        "pagination": {
            "cursor": None,
            "has_more": False,
            "limit": limit
        },
        "meta": {
            "generated_at": datetime.utcnow().isoformat(),
            "data_freshness_seconds": 30
        }
    }
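On the consumer side, batch jobs can check for these headers on every run instead of discovering a removed version at failure time. A minimal sketch, assuming the Sunset value is an HTTP-date as in RFC 8594:

```python
# Sketch of a consumer-side deprecation check: batch jobs call this with the
# response headers and alert while there is still time to migrate.
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

def days_until_sunset(headers: dict, now: Optional[datetime] = None) -> Optional[int]:
    """Days until the endpoint's Sunset date, or None if not deprecated."""
    sunset = headers.get("Sunset")
    if not sunset:
        return None
    now = now or datetime.now(timezone.utc)
    return (parsedate_to_datetime(sunset) - now).days
```

A scheduler can then page the owning team when the remaining window drops below, say, 30 days.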

Rate Limiting Patterns

Kong Gateway Configuration

For teams using Kong as their gateway layer:

# Kong declarative config (deck format)
services:
  - name: orders-data-product
    url: http://orders-service.data-platform.svc.cluster.local:8080
    routes:
      - name: orders-api
        paths:
          - /v1/orders
          - /v2/orders
        methods:
          - GET
    plugins:
      - name: rate-limiting
        config:
          minute: 60
          hour: 1000
          policy: redis
          redis_host: redis.infra.svc.cluster.local
          redis_port: 6379
          redis_database: 1
          limit_by: consumer
          
      - name: jwt
        config:
          secret_is_base64: false
          claims_to_verify:
            - exp
            - nbf
            
      - name: request-transformer
        config:
          add:
            headers:
              - "X-Consumer-ID:$(consumer.id)"
              - "X-Data-Platform-Gateway:true"
              
      - name: response-transformer
        config:
          add:
            headers:
              - "X-Rate-Limit-Info:see X-RateLimit-* headers"
              
      - name: http-log
        config:
          http_endpoint: http://audit-log.data-platform.svc.cluster.local/api/access
          method: POST
          content_type: application/json

Observability for Data APIs

A data API gateway should emit signals that answer: who is consuming what data, how fast, and with what freshness?

Key metrics to track:

  • Request rate per consumer — identify heavy users before they hit quotas
  • Latency p95/p99 per endpoint — data queries have long tails; median is misleading
  • Cache hit rate — poor hit rates mean expensive warehouse queries on every request
  • Error rate by type — 429s (quota) vs 503s (upstream unavailable) need different responses
  • Data freshness of served responses — critical for consumers who need near-real-time data
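Several of these signals fall out of the access log the gateway already emits. A sketch of offline aggregation, assuming records shaped like the JSON log format in the Terraform stage config and using the simple nearest-rank percentile:

```python
# Sketch: aggregate gateway access-log records into per-endpoint p95 latency
# and per-consumer request counts. Record fields match the access_log_settings
# format above; the percentile method is a deliberate simplification.
import math
from collections import Counter, defaultdict

def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile; values need not be pre-sorted."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def summarize(records: list) -> dict:
    latency_by_path = defaultdict(list)
    requests_by_consumer = Counter()
    for rec in records:
        latency_by_path[rec["resourcePath"]].append(int(rec["integrationLatency"]))
        requests_by_consumer[rec["consumerId"]] += 1
    return {
        "p95_latency_ms": {p: percentile(v, 95) for p, v in latency_by_path.items()},
        "requests_by_consumer": dict(requests_by_consumer),
    }
```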

Combining gateway metrics with data platform observability (table freshness, pipeline health) in a unified view — as Harbinger Explorer provides — gives teams the full picture from raw ingestion through to API consumption.


Summary

API gateways for data platforms aren't just a security checkbox — they're the foundation of a governed, scalable data serving layer. The patterns that work in production:

  1. Transformation gateways decouple API contracts from warehouse internals — always worth the investment
  2. Scope-based authorization with JWT is more flexible than API keys for complex permission models
  3. URL versioning with Sunset headers gives downstream consumers a reliable deprecation signal
  4. Data mesh gateways federating domain APIs work best when backed by a data catalog for discovery
  5. Kong or AWS API Gateway for rate limiting — don't build this yourself

Try Harbinger Explorer free for 7 days — track API consumption patterns across your data products, get visibility into consumer dependencies, and manage data API governance at scale.

