GDPR Compliance for Cloud Data Platforms: A Technical Deep Dive
Building cloud data platforms that are both powerful and GDPR-compliant is one of the most nuanced engineering challenges of our era. The regulation isn't just a legal checkbox — it fundamentally shapes how you architect data pipelines, choose cloud services, and manage the lifecycle of personal data. This guide walks through the technical realities of achieving GDPR compliance in modern cloud data stacks, complete with infrastructure-as-code examples and reference architectures.
Why GDPR Is an Engineering Problem
Most teams treat GDPR as a legal problem and hand it off to compliance teams. That's a mistake. At its core, GDPR is about data architecture:
- Article 5 — Data minimisation, purpose limitation, storage limitation
- Article 17 — Right to erasure ("right to be forgotten")
- Article 20 — Data portability
- Article 25 — Data protection by design and by default
- Article 32 — Security of processing (encryption, pseudonymisation)
- Article 35 — Data Protection Impact Assessments (DPIA)
Each of these has direct implications for how you design your ingestion layers, storage, access control, and APIs. Engineers own this.
Reference Architecture: GDPR-Compliant Cloud Data Platform
A reference architecture for a GDPR-compliant data platform on AWS or GCP is built around the following tenets:
Key Architectural Tenets
- PII never enters the raw zone unclassified
- Pseudonymisation tokens are the only reference to PII in analytics
- The PII Vault is the single source of truth for personal data
- All access is logged immutably
- Erasure is automated and verifiable
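The token-vault tenets above can be sketched in a few lines of plain Python. This is a conceptual sketch, not a cloud SDK: `PiiVault` is an illustrative name, and an in-memory dict stands in for the real vault bucket.

```python
import secrets

class PiiVault:
    """In-memory stand-in for the PII Vault: token -> raw PII.

    Analytics systems only ever see the token; erasure means
    dropping the vault entry so the token resolves to nothing.
    """
    def __init__(self):
        self._store = {}

    def pseudonymise(self, raw_value: str) -> str:
        token = secrets.token_hex(16)   # opaque, random reference
        self._store[token] = raw_value  # PII lives only in the vault
        return token

    def resolve(self, token: str):
        return self._store.get(token)   # None after erasure

    def erase(self, token: str) -> None:
        self._store.pop(token, None)    # Art. 17: forget the subject

vault = PiiVault()
token = vault.pseudonymise("alice@example.com")
# The analytics zone stores only `token`; joining back requires vault access.
assert vault.resolve(token) == "alice@example.com"
vault.erase(token)
assert vault.resolve(token) is None
```

Because analytics tables hold only tokens, erasure never touches the analytics zone: deleting one vault entry severs every downstream reference at once.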
Terraform: Building the Compliance Infrastructure
Let's look at concrete Terraform for a GCP-based compliant data platform.
1. Encrypted Storage Buckets with Data Retention Policies
```hcl
resource "google_storage_bucket" "raw_zone" {
  name          = "${var.project_id}-raw-zone"
  location      = "EU"
  storage_class = "STANDARD"

  # Enforce encryption at rest with CMEK
  encryption {
    default_kms_key_name = google_kms_crypto_key.data_key.id
  }

  # Enforce retention — storage limitation (Art. 5)
  retention_policy {
    is_locked        = true
    retention_period = 7776000 # 90 days in seconds
  }

  # Prevent public access
  uniform_bucket_level_access = true

  # Versioning for audit trail
  versioning {
    enabled = true
  }

  lifecycle_rule {
    condition {
      age = 90
    }
    action {
      type = "Delete"
    }
  }
}

resource "google_kms_key_ring" "gdpr_ring" {
  name     = "gdpr-keyring"
  location = "europe-west3"
}

resource "google_kms_crypto_key" "data_key" {
  name            = "gdpr-data-key"
  key_ring        = google_kms_key_ring.gdpr_ring.id
  rotation_period = "7776000s" # 90-day rotation

  lifecycle {
    prevent_destroy = true
  }
}
```
2. IAM: Least-Privilege Access (Art. 25 — Privacy by Design)
```hcl
# Data Engineer role — can read pseudonymised data only
resource "google_project_iam_custom_role" "data_engineer" {
  role_id     = "dataEngineerGDPR"
  title       = "Data Engineer (GDPR Compliant)"
  description = "Access to pseudonymised zones only — no PII Vault"
  permissions = [
    "bigquery.tables.getData",
    "bigquery.tables.list",
    "bigquery.jobs.create",
    "storage.objects.get",
    "storage.objects.list",
  ]
}

# PII Vault access — restricted to compliance service accounts only
resource "google_storage_bucket_iam_binding" "pii_vault_access" {
  bucket = google_storage_bucket.pii_vault.name
  role   = "roles/storage.objectViewer"
  members = [
    "serviceAccount:${google_service_account.erasure_service.email}",
    "serviceAccount:${google_service_account.portability_service.email}",
  ]
}

# Deny PII Vault object reads for everyone except the compliance service
# accounts. IAM deny policies attach at the project level (Terraform has
# no bucket-level deny resource) and use service-qualified permission names.
resource "google_iam_deny_policy" "pii_vault_deny" {
  name   = "pii-vault-deny"
  parent = urlencode("cloudresourcemanager.googleapis.com/projects/${var.project_id}")

  rules {
    deny_rule {
      denied_principals = ["principalSet://goog/public:all"]
      exception_principals = [
        "principal://iam.googleapis.com/projects/-/serviceAccounts/${google_service_account.erasure_service.email}",
        "principal://iam.googleapis.com/projects/-/serviceAccounts/${google_service_account.portability_service.email}",
      ]
      denied_permissions = ["storage.googleapis.com/objects.get"]
    }
  }
}
```
3. VPC Service Controls — Data Exfiltration Prevention
```hcl
resource "google_access_context_manager_service_perimeter" "gdpr_perimeter" {
  parent = "accessPolicies/${var.access_policy_id}"
  name   = "accessPolicies/${var.access_policy_id}/servicePerimeters/gdpr_perimeter"
  title  = "GDPR Data Perimeter"

  status {
    resources = [
      "projects/${var.project_number}",
    ]
    restricted_services = [
      "bigquery.googleapis.com",
      "storage.googleapis.com",
      "dataflow.googleapis.com",
    ]

    ingress_policies {
      ingress_from {
        # List identities explicitly; identity_type is reserved for broad
        # classes such as ANY_SERVICE_ACCOUNT and is omitted here.
        identities = ["serviceAccount:${var.pipeline_sa}"]
      }
      ingress_to {
        resources = ["*"]
        operations {
          service_name = "bigquery.googleapis.com"
          method_selectors {
            method = "BigQueryStorage.ReadRows"
          }
        }
      }
    }
  }
}
```
Kubernetes: Deploying the Pseudonymisation Service
The pseudonymisation service is the heart of your GDPR architecture. Here's the Kubernetes manifest:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pseudonymisation-service
  namespace: gdpr-compliance
  labels:
    app: pseudonymisation-service
    gdpr-component: "true"
spec:
  replicas: 3
  selector:
    matchLabels:
      app: pseudonymisation-service
  template:
    metadata:
      labels:
        app: pseudonymisation-service
      annotations:
        # Force pod restart on key rotation
        secret-hash: "${SHA256_OF_KEY}"
    spec:
      serviceAccountName: pseudonymisation-sa
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001
        fsGroup: 10001
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: pseudonymisation
          image: gcr.io/${PROJECT_ID}/pseudonymisation-service:1.4.2
          ports:
            - containerPort: 8080
          env:
            - name: KMS_KEY_NAME
              valueFrom:
                secretKeyRef:
                  name: gdpr-secrets
                  key: kms-key-name
            - name: VAULT_BUCKET
              value: ${PII_VAULT_BUCKET}
            - name: AUDIT_LOG_TOPIC
              value: projects/${PROJECT_ID}/topics/gdpr-audit
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: pseudonymisation-netpol
  namespace: gdpr-compliance
spec:
  podSelector:
    matchLabels:
      app: pseudonymisation-service
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: data-pipeline
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to: [] # Only KMS and GCS via VPC SC
      ports:
        - protocol: TCP
          port: 443
```
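Internally, a service like this commonly derives deterministic tokens with a keyed hash, so the same identifier always maps to the same token without a lookup table. A minimal sketch of that core logic, assuming the key is fetched from Cloud KMS at startup (a local byte string stands in for it here):

```python
import hmac
import hashlib

# In production this key would be fetched from Cloud KMS and held only
# in memory; a hard-coded placeholder stands in for it in this sketch.
PSEUDONYMISATION_KEY = b"replace-with-kms-managed-key"

def pseudonymise(value: str, key: bytes = PSEUDONYMISATION_KEY) -> str:
    """Deterministic HMAC-SHA256 token: same input yields the same token,
    so joins across tables still work, but reversing the token is
    infeasible without the key (deleting the key crypto-shreds the link)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

token_a = pseudonymise("alice@example.com")
token_b = pseudonymise("alice@example.com")
assert token_a == token_b                      # deterministic: joins still work
assert token_a != pseudonymise("bob@example.com")
```

Determinism is the design choice here: random tokens (as in a pure vault) maximise unlinkability but require a lookup on every ingest, while HMAC tokens keep analytics joins cheap at the cost of tying erasure to key destruction.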
Data Catalog: Classifying PII at Ingestion
A crucial part of GDPR compliance is knowing what data you have. Use a YAML-based data catalog that feeds your classification engine:
```yaml
# data-catalog/schemas/user_events.yaml
schema:
  name: user_events
  version: "2.1"
  gdpr_classification: "personal_data"
  dpia_required: true
  retention_days: 90
  legal_basis: "legitimate_interest"
  fields:
    - name: event_id
      type: UUID
      pii: false
    - name: user_id
      type: string
      pii: true
      pii_category: "indirect_identifier"
      pseudonymisation: "token_replace"
      vault_key: "user_tokens"
    - name: email
      type: string
      pii: true
      pii_category: "contact_data"
      pseudonymisation: "hash_hmac_sha256"
      erasable: true
    - name: ip_address
      type: string
      pii: true
      pii_category: "online_identifier"
      pseudonymisation: "ip_masking"
      masking_strategy: "last_octet"
    - name: event_type
      type: string
      pii: false
    - name: timestamp
      type: timestamp
      pii: false
      retention_trigger: true
```
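A classification engine can consume this schema directly at ingestion. The sketch below shows the dispatch pattern under simplified assumptions: field rules are inlined as a dict rather than parsed from YAML, the strategy names mirror the catalog above, and unknown PII fields fail closed (dropped rather than leaked).

```python
import hmac
import hashlib

# Field rules as they would be parsed from the catalog YAML above.
FIELD_RULES = {
    "email":      {"pii": True, "strategy": "hash_hmac_sha256"},
    "ip_address": {"pii": True, "strategy": "ip_masking"},
    "event_type": {"pii": False},
}

KEY = b"kms-managed-key-placeholder"  # stand-in for a KMS-held key

def hash_hmac_sha256(value: str) -> str:
    return hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()

def ip_masking(value: str) -> str:
    """last_octet strategy: zero the final octet of an IPv4 address."""
    octets = value.split(".")
    return ".".join(octets[:3] + ["0"])

STRATEGIES = {"hash_hmac_sha256": hash_hmac_sha256, "ip_masking": ip_masking}

def classify_and_mask(record: dict) -> dict:
    out = {}
    for field, value in record.items():
        rule = FIELD_RULES.get(field, {"pii": True, "strategy": None})
        if not rule["pii"]:
            out[field] = value                          # non-PII passes through
        elif rule.get("strategy") in STRATEGIES:
            out[field] = STRATEGIES[rule["strategy"]](value)
        else:
            out[field] = None                           # unknown PII: drop, don't leak
    return out

masked = classify_and_mask({
    "email": "alice@example.com",
    "ip_address": "203.0.113.42",
    "event_type": "page_view",
})
assert masked["ip_address"] == "203.0.113.0"
assert masked["event_type"] == "page_view"
assert masked["email"] != "alice@example.com"
```

The fail-closed default matters: a field missing from the catalog is treated as PII with no approved strategy, so it never reaches the analytics zone unmasked.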
Comparison: Cloud Provider GDPR Tooling
| Feature | AWS | GCP | Azure |
|---|---|---|---|
| Data Residency | Region-specific S3, RDS | Regional GCS, BigQuery | Geo-restricted Azure Storage |
| CMEK Support | AWS KMS + SSE-KMS | Cloud KMS + CMEK | Azure Key Vault + CMK |
| Data Classification | Amazon Macie | Cloud DLP API | Azure Purview |
| Audit Logging | CloudTrail | Cloud Audit Logs | Azure Monitor + Activity Log |
| Data Erasure | Manual + Lambda | Cloud DLP deidentify | Azure Data Subject Requests |
| VPC Isolation | VPC + PrivateLink | VPC SC + Private Service Connect | VNet + Private Endpoints |
| PII Detection | Macie (S3 only) | DLP (text, images, structured) | Purview (broad but slower) |
| Compliance Reports | AWS Artifact | Compliance Reports Manager | Microsoft Service Trust Portal |
| SCCs / Org Policies | Service Control Policies | Organization Policies | Azure Policy |
| EU Data Boundary | ✅ AWS EU Boundary | ✅ GCP EU Boundary | ✅ Azure EU Boundary |
Verdict: GCP's Cloud DLP API has the most mature automated PII detection. AWS Macie is S3-only but deeply integrated. Azure Purview is catching up but remains complex to configure.
Implementing the Right to Erasure (Art. 17)
The right to be forgotten is technically the hardest GDPR requirement in data platforms. Here's a practical approach:
Erasure Workflow
An erasure request typically flows through a queue into an orchestrator that resolves the data subject's tokens, applies one of the strategies below across every storage zone, and writes an immutable audit record confirming completion.
Key Erasure Strategies
Strategy 1 — Token Invalidation (recommended for analytics): don't delete records in BigQuery; instead, invalidate the pseudonymisation token. All analytics referencing that user_id then resolve to NULL, with no table scans needed.
Strategy 2 — Crypto Shredding: encrypt each subject's data with a user-specific key stored separately; deleting the key makes all of that data unreadable. Works well for object storage.
Strategy 3 — Tombstoning: mark records as deleted in a deletion log table and filter every query through that log. Simple, but adds query overhead.
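Crypto shredding can be sketched in a few lines. This is an in-memory illustration only: the XOR keystream is a toy cipher standing in for real encryption, and production code would use AES-GCM with per-user keys held in a KMS.

```python
import hashlib
import secrets

class CryptoShredStore:
    """Crypto-shredding sketch: each user's data is encrypted under a
    per-user key held separately; deleting the key renders the data
    permanently unreadable. The SHA-256/XOR keystream below is a toy
    construction for illustration, not production cryptography."""

    def __init__(self):
        self._keys = {}   # user_id -> key (the only copy anywhere)
        self._data = {}   # user_id -> ciphertext

    def _keystream(self, key: bytes, n: int) -> bytes:
        out = b""
        counter = 0
        while len(out) < n:
            out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
            counter += 1
        return out[:n]

    def put(self, user_id: str, plaintext: bytes) -> None:
        key = self._keys.setdefault(user_id, secrets.token_bytes(32))
        ks = self._keystream(key, len(plaintext))
        self._data[user_id] = bytes(a ^ b for a, b in zip(plaintext, ks))

    def get(self, user_id: str):
        key = self._keys.get(user_id)
        if key is None:                        # key shredded: data is gone
            return None
        ct = self._data[user_id]
        ks = self._keystream(key, len(ct))
        return bytes(a ^ b for a, b in zip(ct, ks))

    def erase(self, user_id: str) -> None:
        self._keys.pop(user_id, None)          # Art. 17 via key deletion only

store = CryptoShredStore()
store.put("u123", b"alice@example.com")
assert store.get("u123") == b"alice@example.com"
store.erase("u123")
assert store.get("u123") is None               # ciphertext remains, but unreadable
```

Note that `erase` never touches the ciphertext: the stored bytes persist in object storage, backups, and replicas, yet become unrecoverable the moment the key is destroyed. That is exactly what makes this strategy cheap.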
Data Processing Agreements and Cross-Border Transfers
Standard Contractual Clauses (SCCs) for Cloud Processors
When data leaves the EU — even to a US-based cloud service — you need SCCs. Map your data flows:
| Data Flow | Transfer Mechanism | Risk Level |
|---|---|---|
| EU → AWS EU (Ireland/Frankfurt) | Within EEA — no SCC needed | 🟢 Low |
| EU → AWS US | SCC Module 2 (Controller → Processor) | 🟡 Medium |
| EU → Subprocessors (e.g., Datadog) | SCC in DPA + Article 28 clauses | 🟡 Medium |
| EU → China/Russia | No adequacy decision: SCCs plus a transfer impact assessment required; often impractical | 🔴 High |
| EU → Canada | Adequacy decision in place | 🟢 Low |
Monitoring and Alerting for GDPR Incidents
Under GDPR Article 33, you have 72 hours to notify the supervisory authority of a personal data breach. Your monitoring must be fast:
```yaml
# alerting/gdpr-breach-detection.yaml
alerts:
  - name: Unauthorized PII Access
    condition: >
      SELECT COUNT(*) FROM audit_logs
      WHERE resource = 'pii_vault'
      AND principal NOT IN (SELECT sa FROM allowed_principals)
      AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 5 MINUTE)
    threshold: 1
    severity: CRITICAL
    notification:
      - channel: pagerduty
        policy: gdpr-incident
      - channel: email
        recipients:
          - dpo@company.com
          - cto@company.com
    sla_hours: 72 # GDPR breach notification window

  - name: Bulk Data Export Anomaly
    condition: >
      bytes_exported > 10GB AND timeframe = '1h'
      AND NOT in_approved_jobs
    severity: HIGH

  - name: Retention Policy Violation
    condition: >
      data_age_days > retention_policy_days
      AND data_classification IN ('personal_data', 'sensitive_data')
    severity: MEDIUM
    auto_remediate: true
    remediation: trigger_erasure_workflow
```
GDPR Compliance Checklist for Cloud Data Engineers
| Control | Implementation | Status Check |
|---|---|---|
| Data Inventory | Automated catalog with PII tagging | Scan new tables on ingestion |
| Consent Management | Consent flags in user profiles | Block processing if no consent |
| Data Minimisation | Schema-level field necessity review | Quarterly schema audits |
| Pseudonymisation | Token vault with HMAC-SHA256 | Pen test token reversibility |
| Encryption at Rest | CMEK with 90-day rotation | Key rotation alerts |
| Encryption in Transit | TLS 1.3 enforced at load balancer | TLS scan weekly |
| Access Control | RBAC with principle of least privilege | Quarterly access reviews |
| Audit Logging | Immutable logs, 3-year retention | Log integrity checks daily |
| Right to Erasure | Automated within 30 days | Monthly erasure SLA report |
| Data Portability | Machine-readable export API | Quarterly API testing |
| DPIA Documentation | For high-risk processing activities | Before new data types |
| Breach Detection | < 24h detection, < 72h notification | Incident drill biannual |
| DPA with Processors | Signed SCCs with all vendors | Annual DPA audit |
Conclusion
GDPR compliance in cloud data platforms is achievable — but only if you treat it as an engineering discipline, not a legal afterthought. The key principles are:
- Build PII isolation into your architecture from day one — retrofitting is 10x harder
- Automate everything — manual compliance processes fail under scale
- Pseudonymise, don't anonymise — true anonymisation is nearly impossible; pseudonymisation is tractable
- Make erasure cheap — crypto shredding and token invalidation are your friends
- Log everything immutably — when the regulator asks, you need receipts
The architectures and code in this guide are battle-tested patterns for teams building on AWS, GCP, or Azure. Start with the data catalog and PII classifier — everything else follows from knowing what data you have.