Zero Trust Architecture for Data Platforms
"Never trust, always verify" — the zero trust principle — was coined for network security, but it's increasingly the right mental model for data platform access control. The perimeter-based model assumes that anything inside your VPC is safe. Modern data platforms span cloud accounts, regions, third-party services, and a workforce that accesses data from coffee shops. The perimeter is gone.
This guide covers how to implement zero trust principles specifically for data platforms: identity-first access, attribute-based controls, encryption at every layer, and continuous verification.
Why Data Platforms Are High-Value Targets
Data platforms aggregate the most sensitive information an organisation has:
- PII at scale (millions of customer records in one query)
- Financial data in analytical models
- Intellectual property in ML training sets
- Operational data that reveals business strategy
A compromised data warehouse isn't just a GDPR violation — it's potentially every trade secret the organisation has, queryable via SQL.
The traditional answer (VPC isolation + IP allowlisting) fails because:
- Most data is now in managed cloud services that don't live "inside" your VPC
- Analytical access requires broad read permissions that are difficult to scope
- Service accounts accumulate excessive permissions over time
Zero Trust Principles Applied to Data
Layer 1: Identity-First Data Access
Eliminate Service Account Key Files
Long-lived key files are the most common vector for data platform compromises. Replace them with short-lived credential exchange:
Terraform — OIDC trust policy for GitHub Actions:
data "aws_iam_policy_document" "github_actions_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:myorg/data-platform:ref:refs/heads/main"]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "pipeline_execution" {
  name                 = "data-pipeline-cicd"
  assume_role_policy   = data.aws_iam_policy_document.github_actions_trust.json
  max_session_duration = 3600 # 1 hour max
}
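For completeness, here is what the exchange looks like from the pipeline side — a hedged sketch using boto3 (in GitHub Actions the `aws-actions/configure-aws-credentials` action performs this exchange for you; the function names here are illustrative):

```python
def build_assume_role_request(role_arn, oidc_token,
                              session_name="data-pipeline-cicd", duration=3600):
    """Build the STS AssumeRoleWithWebIdentity request for a CI job."""
    return {
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        "WebIdentityToken": oidc_token,  # short-lived JWT minted by the OIDC issuer
        "DurationSeconds": duration,     # matches max_session_duration on the role
    }

def fetch_pipeline_credentials(role_arn, oidc_token):
    """Exchange the OIDC token for temporary credentials; nothing hits disk."""
    import boto3  # imported here so the builder above stays dependency-free
    sts = boto3.client("sts")
    resp = sts.assume_role_with_web_identity(
        **build_assume_role_request(role_arn, oidc_token)
    )
    return resp["Credentials"]  # AccessKeyId / SecretAccessKey / SessionToken + Expiration
```

The returned credentials expire with the session, so a leaked CI log or compromised runner cannot yield a long-lived secret.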
Databricks Unity Catalog — Identity Federation
# Databricks Unity Catalog with SCIM provisioning
resource "databricks_user" "data_engineer" {
  for_each     = var.data_engineer_emails
  user_name    = each.value
  display_name = each.key

  # SCIM handles provisioning/deprovisioning from IdP
  # No local password — SSO only
  force_delete_repos    = true
  force_delete_home_dir = true
}

resource "databricks_group_member" "de_team" {
  for_each  = var.data_engineer_emails
  group_id  = databricks_group.data_engineers.id
  member_id = databricks_user.data_engineer[each.key].id
}

# Grant table access to group, not individuals
resource "databricks_grants" "silver_layer" {
  table = "main.silver.customer_events"

  grant {
    principal  = "data-engineers"
    privileges = ["SELECT", "MODIFY"]
  }

  grant {
    principal  = "analysts"
    privileges = ["SELECT"]
  }
}
Layer 2: Attribute-Based Access Control (ABAC)
Role-based access control (RBAC) doesn't scale for data platforms. When you have 500 tables, 50 teams, and 3 environments, the RBAC matrix explodes. ABAC uses data attributes (classification, domain, sensitivity) and user attributes (team, clearance, location) to compute access dynamically.
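The core ABAC evaluation is simple to express. A minimal sketch — the attribute names mirror the classification tags used in this section, but the function is illustrative, not a real policy engine:

```python
from dataclasses import dataclass, field

# Sensitivity ladder; PII is handled out-of-band via explicit clearance
LADDER = ["public", "internal", "confidential"]

@dataclass
class DataAsset:
    classification: str  # e.g. from a "data_classification" tag
    domain: str = "customer"

@dataclass
class User:
    team: str
    max_classification: str = "internal"
    clearances: set = field(default_factory=set)  # e.g. {"PII"} after DPO approval

def can_read(user: User, asset: DataAsset) -> bool:
    """Compute access from attributes — no per-table grant matrix."""
    if asset.classification == "PII":
        return "PII" in user.clearances
    return LADDER.index(asset.classification) <= LADDER.index(user.max_classification)
```

Adding a 501st table costs nothing: it gets a classification tag and the existing rules apply, instead of 50 new grant rows.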
Data Classification Tags
# Tag every data asset at creation
resource "aws_glue_catalog_table" "customer_pii" {
  name          = "customer_profiles"
  database_name = aws_glue_catalog_database.silver.name

  parameters = {
    "data_classification" = "PII"
    "data_domain"         = "customer"
    "sensitivity"         = "high"
    "gdpr_relevant"       = "true"
    "retention_days"      = "730"
    "owner_team"          = "customer-platform"
  }

  # ... schema definition
}
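Tags are only useful if they are enforced. A hedged sketch of a CI check that rejects Glue tables missing the required classification parameters — the required-key list matches the example above; fetching tables via boto3's `get_tables` is left as a comment:

```python
REQUIRED_TAGS = {"data_classification", "data_domain", "sensitivity", "owner_team"}
ALLOWED_CLASSIFICATIONS = {"public", "internal", "confidential", "PII", "restricted"}

def validate_table_tags(parameters: dict) -> list:
    """Return a list of violations for one table's parameters; empty means compliant."""
    violations = [f"missing tag: {k}" for k in sorted(REQUIRED_TAGS - set(parameters))]
    cls = parameters.get("data_classification")
    if cls is not None and cls not in ALLOWED_CLASSIFICATIONS:
        violations.append(f"unknown classification: {cls}")
    return violations

# In CI: page through boto3.client("glue").get_tables(DatabaseName=...) and fail
# the build if any table returns a non-empty violation list.
```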
Lake Formation — ABAC Tag Policy
# Grant access based on data classification tags, not specific tables
resource "aws_lakeformation_lf_tag" "classification" {
  key    = "data_classification"
  values = ["public", "internal", "confidential", "PII", "restricted"]
}

# Data engineers can access internal and confidential, not PII
resource "aws_lakeformation_permissions" "engineer_access" {
  principal   = "arn:aws:iam::123456789012:role/data-engineers"
  permissions = ["SELECT", "DESCRIBE"]

  lf_tag_policy {
    resource_type = "TABLE"

    expression {
      key    = "data_classification"
      values = ["public", "internal", "confidential"]
    }
  }
}

# PII access requires explicit DPO approval (separate role)
resource "aws_lakeformation_permissions" "pii_approved_access" {
  principal                     = "arn:aws:iam::123456789012:role/pii-approved-analysts"
  permissions                   = ["SELECT"]
  permissions_with_grant_option = []

  lf_tag_policy {
    resource_type = "TABLE"

    expression {
      key    = "data_classification"
      values = ["PII"]
    }
  }
}
Layer 3: Column-Level Security and Data Masking
Even users with table access shouldn't always see all columns. Column-level security with dynamic masking implements this without duplicating data.
BigQuery Column-Level Security
-- Create a policy tag taxonomy
-- (done via Data Catalog API or Terraform)

-- Assign policy tag to sensitive column
CREATE OR REPLACE TABLE analytics.customer_orders (
  order_id STRING,
  customer_id STRING,
  email STRING OPTIONS (
    description = 'PII — protected by policy tag',
    policy_tags = ['projects/my-project/locations/us/taxonomies/12345/policyTags/67890']
  ),
  amount_usd NUMERIC,
  created_at TIMESTAMP
);

-- Analysts without the "PII Viewer" role see:
--   SELECT * → email column returns NULL or REDACTED
--   No error, no indication that data is being masked
Snowflake Dynamic Data Masking
-- Create masking policy
CREATE OR REPLACE MASKING POLICY pii_email_mask AS (val STRING)
RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('PII_APPROVED_ANALYST', 'DPO_TEAM') THEN val
    WHEN CURRENT_ROLE() = 'ANALYST' THEN
      REGEXP_REPLACE(val, '(.{2}).*(@.*)', '\\1***\\2') -- partial mask
    ELSE '***REDACTED***'
  END;

-- Apply to column
ALTER TABLE customer_orders
  MODIFY COLUMN email
  SET MASKING POLICY pii_email_mask;

-- Test as analyst role:
USE ROLE ANALYST;
SELECT email FROM customer_orders LIMIT 5;
-- Returns: jo***@example.com, ma***@company.org, ...
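Masking regexes are easy to get subtly wrong, so it's worth sanity-checking the pattern outside Snowflake. Python's `re` engine behaves the same way for this particular pattern (the helper name is illustrative):

```python
import re

def partial_mask(email: str) -> str:
    """Keep the first two characters and the domain; mask everything between."""
    return re.sub(r"(.{2}).*(@.*)", r"\1***\2", email)
```

One caveat the SQL policy shares: local parts shorter than two characters (e.g. `a@b.com`) don't match the pattern and pass through unmasked, so very short addresses fall back to whatever the non-matching behaviour is — worth covering with an explicit rule if that matters for your data.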
Layer 4: Network Micro-Segmentation
Private Endpoints for All Data Services
# S3 Gateway Endpoint (free)
resource "aws_vpc_endpoint" "s3" {
  vpc_id          = aws_vpc.data_platform.id
  service_name    = "com.amazonaws.${var.region}.s3"
  route_table_ids = aws_route_table.private[*].id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = "*"
      Action    = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]
      Resource = [
        aws_s3_bucket.lakehouse.arn,
        "${aws_s3_bucket.lakehouse.arn}/*"
      ]
    }]
  })
}

# Restrict S3 bucket to VPC endpoint only
resource "aws_s3_bucket_policy" "lakehouse_vpc_only" {
  bucket = aws_s3_bucket.lakehouse.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Deny"
      Principal = "*"
      Action    = "s3:*"
      Resource = [
        aws_s3_bucket.lakehouse.arn,
        "${aws_s3_bucket.lakehouse.arn}/*"
      ]
      Condition = {
        StringNotEquals = {
          "aws:SourceVpce" = aws_vpc_endpoint.s3.id
        }
      }
    }]
  })
}
Layer 5: Encryption at Every Layer
Encryption Architecture
# Separate KMS keys per data classification
resource "aws_kms_key" "pii_data" {
  description             = "PII data encryption — data platform"
  deletion_window_in_days = 30
  enable_key_rotation     = true

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid       = "Enable DPO team management"
        Effect    = "Allow"
        Principal = { AWS = var.dpo_team_role_arn }
        Action    = ["kms:*"]
        Resource  = "*"
      },
      {
        Sid    = "Allow approved roles to use key"
        Effect = "Allow"
        Principal = { AWS = [
          var.pii_pipeline_role_arn,
          var.pii_analyst_role_arn
        ] }
        Action   = ["kms:GenerateDataKey", "kms:Decrypt"]
        Resource = "*"
      },
      {
        Sid       = "Deny all others"
        Effect    = "Deny"
        Principal = { AWS = "*" }
        Action    = ["kms:GenerateDataKey", "kms:Decrypt"]
        Resource  = "*"
        Condition = {
          StringNotLike = {
            "aws:PrincipalArn" = [
              var.dpo_team_role_arn,
              var.pii_pipeline_role_arn,
              var.pii_analyst_role_arn
            ]
          }
        }
      }
    ]
  })
}
Layer 6: Continuous Verification and Anomaly Detection
Zero trust isn't "verify once and trust." It's continuous.
Query Anomaly Detection
# Pseudocode for query audit log analysis
# Run as a scheduled Spark job on CloudTrail / audit logs
from pyspark.sql import functions as F

audit_logs = spark.table("security.data_access_audit")

# Detect unusual data volume access
anomalies = (
    audit_logs
    .where(F.col("event_date") == F.current_date())
    .groupBy("principal_id", "table_name")
    .agg(
        F.sum("bytes_scanned").alias("bytes_today"),
        F.count("*").alias("query_count"),
    )
    .join(
        # Compare against 30-day baseline
        audit_logs
        .where(F.col("event_date") >= F.date_sub(F.current_date(), 30))
        .groupBy("principal_id", "table_name")
        .agg((F.sum("bytes_scanned") / 30).alias("avg_daily_bytes")),
        on=["principal_id", "table_name"],
        how="left",
    )
    .where(F.col("bytes_today") > F.col("avg_daily_bytes") * 10)  # 10x spike
)

# Alert via PagerDuty / Slack — collect to the driver (anomaly sets are small;
# DataFrame.foreach would run the alerting client on executors)
for row in anomalies.collect():
    alert_security_team(row)
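The spike rule itself is trivial to unit test once pulled out of the Spark job. A sketch — the threshold and the no-baseline behaviour are policy choices, not prescriptions; note in particular that the Spark filter silently drops principal/table pairs with no 30-day baseline (the null comparison evaluates false), whereas first-ever access to a table is arguably the most interesting event:

```python
from typing import Optional

def is_anomalous(bytes_today: int, avg_daily_bytes: Optional[float],
                 spike_factor: float = 10.0) -> bool:
    """True when today's scan volume exceeds the 30-day baseline by spike_factor."""
    if avg_daily_bytes is None:  # no baseline (left-join miss): flag any access
        return bytes_today > 0
    return bytes_today > avg_daily_bytes * spike_factor
```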
Harbinger Explorer for API Access Auditing
When your data platform exposes APIs (and they all do — from Athena federation endpoints to custom REST APIs), you need continuous visibility into which endpoints are being called, with what parameters, and whether responses match expected schemas. Harbinger Explorer provides this testing and monitoring layer, letting you catch unexpected access patterns or schema deviations before they become security incidents.
Zero Trust Maturity Model
| Level | Description | Key Controls |
|---|---|---|
| 0 — Implicit trust | VPC = trusted; anyone inside can query anything | None |
| 1 — Identity-aware | Authentication required; coarse RBAC | SSO, basic roles |
| 2 — Data-aware | ABAC on data classification; column masking | Policy tags, masking policies |
| 3 — Context-aware | Access varies by time, location, device posture | Conditional access, MFA step-up |
| 4 — Continuous | Every query re-evaluated; anomaly detection; immutable audit logs | SIEM integration, ML anomaly detection |
Most mature data platforms operate at Level 2-3. Level 4 is appropriate for organisations handling financial services data, healthcare records, or government information.
Summary
Zero trust for data platforms is a layered discipline: identity-first authentication eliminates the key file problem; ABAC scales access control beyond what RBAC can manage; column-level masking protects sensitive fields without data duplication; network micro-segmentation removes lateral movement; and continuous verification catches anomalies before they become breaches.
Start with Layer 1 (eliminate key files, enforce SSO) and Layer 2 (classify your data, apply ABAC). The impact-to-effort ratio is highest there, and it builds the foundation for the deeper controls.
Try Harbinger Explorer free for 7 days — validate your data API security posture, test that your access controls return correct responses, and monitor for unexpected access patterns across your data platform endpoints. harbingerexplorer.com