Data Catalog Federation Across Cloud Platforms
Every platform team eventually hits the same wall: your data lives in three clouds, two on-prem systems, and a SaaS vendor's API — and nobody can find anything. You have catalogs, plural. What you don't have is a catalog.
Data catalog federation is the architectural pattern that solves this without forcing a rip-and-replace migration. Instead of consolidating all metadata into one tool, you connect multiple catalogs through shared protocols, creating a unified discovery layer while letting each domain keep its own catalog autonomy.
This article breaks down the federation patterns that actually work, compares the major catalog options, and gives you a decision framework for choosing the right architecture.
Why Single-Catalog Strategies Fail at Scale
The instinct is understandable: pick one catalog, make everyone use it, declare victory. In practice, this fails for three reasons:
- Acquisitions and mergers bring entirely new data stacks overnight
- Multi-cloud strategies mean each cloud provider's managed services already have their own catalog (AWS Glue, Google Dataplex, Azure Purview)
- Domain autonomy — data mesh or not, different teams choose different tools
Forcing consolidation creates a metadata migration project that's just as painful as a data migration. Federation sidesteps this by treating catalogs as peers in a network rather than competitors in a winner-take-all contest.
The Three Federation Patterns
Not all federation is created equal. The pattern you choose determines your complexity budget, governance model, and vendor flexibility.
Pattern 1: Hub-and-Spoke (Central Aggregator)
One catalog serves as the primary discovery layer and pulls metadata from satellite catalogs. Users search one place; governance policies are enforced centrally.
Examples: Databricks Unity Catalog with Lakehouse Federation, Atlan connecting to multiple sources, Collibra as governance hub.
Best for: Organizations with a clear primary analytics platform that want unified governance without migrating data.
[Diagram: hub-and-spoke federation. A central hub catalog connected to AWS Glue, Azure Purview, GCP Dataplex, and a legacy Hive Metastore]
① Satellite catalogs (Glue, Purview, Dataplex, HMS) maintain their own metadata independently
② The central hub pulls metadata on a schedule or via change events
③ Users query the hub for discovery; governance policies apply at the hub layer
④ Data stays in place — only metadata moves
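The hub's pull step (②) is, at its core, an aggregation job: fetch table metadata from each spoke, qualify the names, and index everything in one place. The sketch below is illustrative only — real hubs (Unity Catalog, Atlan, Collibra) ship their own connectors, and every name here is invented.

```python
# Sketch of a hub-and-spoke metadata sync. All names are illustrative --
# real hubs use vendor connectors (Glue SDK, Purview API, etc.).

def build_hub_index(satellites: dict) -> dict:
    """Merge table metadata from several satellite catalogs into one
    searchable hub index, tagging each entry with its source catalog."""
    index = {}
    for catalog_name, tables in satellites.items():
        for table in tables:
            # Qualify the key so two catalogs can both hold "sales.orders"
            key = f"{catalog_name}.{table['schema']}.{table['name']}"
            index[key] = {**table, "source_catalog": catalog_name}
    return index

# Hypothetical metadata pulled from two spokes
satellites = {
    "glue": [{"schema": "sales", "name": "orders", "format": "iceberg"}],
    "purview": [{"schema": "finance", "name": "ledger", "format": "delta"}],
}
hub = build_hub_index(satellites)
print(sorted(hub))  # ['glue.sales.orders', 'purview.finance.ledger']
```

Note that only metadata moves through this job — the tables themselves stay in S3, ADLS, or GCS, which is the whole point of the pattern.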
Pattern 2: Protocol-Based Federation (Iceberg REST API)
Instead of a central aggregator, catalogs expose a shared protocol — specifically the Iceberg REST Catalog API. Any engine that speaks the protocol can discover and access tables from any compatible catalog.
Examples: Apache Polaris, Apache Gravitino, Unity Catalog (as Iceberg REST endpoint), Project Nessie.
Best for: Organizations committed to open table formats (Iceberg) that want engine-agnostic access without a single governance chokepoint.
[Diagram: protocol-based federation. Multiple compute engines connect to multiple Iceberg REST catalogs through the shared API]
① Query engines connect to any catalog that implements the Iceberg REST API
② No single catalog "owns" the metadata — engines choose their catalog at query time
③ Governance must be implemented per-catalog or via a separate policy layer
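In practice, attaching an engine to a REST catalog is a handful of configuration keys rather than an integration project. The property names below follow the Iceberg Spark integration's `spark.sql.catalog.<name>.*` convention; the catalog name, URI, and warehouse path are placeholders for your deployment.

```python
# Spark configuration for attaching an Iceberg REST catalog named "lake".
# Property keys follow the Iceberg Spark integration; the URI and
# warehouse values are placeholders.
rest_catalog_conf = {
    "spark.sql.catalog.lake": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.lake.type": "rest",
    "spark.sql.catalog.lake.uri": "https://catalog.example.com/api/catalog",
    "spark.sql.catalog.lake.warehouse": "s3://my-bucket/warehouse",
}

# In a real job these would be applied to the session builder, e.g.:
#   builder = SparkSession.builder
#   for k, v in rest_catalog_conf.items():
#       builder = builder.config(k, v)
# ...after which tables resolve as: SELECT * FROM lake.sales.orders
print(rest_catalog_conf["spark.sql.catalog.lake.type"])  # rest
```

The same shape repeats across engines: Trino, Flink, and Dremio each have an equivalent "type = rest, uri = ..." stanza, which is what makes the catalog swappable.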
Pattern 3: Metadata Mesh (Decentralized with Contracts)
Each domain owns its catalog and publishes metadata contracts — schemas, SLAs, ownership, and quality metrics — to a shared registry. There's no central catalog; instead, a lightweight discovery service indexes published contracts.
Examples: Custom implementations using DataHub, Apache Atlas, or OpenMetadata as the contract registry.
Best for: Large enterprises with strong domain ownership where centralized governance is politically or technically infeasible.
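A metadata contract in this pattern can be as small as a typed record that every domain must publish to the registry. The fields below are an illustrative minimum, not a formal standard — real implementations would encode this in the registry's own schema (DataHub aspects, OpenMetadata entities, etc.).

```python
from dataclasses import dataclass, field

# Illustrative metadata contract a domain publishes to the shared
# registry. Field names are an example minimum, not a formal standard.
@dataclass
class MetadataContract:
    dataset: str                 # fully qualified, e.g. "payments.ledger"
    owner: str                   # accountable team or individual
    schema_version: str          # semantic version of the published schema
    freshness_sla_minutes: int   # maximum acceptable staleness
    quality_checks: list = field(default_factory=list)

def validate(contract: MetadataContract) -> list:
    """Return a list of contract violations (empty list = valid)."""
    problems = []
    if "." not in contract.dataset:
        problems.append("dataset must be fully qualified (domain.name)")
    if not contract.owner:
        problems.append("owner is required")
    if contract.freshness_sla_minutes <= 0:
        problems.append("freshness SLA must be positive")
    return problems

c = MetadataContract("payments.ledger", "payments-team", "2.1.0", 60)
print(validate(c))  # []
```

The discovery service then only has to index validated contracts — it never needs write access to any domain's catalog, which is what keeps the pattern politically viable.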
Catalog Comparison: Federation Capabilities
| Capability | Unity Catalog | AWS Glue | Dataplex (GCP) | Purview (Azure) | Apache Polaris | Apache Gravitino |
|---|---|---|---|---|---|---|
| Iceberg REST API | ✅ Native | ❌ | ❌ | ❌ | ✅ Native | ✅ Native |
| Cross-cloud federation | ✅ Lakehouse Fed. | ❌ AWS only | ❌ GCP only | 🟡 Azure + Fabric | ✅ Cloud-agnostic | ✅ Cloud-agnostic |
| Foreign catalog support | ✅ Glue, HMS, Snowflake | ❌ | 🟡 BigQuery + GCS | 🟡 Azure sources | N/A (is the catalog) | ✅ Multiple backends |
| Open source | 🟡 Partial (OSS version) | ❌ | ❌ | ❌ | ✅ Apache 2.0 | ✅ Apache 2.0 |
| Governance built-in | ✅ RBAC, lineage | 🟡 IAM-based | ✅ Policy tags | ✅ Classifications | 🟡 Basic RBAC | 🟡 Basic |
| Multi-format (Delta + Iceberg) | ✅ | ❌ Delta | ❌ | ❌ | ❌ Iceberg only | ✅ Multi-format |
| Managed service available | ✅ Databricks | ✅ AWS | ✅ GCP | ✅ Azure | ❌ Self-host | ❌ Self-host |
Last verified: April 2026
Decision Framework: When to Choose Which Pattern
Choosing a federation pattern isn't a technology decision — it's an organizational one. Here's the framework:
Choose Hub-and-Spoke When:
- You have a dominant analytics platform (e.g., 70%+ of queries go through Databricks or Snowflake)
- Centralized governance is a requirement (regulated industries, SOX compliance)
- You want fast time-to-value — one team can implement this without cross-org coordination
- You're willing to accept some vendor coupling at the metadata layer
Choose Protocol-Based Federation When:
- You're committed to open table formats (Iceberg, Delta) and don't want catalog lock-in
- You run multiple compute engines (Spark, Flink, Trino, Dremio) and need them all to see the same tables
- You have the engineering capacity to self-host and operate catalog infrastructure
- Engine interoperability matters more than unified governance
Choose Metadata Mesh When:
- You operate at very large scale (100+ data domains, multiple business units)
- Domain autonomy is non-negotiable — teams won't accept a centrally mandated catalog
- You have mature data engineering teams in each domain capable of maintaining their own catalog
- You're implementing data mesh and need catalog patterns that match your org structure
The Iceberg REST Catalog: The Federation Protocol That Changed Everything
The single biggest shift in catalog federation happened when the Apache Iceberg community standardized the REST Catalog API. Before this, every catalog spoke its own language. Now, a growing number of catalogs implement the same HTTP API, making cross-catalog interoperability a configuration change rather than an integration project.
Key Players in the Iceberg REST Ecosystem
Apache Polaris — Originally open-sourced by Snowflake, now an Apache project. It's a pure Iceberg catalog that implements the REST API specification. If your entire stack is Iceberg-native, Polaris is the most focused option. It doesn't try to be a governance platform — it's a catalog, full stop.
Apache Gravitino — Graduated to Apache Top-Level Project in May 2025. Gravitino takes a broader approach: it's a metadata federation layer that supports Iceberg, Hive, and multiple file formats. Think of it as a meta-catalog that can front multiple backend catalogs through a unified API. For organizations with heterogeneous storage, Gravitino is the most flexible open-source option.
Databricks Unity Catalog — The managed option. Unity Catalog natively exposes an Iceberg REST endpoint, meaning external engines (Trino, Flink, Dremio) can read Unity Catalog tables via standard Iceberg REST calls. Simultaneously, Lakehouse Federation lets Unity Catalog read from foreign catalogs (Glue, HMS, Snowflake Horizon). It's the most complete hub-and-spoke implementation available, but it ties your metadata control plane to Databricks.
Project Nessie — Adds Git-like branching and versioning to catalog operations. You can branch a catalog, experiment with schema changes, and merge back — useful for CI/CD workflows on data. Nessie implements the Iceberg REST API and is used by Dremio's lakehouse platform.
What the REST API Actually Standardizes
The Iceberg REST Catalog API defines endpoints for:
- Namespace management — creating, listing, and deleting namespaces (databases)
- Table operations — create, load, update, rename, drop tables
- Metadata retrieval — fetching current table metadata (schema, partitioning, snapshots)
- Commit protocol — atomic metadata updates with conflict detection
What it does not standardize: access control, lineage, data quality metrics, or business glossary terms. This is why protocol-based federation alone doesn't solve governance — you still need a policy layer on top.
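The endpoint shapes are simple enough to sketch. The paths below follow the published REST spec's `/v1/{prefix}/namespaces/...` layout, with multi-level namespaces joined by the `0x1F` unit separator (URL-encoded as `%1F`); the base URL and prefix are placeholders, and the actual HTTP calls are left as comments.

```python
# Helpers that build Iceberg REST Catalog API paths. Path shapes follow
# the public spec; the base URL and prefix are placeholders.
BASE = "https://catalog.example.com/v1"

def namespace_path(prefix: str, namespace: list) -> str:
    # Multi-level namespaces are joined with the 0x1F unit separator,
    # which URL-encodes to %1F.
    ns = "%1F".join(namespace)
    return f"{BASE}/{prefix}/namespaces/{ns}"

def tables_path(prefix: str, namespace: list) -> str:
    return namespace_path(prefix, namespace) + "/tables"

def table_path(prefix: str, namespace: list, table: str) -> str:
    return tables_path(prefix, namespace) + f"/{table}"

# A real client would then issue, e.g.:
#   GET  tables_path(...)   -> list tables in a namespace
#   GET  table_path(...)    -> load current table metadata
#   POST table_path(...)    -> commit an atomic metadata update
print(table_path("lake", ["sales", "eu"], "orders"))
# https://catalog.example.com/v1/lake/namespaces/sales%1Feu/tables/orders
```

Because every compliant catalog answers these same paths, switching catalogs really is a change of `BASE` — which is the configuration-not-integration claim made above.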
Real-World Architecture: Multi-Cloud with Federated Catalogs
Here's a realistic architecture for an organization running workloads across AWS and Azure with a GCP analytics sandbox:
[Diagram: multi-cloud federation. Unity Catalog as the hub across Azure, AWS, and GCP, with Gravitino as a federation layer for external engines]
① Unity Catalog serves as the primary hub, pulling metadata from Glue and BigQuery via Lakehouse Federation
② Gravitino provides an engine-agnostic REST endpoint for non-Databricks engines (EMR Spark, Trino)
③ Databricks workspaces query through Unity Catalog natively
④ Data remains in its original cloud — S3, ADLS, GCS — only metadata crosses boundaries
This hybrid approach combines hub-and-spoke (Unity Catalog as the governance center) with protocol-based federation (Gravitino for engine flexibility). It's pragmatic: you get centralized governance where compliance demands it, and open, engine-agnostic access everywhere else.
Governance in a Federated World
Federation without governance is just distributed chaos. Here's how to layer governance across federated catalogs:
Access Control Strategy
| Layer | Mechanism | Example |
|---|---|---|
| Cloud IAM | Cloud-native identity | AWS IAM roles, Azure AD, GCP IAM |
| Catalog RBAC | Table/schema-level permissions | Unity Catalog grants, Polaris RBAC |
| Column-level security | Masking and filtering | Dynamic views, Purview sensitivity labels |
| Row-level security | Predicate-based filtering | Unity Catalog row filters, BigQuery row policies |
| Cross-catalog policy | Centralized policy engine | Immuta, Privacera, or Open Policy Agent |
The key insight: in a federated model, you can't rely on a single catalog's RBAC. You need either a policy engine that sits above all catalogs (Immuta, Privacera) or you need to replicate policies across catalogs — which is its own maintenance burden.
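Whichever engine you choose, the shape of a cross-catalog policy layer is the same: one decision function, consulted by every catalog, evaluating one rule set. The sketch below is a toy stand-in for that idea — it is not the API of OPA, Immuta, or Privacera, and all names in it are invented.

```python
# Toy cross-catalog policy layer. Real deployments delegate this decision
# to a policy engine (e.g. Open Policy Agent); all names are illustrative.
POLICIES = [
    # Each rule: a role, a catalog pattern ("*" = any), and an action.
    {"role": "analyst",  "catalog": "*",    "action": "read"},
    {"role": "pipeline", "catalog": "glue", "action": "write"},
]

def is_allowed(role: str, catalog: str, action: str) -> bool:
    """Evaluate the same rule set regardless of which catalog asks."""
    for p in POLICIES:
        if (p["role"] == role and p["action"] == action
                and p["catalog"] in ("*", catalog)):
            return True
    return False

print(is_allowed("analyst", "purview", "read"))   # True
print(is_allowed("analyst", "purview", "write"))  # False
```

The alternative — replicating equivalent grants into each catalog's native RBAC — works at small scale, but every policy change then becomes an N-catalog deployment.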
Lineage Across Catalogs
Cross-catalog lineage is the hardest problem in federation. Within a single catalog (Unity Catalog, Purview), lineage is tracked automatically. Across catalogs, you need one of:
- OpenLineage — The emerging standard. Spark, Airflow, dbt, and Flink can emit OpenLineage events that a central collector (Marquez, Atlan, DataHub) aggregates into a unified lineage graph
- Manual registration — Teams document cross-catalog dependencies in a shared registry. This doesn't scale, but it's honest about the current state of tooling
- Vendor solutions — Atlan, Collibra, and Alation all offer multi-source lineage, but require connectors to each catalog. Pricing starts at $60k+/year for enterprise tiers
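An OpenLineage event is just a JSON document a job emits at run start and completion. The structure below follows the public spec's RunEvent shape (eventType, run, job, inputs, outputs); the namespaces, job name, and collector URL are placeholders, and the HTTP POST is left as a comment since collectors vary.

```python
import json
import uuid
from datetime import datetime, timezone

# Minimal OpenLineage run event. Field names follow the public spec;
# namespaces, job name, and the collector URL below are placeholders.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "producer": "https://example.com/my-pipeline",  # placeholder
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "aws-domain", "name": "orders_daily_load"},
    # Cross-catalog lineage: input lives in Glue, output in Unity Catalog
    "inputs": [{"namespace": "glue", "name": "sales.orders_raw"}],
    "outputs": [{"namespace": "unity", "name": "analytics.orders"}],
}

payload = json.dumps(event)
# A collector such as Marquez accepts these events over HTTP, e.g.:
#   requests.post("http://marquez.example.com/api/v1/lineage",
#                 data=payload,
#                 headers={"Content-Type": "application/json"})
print(event["job"]["name"])  # orders_daily_load
```

Because the input and output datasets reference different catalog namespaces, the collector can stitch a lineage edge that neither catalog could see on its own — which is exactly the cross-catalog gap described above.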
Common Pitfalls and How to Avoid Them
Pitfall 1: Treating federation as a technology problem. Federation is an organizational pattern. If teams don't agree on naming conventions, ownership, and metadata standards, no technology will save you. Start with a metadata contract: what fields are required, who owns what, what SLAs apply.
Pitfall 2: Over-federating too early. If you're on one cloud with one analytics platform, you don't need federation — you need a catalog. Federation adds complexity that's only justified when you genuinely have multiple catalogs that must coexist.
Pitfall 3: Ignoring staleness. Federated metadata is eventually consistent. When catalog A updates a schema and catalog B still shows the old version, users lose trust. Define your staleness SLA (minutes? hours?) and monitor it.
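Monitoring that staleness SLA can start as a simple comparison of each spoke's last-sync timestamp against the budget. A minimal sketch, with invented field names — each catalog exposes its sync state differently:

```python
from datetime import datetime, timedelta, timezone

# Sketch of a staleness check across federated catalogs. The
# "last synced" timestamps are illustrative; each catalog exposes
# sync state through its own API.
def stale_catalogs(sync_times: dict, sla: timedelta, now=None) -> list:
    """Return the catalogs whose last metadata sync exceeds the SLA."""
    now = now or datetime.now(timezone.utc)
    return sorted(name for name, t in sync_times.items() if now - t > sla)

now = datetime(2026, 4, 1, 12, 0, tzinfo=timezone.utc)
sync_times = {
    "glue": now - timedelta(minutes=5),
    "purview": now - timedelta(hours=3),  # breaches a 1-hour SLA
}
print(stale_catalogs(sync_times, timedelta(hours=1), now=now))  # ['purview']
```

Alerting on this list turns staleness from a silent trust-eroder into an operational metric with an owner.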
Pitfall 4: Assuming Iceberg REST solves governance. The Iceberg REST API is a table management protocol, not a governance framework. You still need access control, lineage, and data quality — the REST API doesn't provide these.
Pitfall 5: Choosing open-source catalogs without operational readiness. Polaris and Gravitino are powerful but require self-hosting. If your team doesn't have the capacity to run, monitor, and upgrade a catalog service, a managed option (Unity Catalog, Glue, Purview) is the safer bet — even with the vendor coupling trade-off.
What's Coming Next
The catalog federation landscape is moving fast. Several trends to watch:
- Microsoft OneLake + Iceberg REST — Microsoft announced support for Iceberg REST and Unity Catalog Open APIs in OneLake, which means Fabric data could become accessible to any Iceberg-compatible engine. This blurs the line between Azure-native and open catalog approaches
- Gravitino as the meta-catalog — With Apache TLP status (May 2025), Gravitino is positioned to become the default open-source federation layer for heterogeneous environments
- OpenLineage adoption — As more engines emit OpenLineage events natively, cross-catalog lineage goes from aspirational to practical
- AI-driven metadata management — Catalogs are adding LLM-powered search, auto-tagging, and schema suggestion. This matters for federation because it can automatically reconcile metadata across catalogs with different naming conventions
If you're exploring how data scattered across APIs, CSVs, and cloud sources fits into a unified view, Harbinger Explorer offers a lightweight starting point. Its browser-based DuckDB engine lets you query multiple data sources with natural language — no catalog infrastructure required. It won't replace Unity Catalog for enterprise governance, but for rapid exploration across sources, it removes the setup friction.
Your Next Move
Start with an honest inventory: how many catalogs do you actually have today? Count Glue, HMS instances, Purview, Dataplex, and any internal metadata stores. If the answer is one, invest in that catalog. If it's three or more, federation isn't optional — it's already happening, just without a strategy.
Pick the pattern that matches your organization, not your technology preferences. Hub-and-spoke for centralized teams, protocol-based for multi-engine shops, metadata mesh for large federated organizations. Then implement incrementally — connect two catalogs first, prove the pattern, and expand.
The goal isn't one catalog to rule them all. It's one way to find anything, everywhere.