Data Catalog Federation Across Cloud Platforms
Every platform team eventually hits the same wall: your data lives in three clouds, two on-prem systems, and a SaaS vendor's API — and nobody can find anything. You have catalogs, plural. What you don't have is a catalog.
Data catalog federation is the architectural pattern that solves this without forcing a rip-and-replace migration. Instead of consolidating all metadata into one tool, you connect multiple catalogs through shared protocols, creating a unified discovery layer while letting each domain keep its own catalog autonomy.
This article breaks down the federation patterns that actually work, compares the major catalog options, and gives you a decision framework for choosing the right architecture.
Why Single-Catalog Strategies Fail at Scale
The instinct is understandable: pick one catalog, make everyone use it, declare victory. In practice, this fails for three reasons:
- Acquisitions and mergers bring entirely new data stacks overnight
- Multi-cloud strategies mean each cloud provider's managed services already have their own catalog (AWS Glue, Google Dataplex, Azure Purview)
- Domain autonomy — data mesh or not, different teams choose different tools
Forcing consolidation creates a metadata migration project that's just as painful as a data migration. Federation sidesteps this by treating catalogs as peers in a network rather than competitors in a winner-take-all contest.
The Three Federation Patterns
Not all federation is created equal. The pattern you choose determines your complexity budget, governance model, and vendor flexibility.
Pattern 1: Hub-and-Spoke (Central Aggregator)
One catalog serves as the primary discovery layer and pulls metadata from satellite catalogs. Users search one place; governance policies are enforced centrally.
Examples: Databricks Unity Catalog with Lakehouse Federation, Atlan connecting to multiple sources, Collibra as governance hub.
Best for: Organizations with a clear primary analytics platform that want unified governance without migrating data.
[Diagram: hub-and-spoke federation. A central hub catalog connected to AWS Glue, Azure Purview, GCP Dataplex, and a legacy Hive Metastore]
① Satellite catalogs (Glue, Purview, Dataplex, HMS) maintain their own metadata independently
② The central hub pulls metadata on a schedule or via change events
③ Users query the hub for discovery; governance policies apply at the hub layer
④ Data stays in place — only metadata moves
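The hub's pull step (②) is, at its core, an aggregation job: fetch table metadata from each spoke, qualify the names, and index everything in one place. The sketch below is illustrative only — real hubs (Unity Catalog, Atlan, Collibra) ship their own connectors, and every name here is invented.

```python
# Sketch of a hub-and-spoke metadata sync. All names are illustrative --
# real hubs use vendor connectors (Glue SDK, Purview API, etc.).

def build_hub_index(satellites: dict) -> dict:
    """Merge table metadata from several satellite catalogs into one
    searchable hub index, tagging each entry with its source catalog."""
    index = {}
    for catalog_name, tables in satellites.items():
        for table in tables:
            # Qualify the key so two catalogs can both hold "sales.orders"
            key = f"{catalog_name}.{table['schema']}.{table['name']}"
            index[key] = {**table, "source_catalog": catalog_name}
    return index

# Hypothetical metadata pulled from two spokes
satellites = {
    "glue": [{"schema": "sales", "name": "orders", "format": "iceberg"}],
    "purview": [{"schema": "finance", "name": "ledger", "format": "delta"}],
}
hub = build_hub_index(satellites)
print(sorted(hub))  # ['glue.sales.orders', 'purview.finance.ledger']
```

Note that only metadata moves through this job — the tables themselves stay in S3, ADLS, or GCS, which is the whole point of the pattern.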
Pattern 2: Protocol-Based Federation (Iceberg REST API)
Instead of a central aggregator, catalogs expose a shared protocol — specifically the Iceberg REST Catalog API. Any engine that speaks the protocol can discover and access tables from any compatible catalog.
Examples: Apache Polaris, Apache Gravitino, Unity Catalog (as Iceberg REST endpoint), Project Nessie.
Best for: Organizations committed to open table formats (Iceberg) that want engine-agnostic access without a single governance chokepoint.
[Diagram: protocol-based federation. Multiple compute engines connect to multiple Iceberg REST catalogs through the shared API]
① Query engines connect to any catalog that implements the Iceberg REST API
② No single catalog "owns" the metadata — engines choose their catalog at query time
③ Governance must be implemented per-catalog or via a separate policy layer
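In practice, attaching an engine to a REST catalog is a handful of configuration keys rather than an integration project. The property names below follow the Iceberg Spark integration's `spark.sql.catalog.<name>.*` convention; the catalog name, URI, and warehouse path are placeholders for your deployment.

```python
# Spark configuration for attaching an Iceberg REST catalog named "lake".
# Property keys follow the Iceberg Spark integration; the URI and
# warehouse values are placeholders.
rest_catalog_conf = {
    "spark.sql.catalog.lake": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.lake.type": "rest",
    "spark.sql.catalog.lake.uri": "https://catalog.example.com/api/catalog",
    "spark.sql.catalog.lake.warehouse": "s3://my-bucket/warehouse",
}

# In a real job these would be applied to the session builder, e.g.:
#   builder = SparkSession.builder
#   for k, v in rest_catalog_conf.items():
#       builder = builder.config(k, v)
# ...after which tables resolve as: SELECT * FROM lake.sales.orders
print(rest_catalog_conf["spark.sql.catalog.lake.type"])  # rest
```

The same shape repeats across engines: Trino, Flink, and Dremio each have an equivalent "type = rest, uri = ..." stanza, which is what makes the catalog swappable.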
Pattern 3: Metadata Mesh (Decentralized with Contracts)
Each domain owns its catalog and publishes metadata contracts — schemas, SLAs, ownership, and quality metrics — to a shared registry. There's no central catalog; instead, a lightweight discovery service indexes published contracts.
Examples: Custom implementations using DataHub, Apache Atlas, or OpenMetadata as the contract registry.
Best for: Large enterprises with strong domain ownership where centralized governance is politically or technically infeasible.
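A metadata contract in this pattern can be as small as a typed record that every domain must publish to the registry. The fields below are an illustrative minimum, not a formal standard — real implementations would encode this in the registry's own schema (DataHub aspects, OpenMetadata entities, etc.).

```python
from dataclasses import dataclass, field

# Illustrative metadata contract a domain publishes to the shared
# registry. Field names are an example minimum, not a formal standard.
@dataclass
class MetadataContract:
    dataset: str                 # fully qualified, e.g. "payments.ledger"
    owner: str                   # accountable team or individual
    schema_version: str          # semantic version of the published schema
    freshness_sla_minutes: int   # maximum acceptable staleness
    quality_checks: list = field(default_factory=list)

def validate(contract: MetadataContract) -> list:
    """Return a list of contract violations (empty list = valid)."""
    problems = []
    if "." not in contract.dataset:
        problems.append("dataset must be fully qualified (domain.name)")
    if not contract.owner:
        problems.append("owner is required")
    if contract.freshness_sla_minutes <= 0:
        problems.append("freshness SLA must be positive")
    return problems

c = MetadataContract("payments.ledger", "payments-team", "2.1.0", 60)
print(validate(c))  # []
```

The discovery service then only has to index validated contracts — it never needs write access to any domain's catalog, which is what keeps the pattern politically viable.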
Catalog Comparison: Federation Capabilities
| Capability | Unity Catalog | AWS Glue | Dataplex (GCP) | Purview (Azure) | Apache Polaris | Apache Gravitino |
|---|---|---|---|---|---|---|
| Iceberg REST API | ✅ Native | ❌ | ❌ | ❌ | ✅ Native | ✅ Native |
| Cross-cloud federation | ✅ Lakehouse Fed. | ❌ AWS only | ❌ GCP only | 🟡 Azure + Fabric | ✅ Cloud-agnostic | ✅ Cloud-agnostic |
| Foreign catalog support | ✅ Glue, HMS, Snowflake | ❌ | 🟡 BigQuery + GCS | 🟡 Azure sources | N/A (is the catalog) | ✅ Multiple backends |
| Open source | 🟡 Partial (OSS version) | ❌ | ❌ | ❌ | ✅ Apache 2.0 | ✅ Apache 2.0 |
| Governance built-in | ✅ RBAC, lineage | 🟡 IAM-based | ✅ Policy tags | ✅ Classifications | 🟡 Basic RBAC | 🟡 Basic |
| Multi-format (Delta + Iceberg) | ✅ | ❌ Delta | ❌ | ❌ | ❌ Iceberg only | ✅ Multi-format |
| Managed service available | ✅ Databricks | ✅ AWS | ✅ GCP | ✅ Azure | ❌ Self-host | ❌ Self-host |
Last verified: April 2026
Decision Framework: When to Choose Which Pattern
Choosing a federation pattern isn't a technology decision — it's an organizational one. Here's the framework:
Choose Hub-and-Spoke When:
- You have a dominant analytics platform (e.g., 70%+ of queries go through Databricks or Snowflake)
- Centralized governance is a requirement (regulated industries, SOX compliance)
- You want fast time-to-value — one team can implement this without cross-org coordination
- You're willing to accept some vendor coupling at the metadata layer
Choose Protocol-Based Federation When:
- You're committed to open table formats (Iceberg, Delta) and don't want catalog lock-in
- You run multiple compute engines (Spark, Flink, Trino, Dremio) and need them all to see the same tables
- You have the engineering capacity to self-host and operate catalog infrastructure
- Engine interoperability matters more than unified governance
Choose Metadata Mesh When:
- You operate at very large scale (100+ data domains, multiple business units)
- Domain autonomy is non-negotiable — teams won't accept a centrally mandated catalog
- You have mature data engineering teams in each domain capable of maintaining their own catalog
- You're implementing data mesh and need catalog patterns that match your org structure
The Iceberg REST Catalog: The Federation Protocol That Changed Everything
The single biggest shift in catalog federation happened when the Apache Iceberg community standardized the REST Catalog API. Before this, every catalog spoke its own language. Now, a growing number of catalogs implement the same HTTP API, making cross-catalog interoperability a configuration change rather than an integration project.
Key Players in the Iceberg REST Ecosystem
Apache Polaris — Originally open-sourced by Snowflake, now an Apache project. It's a pure Iceberg catalog that implements the REST API specification. If your entire stack is Iceberg-native, Polaris is the most focused option. It doesn't try to be a governance platform — it's a catalog, full stop.
Apache Gravitino — Graduated to Apache Top-Level Project in May 2025. Gravitino takes a broader approach: it's a metadata federation layer that supports Iceberg, Hive, and multiple file formats. Think of it as a meta-catalog that can front multiple backend catalogs through a unified API. For organizations with heterogeneous storage, Gravitino is the most flexible open-source option.
Databricks Unity Catalog — The managed option. Unity Catalog natively exposes an Iceberg REST endpoint, meaning external engines (Trino, Flink, Dremio) can read Unity Catalog tables via standard Iceberg REST calls. Simultaneously, Lakehouse Federation lets Unity Catalog read from foreign catalogs (Glue, HMS, Snowflake Horizon). It's the most complete hub-and-spoke implementation available, but it ties your metadata control plane to Databricks.
Project Nessie — Adds Git-like branching and versioning to catalog operations. You can branch a catalog, experiment with schema changes, and merge back — useful for CI/CD workflows on data. Nessie implements the Iceberg REST API and is used by Dremio's lakehouse platform.
What the REST API Actually Standardizes
The Iceberg REST Catalog API defines endpoints for:
- Namespace management — creating, listing, and deleting namespaces (databases)
- Table operations — create, load, update, rename, drop tables
- Metadata retrieval — fetching current table metadata (schema, partitioning, snapshots)
- Commit protocol — atomic metadata updates with conflict detection
What it does not standardize: access control, lineage, data quality metrics, or business glossary terms. This is why protocol-based federation alone doesn't solve governance — you still need a policy layer on top.
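The endpoint shapes are simple enough to sketch. The paths below follow the published REST spec's `/v1/{prefix}/namespaces/...` layout, with multi-level namespaces joined by the `0x1F` unit separator (URL-encoded as `%1F`); the base URL and prefix are placeholders, and the actual HTTP calls are left as comments.

```python
# Helpers that build Iceberg REST Catalog API paths. Path shapes follow
# the public spec; the base URL and prefix are placeholders.
BASE = "https://catalog.example.com/v1"

def namespace_path(prefix: str, namespace: list) -> str:
    # Multi-level namespaces are joined with the 0x1F unit separator,
    # which URL-encodes to %1F.
    ns = "%1F".join(namespace)
    return f"{BASE}/{prefix}/namespaces/{ns}"

def tables_path(prefix: str, namespace: list) -> str:
    return namespace_path(prefix, namespace) + "/tables"

def table_path(prefix: str, namespace: list, table: str) -> str:
    return tables_path(prefix, namespace) + f"/{table}"

# A real client would then issue, e.g.:
#   GET  tables_path(...)   -> list tables in a namespace
#   GET  table_path(...)    -> load current table metadata
#   POST table_path(...)    -> commit an atomic metadata update
print(table_path("lake", ["sales", "eu"], "orders"))
# https://catalog.example.com/v1/lake/namespaces/sales%1Feu/tables/orders
```

Because every compliant catalog answers these same paths, switching catalogs really is a change of `BASE` — which is the configuration-not-integration claim made above.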
Real-World Architecture: Multi-Cloud with Federated Catalogs
Here's a realistic architecture for an organization running workloads across AWS and Azure with a GCP analytics sandbox:
[Diagram: multi-cloud federation. Unity Catalog as the hub across Azure, AWS, and GCP, with Gravitino as a federation layer for external engines]
① Unity Catalog serves as the primary hub, pulling metadata from Glue and BigQuery via Lakehouse Federation
② Gravitino provides an engine-agnostic REST endpoint for non-Databricks engines (EMR Spark, Trino)
③ Databricks workspaces query through Unity Catalog natively
④ Data remains in its original cloud — S3, ADLS, GCS — only metadata crosses boundaries
This hybrid approach combines hub-and-spoke (Unity Catalog as the governance center) with protocol-based federation (Gravitino for engine flexibility). It's pragmatic: you get centralized governance where compliance demands it, and open, engine-agnostic access everywhere else.
Governance in a Federated World
Federation without governance is just distributed chaos. Here's how to layer governance across federated catalogs:
Access Control Strategy
| Layer | Mechanism | Example |
|---|---|---|
| Cloud IAM | Cloud-native identity | AWS IAM roles, Azure AD, GCP IAM |
| Catalog RBAC | Table/schema-level permissions | Unity Catalog grants, Polaris RBAC |
| Column-level security | Masking and filtering | Dynamic views, Purview sensitivity labels |
| Row-level security | Predicate-based filtering | Unity Catalog row filters, BigQuery row policies |
| Cross-catalog policy | Centralized policy engine | Immuta, Privacera, or Open Policy Agent |
The key insight: in a federated model, you can't rely on a single catalog's RBAC. You need either a policy engine that sits above all catalogs (Immuta, Privacera) or you need to replicate policies across catalogs — which is its own maintenance burden.
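Whichever engine you choose, the shape of a cross-catalog policy layer is the same: one decision function, consulted by every catalog, evaluating one rule set. The sketch below is a toy stand-in for that idea — it is not the API of OPA, Immuta, or Privacera, and all names in it are invented.

```python
# Toy cross-catalog policy layer. Real deployments delegate this decision
# to a policy engine (e.g. Open Policy Agent); all names are illustrative.
POLICIES = [
    # Each rule: a role, a catalog pattern ("*" = any), and an action.
    {"role": "analyst",  "catalog": "*",    "action": "read"},
    {"role": "pipeline", "catalog": "glue", "action": "write"},
]

def is_allowed(role: str, catalog: str, action: str) -> bool:
    """Evaluate the same rule set regardless of which catalog asks."""
    for p in POLICIES:
        if (p["role"] == role and p["action"] == action
                and p["catalog"] in ("*", catalog)):
            return True
    return False

print(is_allowed("analyst", "purview", "read"))   # True
print(is_allowed("analyst", "purview", "write"))  # False
```

The alternative — replicating equivalent grants into each catalog's native RBAC — works at small scale, but every policy change then becomes an N-catalog deployment.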
Lineage Across Catalogs
Cross-catalog lineage is the hardest problem in federation. Within a single catalog (Unity Catalog, Purview), lineage is tracked automatically. Across catalogs, you need one of:
- OpenLineage — The emerging standard. Spark, Airflow, dbt, and Flink can emit OpenLineage events that a central collector (Marquez, Atlan, DataHub) aggregates into a unified lineage graph
- Manual registration — Teams document cross-catalog dependencies in a shared registry. This doesn't scale, but it's honest about the current state of tooling
- Vendor solutions — Atlan, Collibra, and Alation all offer multi-source lineage, but require connectors to each catalog. Pricing starts at $60k+/year for enterprise tiers
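An OpenLineage event is just a JSON document a job emits at run start and completion. The structure below follows the public spec's RunEvent shape (eventType, run, job, inputs, outputs); the namespaces, job name, and collector URL are placeholders, and the HTTP POST is left as a comment since collectors vary.

```python
import json
import uuid
from datetime import datetime, timezone

# Minimal OpenLineage run event. Field names follow the public spec;
# namespaces, job name, and the collector URL below are placeholders.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "producer": "https://example.com/my-pipeline",  # placeholder
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "aws-domain", "name": "orders_daily_load"},
    # Cross-catalog lineage: input lives in Glue, output in Unity Catalog
    "inputs": [{"namespace": "glue", "name": "sales.orders_raw"}],
    "outputs": [{"namespace": "unity", "name": "analytics.orders"}],
}

payload = json.dumps(event)
# A collector such as Marquez accepts these events over HTTP, e.g.:
#   requests.post("http://marquez.example.com/api/v1/lineage",
#                 data=payload,
#                 headers={"Content-Type": "application/json"})
print(event["job"]["name"])  # orders_daily_load
```

Because the input and output datasets reference different catalog namespaces, the collector can stitch a lineage edge that neither catalog could see on its own — which is exactly the cross-catalog gap described above.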
Common Pitfalls and How to Avoid Them
Pitfall 1: Treating federation as a technology problem. Federation is an organizational pattern. If teams don't agree on naming conventions, ownership, and metadata standards, no technology will save you. Start with a metadata contract: what fields are required, who owns what, what SLAs apply.
Pitfall 2: Over-federating too early. If you're on one cloud with one analytics platform, you don't need federation — you need a catalog. Federation adds complexity that's only justified when you genuinely have multiple catalogs that must coexist.
Pitfall 3: Ignoring staleness. Federated metadata is eventually consistent. When catalog A updates a schema and catalog B still shows the old version, users lose trust. Define your staleness SLA (minutes? hours?) and monitor it.
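Monitoring that staleness SLA can start as a simple comparison of each spoke's last-sync timestamp against the budget. A minimal sketch, with invented field names — each catalog exposes its sync state differently:

```python
from datetime import datetime, timedelta, timezone

# Sketch of a staleness check across federated catalogs. The
# "last synced" timestamps are illustrative; each catalog exposes
# sync state through its own API.
def stale_catalogs(sync_times: dict, sla: timedelta, now=None) -> list:
    """Return the catalogs whose last metadata sync exceeds the SLA."""
    now = now or datetime.now(timezone.utc)
    return sorted(name for name, t in sync_times.items() if now - t > sla)

now = datetime(2026, 4, 1, 12, 0, tzinfo=timezone.utc)
sync_times = {
    "glue": now - timedelta(minutes=5),
    "purview": now - timedelta(hours=3),  # breaches a 1-hour SLA
}
print(stale_catalogs(sync_times, timedelta(hours=1), now=now))  # ['purview']
```

Alerting on this list turns staleness from a silent trust-eroder into an operational metric with an owner.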
Pitfall 4: Assuming Iceberg REST solves governance. The Iceberg REST API is a table management protocol, not a governance framework. You still need access control, lineage, and data quality — the REST API doesn't provide these.
Pitfall 5: Choosing open-source catalogs without operational readiness. Polaris and Gravitino are powerful but require self-hosting. If your team doesn't have the capacity to run, monitor, and upgrade a catalog service, a managed option (Unity Catalog, Glue, Purview) is the safer bet — even with the vendor coupling trade-off.
What's Coming Next
The catalog federation landscape is moving fast. Several trends to watch:
- Microsoft OneLake + Iceberg REST — Microsoft announced support for Iceberg REST and Unity Catalog Open APIs in OneLake, which means Fabric data could become accessible to any Iceberg-compatible engine. This blurs the line between Azure-native and open catalog approaches
- Gravitino as the meta-catalog — With Apache TLP status (May 2025), Gravitino is positioned to become the default open-source federation layer for heterogeneous environments
- OpenLineage adoption — As more engines emit OpenLineage events natively, cross-catalog lineage goes from aspirational to practical
- AI-driven metadata management — Catalogs are adding LLM-powered search, auto-tagging, and schema suggestion. This matters for federation because it can automatically reconcile metadata across catalogs with different naming conventions
If you're exploring how data scattered across APIs, CSVs, and cloud sources fits into a unified view, Harbinger Explorer offers a lightweight starting point. Its browser-based DuckDB engine lets you query multiple data sources with natural language — no catalog infrastructure required. It won't replace Unity Catalog for enterprise governance, but for rapid exploration across sources, it removes the setup friction.
Your Next Move
Start with an honest inventory: how many catalogs do you actually have today? Count Glue, HMS instances, Purview, Dataplex, and any internal metadata stores. If the answer is one, invest in that catalog. If it's three or more, federation isn't optional — it's already happening, just without a strategy.
Pick the pattern that matches your organization, not your technology preferences. Hub-and-spoke for centralized teams, protocol-based for multi-engine shops, metadata mesh for large federated organizations. Then implement incrementally — connect two catalogs first, prove the pattern, and expand.
The goal isn't one catalog to rule them all. It's one way to find anything, everywhere.