Data Lineage Tracking: Why It Matters and How to Implement It

9 min read · Tags: data lineage, openlineage, marquez, unity catalog, data governance, data engineering, observability

A KPI dashboard shows revenue down 18% for the previous quarter. Finance is alarmed. You need to know: is this real, or is a pipeline broken? Without data lineage, you're manually tracing SQL through four transformation layers, three source systems, and a dbt project with 200 models. With lineage, you click twice and find the broken upstream join.

Data lineage — the ability to trace data from its source to its destination — is the difference between a data platform that people trust and one that generates endless "is this data correct?" Slack threads.


What Data Lineage Actually Is

Data lineage answers three questions:

  1. Origin: Where did this data come from? Which source system, which table, which field?
  2. Transformation: What happened to it? Which pipelines, queries, and models touched it?
  3. Impact: Who consumes it? Which dashboards, models, and downstream pipelines depend on this table?

Lineage exists at two granularities:

Level                   Description                                        Example
Table/Dataset lineage   Which tables feed which tables                     orders_raw -> orders_cleaned -> fact_orders
Column-level lineage    Which source columns map to which output columns   orders.gross_amount -> fact_orders.revenue

Column-level lineage is significantly harder to capture — most tools do table-level well, column-level inconsistently.


Why Lineage Matters

1. Incident Response

When a metric breaks, lineage lets you trace upstream to find the root cause. Without it, you're grep-ing through SQL files and asking Slack who owns the dim_customer table.

2. Impact Analysis

Before dropping a deprecated column or changing a transformation, you need to know who's downstream. Lineage gives you that dependency graph automatically.

3. Regulatory Compliance

GDPR, CCPA, HIPAA, and SOX all have some variant of "prove that you know where personal data lives and flows." Data lineage is a core audit evidence artifact.

4. Data Quality Debugging

Bad data quality can originate three layers upstream. Lineage lets you trace quality issues to their source — not just catch them at the surface.


The OpenLineage Standard

OpenLineage is an open standard (hosted by the LF AI & Data Foundation) for capturing lineage events from data pipelines. Instead of each tool having its own lineage API, OpenLineage defines a common JSON schema for "a job ran and consumed/produced these datasets."

{
  "eventType": "COMPLETE",
  "eventTime": "2024-03-15T14:30:00Z",
  "producer": "https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow",
  "run": {
    "runId": "d46e465b-d358-4d32-83d4-df660ff614dd"
  },
  "job": {
    "namespace": "my_airflow",
    "name": "orders_pipeline.transform_orders"
  },
  "inputs": [
    {
      "namespace": "postgres://prod-db",
      "name": "public.orders_raw",
      "facets": {
        "schema": {
          "fields": [
            {"name": "order_id", "type": "VARCHAR"},
            {"name": "amount",   "type": "NUMERIC"}
          ]
        }
      }
    }
  ],
  "outputs": [
    {
      "namespace": "s3://datalake",
      "name": "orders/cleaned",
      "facets": {
        "schema": {
          "fields": [
            {"name": "order_id",   "type": "VARCHAR"},
            {"name": "net_amount", "type": "DOUBLE"}
          ]
        }
      }
    }
  ]
}
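Facets are the extensible part of the schema: where a tool can capture column-level mappings, they travel in the columnLineage dataset facet. A minimal sketch of its shape, built as a Python dict here using the article's earlier example columns (layout follows the OpenLineage ColumnLineageDatasetFacet; the namespace and table names are illustrative):

```python
# Sketch of an OpenLineage columnLineage facet for an output dataset:
# it records that fact_orders.revenue derives from orders.gross_amount.
column_lineage_facet = {
    "columnLineage": {
        "fields": {
            "revenue": {  # output column on fact_orders
                "inputFields": [
                    {
                        "namespace": "postgres://prod-db",
                        "name": "public.orders",
                        "field": "gross_amount",  # source column it derives from
                    }
                ]
            }
        }
    }
}
```

Each output column lists the input fields it was computed from, which is exactly the mapping a GDPR data-subject request or a quality investigation needs.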

OpenLineage events are emitted by integrations — libraries or plugins that hook into your existing tools (Airflow, Spark, dbt, Flink) and emit events automatically.

# Python — OpenLineage client: emit a custom lineage event
from openlineage.client import OpenLineageClient
from openlineage.client.run import RunEvent, RunState, Run, Job, Dataset
from openlineage.client.facet import SchemaDatasetFacet, SchemaField
import datetime
import uuid

client = OpenLineageClient.from_environment()  # reads OPENLINEAGE_URL + OPENLINEAGE_API_KEY

run_id = str(uuid.uuid4())

event = RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.datetime.now(datetime.timezone.utc).isoformat(),
    run=Run(runId=run_id),
    job=Job(namespace="custom_pipeline", name="enrich_customers"),
    inputs=[
        Dataset(
            namespace="postgres://prod",
            name="public.customers",
            facets={"schema": SchemaDatasetFacet(fields=[
                SchemaField("customer_id", "VARCHAR"),
                SchemaField("email", "VARCHAR"),
            ])}
        )
    ],
    outputs=[
        Dataset(
            namespace="s3://datalake",
            name="customers/enriched",
            facets={"schema": SchemaDatasetFacet(fields=[
                SchemaField("customer_id", "VARCHAR"),
                SchemaField("email", "VARCHAR"),
                SchemaField("geo_country", "VARCHAR"),  # enriched column
            ])}
        )
    ],
    producer="custom_pipeline/v1.0"
)

client.emit(event)

Tools Compared

Marquez

Marquez is the open-source reference implementation for consuming OpenLineage events. It's a metadata server with a UI that visualizes the lineage graph.

Strengths:

  • Open source, self-hosted
  • Native OpenLineage consumer
  • Simple REST API for querying lineage
  • Good for teams that want full control over their lineage backend

Weaknesses:

  • UI is functional, not polished
  • No enterprise features (RBAC, SSO) out of the box
  • Requires operational maintenance

Marquez is the right choice for teams fully committed to open source and willing to self-host.

Unity Catalog (Databricks)

Unity Catalog's System Tables include automatic lineage capture for Databricks workloads. Any query run against Unity Catalog-managed tables is automatically tracked — no instrumentation required.

-- Spark SQL / Databricks SQL — query Unity Catalog lineage system tables
SELECT
    source_table_full_name,
    target_table_full_name,
    created_by,
    event_time
FROM system.access.table_lineage
WHERE target_table_full_name = 'main.analytics.fact_orders'
ORDER BY event_time DESC
LIMIT 20;

Strengths:

  • Zero-instrumentation — works automatically for all Databricks SQL and Spark workloads
  • Column-level lineage available (in newer releases)
  • Integrated with data governance, access control, and catalog search
  • Enterprise-grade (SSO, RBAC, audit logs)

Weaknesses:

  • Databricks-only — doesn't capture lineage from external tools
  • Cost (Unity Catalog is a paid Databricks feature)
  • Vendor lock-in

Unity Catalog lineage is the obvious choice if you're already on Databricks and don't need to track pipelines outside the platform.

dbt Lineage

dbt generates a lineage DAG automatically from ref() and source() calls in your SQL models. It's limited to dbt-managed transformations, but within that scope it's excellent.

# dbt — model dependencies are inferred from SQL refs
# models/marts/fact_orders.sql uses:
# SELECT * FROM {{ ref('stg_orders') }}
# JOIN {{ ref('dim_customers') }} ON ...

# dbt docs generate -> lineage graph available in dbt docs serve

The dbt docs generate command creates an interactive lineage graph in the dbt Docs UI. Combined with dbt artifacts pushed to your catalog, this gives good SQL-layer lineage.
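The same DAG is also machine-readable: every dbt compile/run writes target/manifest.json, whose parent_map and child_map keys hold the node-level dependency graph. A minimal sketch of reading it (the path and node ids are illustrative):

```python
def upstream_models(manifest: dict, node_id: str) -> list:
    """Direct upstream dbt nodes (models, sources, seeds) for one node."""
    # "parent_map" maps each node id to the node ids it depends on;
    # "child_map" holds the reverse direction, useful for impact analysis
    return manifest.get("parent_map", {}).get(node_id, [])

# Usage (illustrative path and node id):
# import json
# with open("target/manifest.json") as f:
#     manifest = json.load(f)
# upstream_models(manifest, "model.my_project.fact_orders")
```

This is handy for CI checks, e.g. failing a pull request that drops a model which still has children in child_map.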

Tool Comparison

Tool              Scope                            Column Lineage   Self-Hosted   Cost
Marquez           OpenLineage-emitting pipelines   Partial          Required      Free
Unity Catalog     Databricks workloads             Yes              SaaS          Databricks pricing
dbt lineage       dbt models only                  No               Optional      Free (OSS dbt)
Apache Atlas      Hadoop ecosystem                 Partial          Required      Free
DataHub           Broad (many integrations)        Partial          Optional      Free / Enterprise
Atlan / Alation   Enterprise catalog               Yes              SaaS          Paid

Implementing Lineage in Practice

Step 1: Choose Your Backend

For most teams, the choice is:

  • Databricks users: Enable Unity Catalog — lineage is free for Spark/SQL workloads
  • Multi-platform teams: OpenLineage + Marquez (or DataHub as a more mature alternative)
  • dbt-heavy teams: Start with dbt docs; add OpenLineage when you need cross-system lineage

Step 2: Instrument Your Pipelines

Airflow, Spark, and dbt all have OpenLineage integrations that emit events automatically:

# Airflow — enable OpenLineage
# (openlineage-airflow for Airflow < 2.7; apache-airflow-providers-openlineage for 2.7+)
# In airflow.cfg or environment variables:
# OPENLINEAGE_URL=http://marquez:5000
# OPENLINEAGE_NAMESPACE=production_airflow
# Once installed, the integration auto-instruments all DAG runs

For Spark, the openlineage-spark library instruments SparkContext events:

# PySpark — OpenLineage Spark integration
# Requires the openlineage-spark jar on the classpath, e.g. via
# .config("spark.jars.packages", "io.openlineage:openlineage-spark_2.12:<version>")
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("my_pipeline") \
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener") \
    .config("spark.openlineage.transport.type", "http") \
    .config("spark.openlineage.transport.url", "http://marquez:5000") \
    .config("spark.openlineage.namespace", "production") \
    .getOrCreate()
# All DataFrame reads/writes now emit lineage events automatically

Step 3: Query Lineage Programmatically

Once events are flowing, use the Marquez API or Unity Catalog system tables to query lineage for automated impact analysis:

# Python — query the Marquez REST API for a dataset's direct upstream nodes
import requests

def get_upstream_datasets(namespace: str, dataset_name: str):
    node_id = f"dataset:{namespace}:{dataset_name}"
    response = requests.get(
        "http://marquez:5000/api/v1/lineage",
        params={"nodeId": node_id},  # let requests URL-encode the namespace
    )
    response.raise_for_status()
    graph = response.json().get("graph", [])
    # Each graph node carries inEdges/outEdges; the origins of this node's
    # inEdges are its direct upstream jobs/datasets
    node = next((n for n in graph if n["id"] == node_id), None)
    return [edge["origin"] for edge in node.get("inEdges", [])] if node else []

upstream = get_upstream_datasets("s3://datalake", "fact_orders")
print(upstream)  # upstream node ids feeding fact_orders
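Impact analysis is the same query in the other direction. Assuming the Marquez graph payload is a list of nodes carrying inEdges/outEdges (each edge an origin/destination pair of node ids), the downstream walk is a pure function over that payload:

```python
def downstream_of(graph: list, node_id: str) -> list:
    """Direct downstream node ids (consuming jobs/datasets) from a
    Marquez lineage graph payload."""
    node = next((n for n in graph if n["id"] == node_id), None)
    if node is None:
        return []
    # outEdges lead from this node to whatever reads it
    return [edge["destination"] for edge in node.get("outEdges", [])]
```

Feed it the "graph" list from the same /api/v1/lineage response before dropping a column or table: anything this returns is a consumer you are about to break.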

Common Pitfalls

1. Lineage as an afterthought. Lineage is cheapest to add when pipelines are first built. Retrofitting it onto a 200-model dbt project or a legacy Airflow environment takes weeks. Plan for it upfront.

2. Table lineage without column lineage. Knowing that table_A feeds table_B is useful. Knowing which column in A maps to which column in B is what you actually need for GDPR data subject requests and quality debugging.

3. Treating lineage as an ETL problem only. Lineage needs to cover BI tools (Tableau, Looker), notebooks, and ML pipelines — not just ETL. A broken dashboard that has no lineage back to a source table is just as blind as a broken pipeline.

4. No lineage for ad-hoc queries. Most teams track scheduled pipeline lineage but ignore analyst queries. Unity Catalog and tools like DataHub can capture ad-hoc SQL lineage from connected BI tools.


The Bottom Line

Data lineage is infrastructure, not a nice-to-have. The cost of not having it is measured in hours of debugging time, failed audits, and broken trust in data. The cost of adding it — especially with OpenLineage auto-instrumentation — is lower than most teams expect.

Start with your existing tools: dbt docs for the transformation layer, Unity Catalog if you're on Databricks, or Marquez/DataHub for multi-platform environments. Column-level lineage can come later — table-level lineage alone will dramatically improve your incident response time.

Next step: Combine lineage with proactive quality monitoring — read Data Pipeline Monitoring: What to Track to close the loop between lineage and observability.

