
Data Observability Explained: Freshness, Volume, Schema, Lineage

The dashboard shows $0 in revenue for the last 48 hours. The pipeline didn't fail. No alerts fired. A WHERE clause silently started dropping all rows due to a null value introduced upstream. Data observability exists to catch exactly this kind of failure.

Observability in software engineering means you can understand the internal state of a system from its external outputs. Applied to data, it means you can answer: Is my data complete? Fresh? Correct? Did anything change unexpectedly?

The Five Pillars of Data Observability

Most frameworks for data observability converge on five core dimensions:

| Pillar | Question it answers | Example failure it catches |
| --- | --- | --- |
| Freshness | Was the data updated recently? | Pipeline ran but loaded 0 rows; silent delay |
| Volume | Is there the expected amount of data? | Table lost 30% of rows overnight |
| Schema | Did the structure change unexpectedly? | Column renamed, type changed, column dropped |
| Distribution | Are values statistically normal? | Null rate jumped from 0.1% to 40% |
| Lineage | Where does data come from and what depends on it? | Upstream table change broke 12 downstream models |

Data testing (dbt tests, Great Expectations) covers specific assertions you define in advance. Data observability goes further — it monitors continuously and detects anomalies you didn't anticipate.


Freshness Monitoring

Freshness is the simplest pillar to implement and often the most valuable. If a table isn't updated when expected, something upstream broke.

-- PostgreSQL: freshness check as a scheduled query
-- Alert if the orders table hasn't been updated in the last 2 hours
SELECT
    CASE
        WHEN MAX(updated_at) < NOW() - INTERVAL '2 hours'
        THEN 'STALE'
        ELSE 'FRESH'
    END AS freshness_status,
    MAX(updated_at) AS last_update,
    NOW() - MAX(updated_at) AS data_age
FROM warehouse.orders;

In dbt, freshness is configured at the source level:

# dbt: sources.yml with freshness SLA
version: 2

sources:
  - name: raw
    schema: raw_data
    tables:
      - name: orders
        freshness:
          warn_after:
            count: 1
            period: hour
          error_after:
            count: 3
            period: hour
        loaded_at_field: updated_at  # column to check

Running dbt source freshness compares the latest updated_at against your SLA thresholds and raises warnings or errors accordingly.

Common freshness anti-pattern: Checking freshness only at the source. A pipeline can run successfully but produce an empty target if a filter or join kills all rows. Check freshness at the target table, not just the source.
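
One lightweight way to honor that advice is to run the same staleness query against every table you care about, sources and targets alike. A minimal sketch, assuming a small dict of tables and SLA intervals (the names and intervals here are illustrative):

# Python: freshness checks for source *and* target tables (sketch)
import sqlalchemy as sa

engine = sa.create_engine("postgresql://...")

FRESHNESS_SLAS = {
    "raw_data.orders": "2 hours",      # source layer
    "warehouse.orders": "3 hours",     # target layer, checked too
    "warehouse.fct_orders": "4 hours",
}

def check_freshness(table: str, max_age: str) -> dict:
    # An empty table or a missing load both count as stale.
    with engine.connect() as conn:
        row = conn.execute(sa.text(
            f"SELECT COALESCE(MAX(updated_at) < NOW() - INTERVAL '{max_age}', TRUE) AS stale,"
            f" NOW() - MAX(updated_at) AS data_age"
            f" FROM {table}"
        )).fetchone()
    return {"table": table, "stale": row.stale, "age": row.data_age}

for table, sla in FRESHNESS_SLAS.items():
    status = check_freshness(table, sla)
    if status["stale"]:
        print(f"STALE: {status['table']} (last update {status['age']} ago)")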


Volume Monitoring

Volume monitoring detects unexpected drops or spikes in row counts. A sudden 0-row load almost never means the source had no data — it usually means something broke.

# Python: volume anomaly detection using historical average
import sqlalchemy as sa

engine = sa.create_engine("postgresql://...")

def check_volume_anomaly(table: str, date: str, threshold: float = 0.3) -> dict:
    # Compare today's row count to the 7-day average.
    # Flag if deviation exceeds threshold (default 30%).
    with engine.connect() as conn:
        result = conn.execute(sa.text(f"""
            WITH daily_counts AS (
                SELECT DATE(loaded_at) AS load_date, COUNT(*) AS row_count
                FROM {table}
                WHERE loaded_at >= NOW() - INTERVAL '8 days'
                GROUP BY DATE(loaded_at)
            ),
            baseline AS (
                SELECT AVG(row_count) AS avg_count
                FROM daily_counts
                WHERE load_date < :check_date
            ),
            today AS (
                SELECT row_count AS today_count
                FROM daily_counts
                WHERE load_date = :check_date
            )
            SELECT today.today_count,
                   baseline.avg_count,
                   ABS(today.today_count - baseline.avg_count)
                       / NULLIF(baseline.avg_count, 0) AS deviation
            FROM today, baseline
        """), {"check_date": date}).fetchone()

    deviation = float(result.deviation or 0)
    return {
        "table": table,
        "today_count": result.today_count,
        "avg_count": result.avg_count,
        "deviation_pct": round(deviation * 100, 1),
        "anomaly": deviation > threshold
    }

check = check_volume_anomaly("warehouse.orders", "2024-03-15")
if check["anomaly"]:
    print(f"Volume anomaly: {check['deviation_pct']}% deviation from 7-day average")

Volume monitoring works best as a relative check (deviation from historical baseline), not an absolute threshold. Your order volume on Black Friday is 10x a normal Tuesday. A fixed threshold would fire every peak day.
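
If your volumes have strong weekly seasonality, one option is to baseline against the same weekday in previous weeks instead of a flat trailing average. A sketch along the lines of check_volume_anomaly above, with an arbitrary four-week lookback:

# Python: volume check against a same-weekday baseline (sketch)
import sqlalchemy as sa

engine = sa.create_engine("postgresql://...")

def check_volume_same_weekday(table: str, date: str, threshold: float = 0.3) -> bool:
    # Compare the given day's row count to the average for the same
    # weekday over the previous four weeks, so that weekly peaks
    # (a busy Friday vs. a quiet Tuesday) don't trigger false alarms.
    with engine.connect() as conn:
        row = conn.execute(sa.text(f"""
            WITH daily_counts AS (
                SELECT DATE(loaded_at) AS load_date, COUNT(*) AS row_count
                FROM {table}
                WHERE loaded_at >= CAST(:check_date AS date) - INTERVAL '28 days'
                GROUP BY DATE(loaded_at)
            )
            SELECT
                (SELECT row_count FROM daily_counts
                 WHERE load_date = :check_date) AS today_count,
                (SELECT AVG(row_count) FROM daily_counts
                 WHERE load_date < :check_date
                   AND EXTRACT(DOW FROM load_date) =
                       EXTRACT(DOW FROM CAST(:check_date AS date))) AS weekday_avg
        """), {"check_date": date}).fetchone()

    if row.today_count is None or not row.weekday_avg:
        return True  # no data at all is an anomaly in itself
    deviation = abs(row.today_count - float(row.weekday_avg)) / float(row.weekday_avg)
    return deviation > threshold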


Schema Monitoring

Schema changes are a silent killer. A column gets renamed in the source system. A new NOT NULL constraint gets added. A type changes from VARCHAR to INT. None of these cause an error in the pipeline — they just corrupt downstream outputs.

# Python: schema snapshot comparison
import json
import sqlalchemy as sa

def snapshot_schema(engine, table: str) -> dict:
    # Capture current schema as a dict for comparison
    with engine.connect() as conn:
        columns = conn.execute(sa.text(
            "SELECT column_name, data_type, is_nullable, character_maximum_length"
            " FROM information_schema.columns"
            " WHERE table_name = :table"
            " ORDER BY ordinal_position"
        ), {"table": table}).fetchall()
    return {row.column_name: dict(row._mapping) for row in columns}

def compare_schemas(old: dict, new: dict) -> list:
    # Return list of detected schema changes
    changes = []
    for col in old:
        if col not in new:
            changes.append(f"DROPPED: column '{col}'")
        elif old[col]["data_type"] != new[col]["data_type"]:
            changes.append(
                f"TYPE CHANGE: '{col}' {old[col]['data_type']} -> {new[col]['data_type']}"
            )
    for col in new:
        if col not in old:
            changes.append(f"ADDED: column '{col}'")
    return changes

# Usage: compare yesterday's snapshot to today's
engine = sa.create_engine("postgresql://...")
old_schema = json.load(open("schema_snapshots/orders_yesterday.json"))
new_schema = snapshot_schema(engine, "orders")

changes = compare_schemas(old_schema, new_schema)
if changes:
    print("Schema changes detected:")
    for c in changes:
        print(f"  - {c}")

Store schema snapshots as files or in a metadata table. Run comparisons on every pipeline run. Alert on any unexpected change — expected changes (new columns added intentionally) can be acknowledged and suppressed.
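
One way to wire snapshot_schema and compare_schemas into a scheduled run is sketched below: keep one dated snapshot file per table and a small allow-list of changes someone has already acknowledged. The paths and allow-list entries are illustrative.

# Python: scheduled schema check with acknowledged-change suppression (sketch)
import json
from datetime import date
from pathlib import Path

SNAPSHOT_DIR = Path("schema_snapshots")

# Changes a human has reviewed and accepted; don't alert on these again.
ACKNOWLEDGED = {
    "ADDED: column 'discount_code'",   # illustrative entry
}

def run_schema_check(engine, table: str) -> list:
    # Compare the most recent stored snapshot to the live schema,
    # write today's snapshot, and return only unacknowledged changes.
    current = snapshot_schema(engine, table)          # defined above
    previous_files = sorted(SNAPSHOT_DIR.glob(f"{table}_*.json"))

    changes = []
    if previous_files:
        previous = json.loads(previous_files[-1].read_text())
        changes = [c for c in compare_schemas(previous, current)
                   if c not in ACKNOWLEDGED]

    SNAPSHOT_DIR.mkdir(exist_ok=True)
    snapshot_path = SNAPSHOT_DIR / f"{table}_{date.today().isoformat()}.json"
    snapshot_path.write_text(json.dumps(current, indent=2, default=str))
    return changes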


Distribution Monitoring

Distribution monitoring tracks statistical properties of column values over time (null rates, distinct counts, min/max values, mean and median) and alerts when they deviate significantly from historical norms.

-- PostgreSQL: distribution profile for a column
-- Run this on each pipeline execution and store results
SELECT
    'orders' AS table_name,
    'amount' AS column_name,
    NOW() AS profiled_at,
    COUNT(*) AS total_rows,
    COUNT(amount) AS non_null_count,
    COUNT(*) - COUNT(amount) AS null_count,
    ROUND(100.0 * (COUNT(*) - COUNT(amount)) / COUNT(*), 2) AS null_pct,
    MIN(amount) AS min_val,
    MAX(amount) AS max_val,
    AVG(amount) AS mean_val,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median_val,
    STDDEV(amount) AS stddev_val
FROM warehouse.orders
WHERE order_date = CURRENT_DATE;

The power comes from storing these profiles over time and comparing each run's statistics to the historical baseline. A null rate that jumps from 0.2% to 15% is a strong signal that something upstream changed — even if the row count looks normal.
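
The comparison itself can be simple. A sketch, assuming the profiles from the query above land in a table called warehouse.column_profiles (the table name and the 30-run window are illustrative): flag any run whose null rate sits several standard deviations away from its trailing mean.

# Python: null-rate drift check against stored profiles (sketch)
import sqlalchemy as sa

engine = sa.create_engine("postgresql://...")

def null_rate_anomaly(table: str, column: str, z_threshold: float = 3.0) -> bool:
    # Compare the latest null_pct to the mean and stddev of the
    # previous 30 profile runs; flag if it is more than z_threshold
    # standard deviations away.
    with engine.connect() as conn:
        row = conn.execute(sa.text("""
            WITH history AS (
                SELECT null_pct,
                       ROW_NUMBER() OVER (ORDER BY profiled_at DESC) AS rn
                FROM warehouse.column_profiles
                WHERE table_name = :t AND column_name = :c
            )
            SELECT
                (SELECT null_pct FROM history WHERE rn = 1) AS latest,
                (SELECT AVG(null_pct) FROM history WHERE rn BETWEEN 2 AND 31) AS mean,
                (SELECT STDDEV(null_pct) FROM history WHERE rn BETWEEN 2 AND 31) AS stddev
        """), {"t": table, "c": column}).fetchone()

    if row.latest is None or row.mean is None or not row.stddev:
        return False  # not enough history to judge yet
    z_score = abs(float(row.latest) - float(row.mean)) / float(row.stddev)
    return z_score > z_threshold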


Lineage Monitoring

Data lineage answers two questions: Where did this data come from? and What depends on this data?

When an upstream column changes, lineage tells you which downstream models, reports, and dashboards are affected. Without it, you're debugging blind.

[Lineage diagram: raw.orders and raw.customers → stg_orders and stg_customers → fct_orders → report_executive_kpis]

① raw.orders and raw.customers are source tables — changes here propagate downstream. ② stg_orders and stg_customers are the first transformation layer. ③ fct_orders joins both staging tables — a breaking change in either source affects this. ④ report_executive_kpis depends on three upstream models — it's the most fragile endpoint.

dbt generates lineage automatically from ref() and source() calls. For non-dbt pipelines, tools like OpenLineage, Marquez, or DataHub provide lineage capture via instrumentation.
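
If you are on dbt, the graph is already on disk after a compile or run, in target/manifest.json. A minimal sketch of impact analysis using the manifest's child_map; the project and model names here are illustrative:

# Python: downstream impact analysis from dbt's manifest (sketch)
import json

def downstream_of(manifest_path: str, node_id: str) -> set:
    # child_map maps each node's unique_id to its direct dependents;
    # walk it transitively to get the full blast radius.
    with open(manifest_path) as f:
        child_map = json.load(f)["child_map"]

    affected, stack = set(), [node_id]
    while stack:
        current = stack.pop()
        for child in child_map.get(current, []):
            if child not in affected:
                affected.add(child)
                stack.append(child)
    return affected

# Which models, tests, and exposures break if stg_orders changes?
impacted = downstream_of("target/manifest.json", "model.my_project.stg_orders")
for node in sorted(impacted):
    print(node)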


Tooling Landscape

| Tool | Coverage | Deployment |
| --- | --- | --- |
| Monte Carlo | All five pillars | Hosted SaaS |
| Soda Cloud | Monitoring pillars, partial lineage | Hosted SaaS |
| dbt + Elementary | Freshness and lineage via dbt, profiling and anomaly detection via Elementary | Runs in your dbt project |
| Great Expectations | Assertion-based testing, partial monitoring coverage | Self-hosted |
| OpenLineage/Marquez | Lineage only | Self-hosted |
| Re:data | Partial coverage of the monitoring pillars | Self-hosted |

For most teams starting out: dbt with Elementary covers all five pillars without a separate tool. For larger teams with multi-engine data stacks: Monte Carlo or a similar dedicated platform is worth the investment.


Building a Minimal Observability Stack

If you're starting from scratch and want coverage without tool sprawl:

  1. Freshness → dbt source freshness (free, built-in)
  2. Volume → dbt test row_count > 0 + custom singular tests with deviation logic
  3. Schema → dbt schema tests (not_null, accepted_values) + schema snapshot comparison script
  4. Distribution → Elementary (open source dbt package) profiles columns automatically
  5. Lineage → dbt docs generate (free, built from ref() calls)

This stack requires no additional infrastructure, costs nothing, and covers the most common failure modes.


Wrapping Up

Data observability is what happens when you stop treating data pipelines as batch jobs that either succeed or fail, and start treating them as systems with internal state you need to understand continuously. Freshness, volume, schema, distribution, and lineage together give you that understanding.

The hardest part isn't building the monitors — it's building the discipline to act on them. An alert that nobody investigates is just noise. Define owners, set SLAs, and treat data quality incidents with the same urgency as production outages.

