# Data Contracts for Teams
Every data engineer has hit this wall: the upstream team changed a column type, dropped a field, or renamed a table — without telling anyone. Your pipeline failed silently at 3 AM, the dashboard showed zeros, and the business blamed data engineering. Data contracts exist to stop this from happening.
A data contract is a formal, versioned agreement between a data producer (the team that owns a table or event stream) and its consumers (the pipelines and applications that depend on it). Think of it as an API contract, but for data assets. The concept isn't new — service teams have used OpenAPI contracts for years — but its application to data pipelines is recent and still maturing.
## Why Teams Resist Contracts (and Why That's Wrong)
The usual objection is overhead. Teams worry about process, documentation, and slowing down development velocity. This reasoning is backwards. Undocumented schema changes cost more — in incident response time, debugging cycles, and eroded trust between teams. A contract forces the conversation before the breaking change lands, not after.
The real overhead isn't writing contracts. It's the lack of them.
## What a Data Contract Contains
A complete data contract specifies:
| Component | Description | Example |
|---|---|---|
| Schema | Field names, types, nullability | user_id: INT NOT NULL |
| Semantics | What each field actually means | "event_time is UTC wall-clock time, not server-local time" |
| SLA | Freshness and availability guarantees | "Updated within 15 min of event, 99.5% uptime" |
| Ownership | Who is responsible for the dataset | Team: Checkout Platform, Slack: #checkout-data |
| Versioning | How changes are communicated | Semver: breaking = major bump, additive = minor |
| Quality rules | Expectations consumers rely on | "amount > 0 always, currency is always ISO 4217" |
## Contract Formats in Practice
There is no single standard yet; three approaches dominate in practice.
### 1. YAML-Based Contracts (Open Data Contract Standard)
The Open Data Contract Standard (ODCS) defines a YAML schema for data contracts. It has been gaining traction among teams that want a lightweight, version-controlled approach without buying a platform.
```yaml
# ODCS-compatible data contract (simplified)
apiVersion: v2.3.0
kind: DataContract
uuid: "a8b2c3d4-1234-5678-abcd-ef0123456789"
datasetName: orders
version: "1.4.0"
status: active
description:
  purpose: "Order lifecycle events for analytics and downstream ML"
  usage: "Read-only. Do not rely on fields marked internal."
team: checkout-platform
owner: platform-data@company.com
schema:
  - name: orders
    physicalName: checkout.orders
    columns:
      - name: order_id
        logicalType: string
        physicalType: VARCHAR(36)
        required: true
        description: "UUID v4. Immutable after creation."
      - name: user_id
        logicalType: integer
        physicalType: BIGINT
        required: true
      - name: status
        logicalType: string
        physicalType: VARCHAR(20)
        required: true
        description: "Enum: pending, confirmed, shipped, delivered, cancelled"
      - name: amount_usd
        logicalType: number
        physicalType: DECIMAL(10,2)
        required: true
quality:
  - rule: "amount_usd >= 0"
    action: fail
sla:
  - property: freshness
    value: "15 minutes"
  - property: completeness
    value: "99.9%"
consumers:
  - team: analytics-platform
    contact: analytics@company.com
  - team: ml-platform
    contact: mlops@company.com
```
This file lives in the producer team's repository, versioned and reviewed like application code. Changes to it trigger notifications to registered consumers.
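To make that review step concrete, here is a minimal sketch of a CI-style check that validates a sample record against such a contract. It assumes the YAML has already been parsed; the dict shape and the `violations` helper are illustrative, not part of any ODCS tooling.

```python
# Hypothetical CI-style check: validate a record against a parsed contract.
# Column names mirror the example contract above; the structure of
# CONTRACT and the violations() helper are assumptions for illustration.

CONTRACT = {
    "columns": {
        "order_id":   {"type": str,   "required": True},
        "user_id":    {"type": int,   "required": True},
        "status":     {"type": str,   "required": True},
        "amount_usd": {"type": float, "required": True},
    },
    # Quality rules expressed as predicates over a record
    "quality": [lambda r: r["amount_usd"] >= 0],
}

def violations(record: dict, contract: dict) -> list[str]:
    """Return human-readable contract violations for one record."""
    problems = []
    for name, spec in contract["columns"].items():
        if name not in record:
            if spec["required"]:
                problems.append(f"missing required column: {name}")
            continue
        if not isinstance(record[name], spec["type"]):
            problems.append(f"wrong type for {name}: {type(record[name]).__name__}")
    for rule in contract["quality"]:
        if not rule(record):
            problems.append("quality rule failed")
    return problems

good = {"order_id": "a8b2", "user_id": 42, "status": "pending", "amount_usd": 19.99}
bad  = {"order_id": "a8b2", "user_id": "42", "status": "pending", "amount_usd": -5.0}

print(violations(good, CONTRACT))  # []
print(violations(bad, CONTRACT))   # type error on user_id + quality failure
```

Running a check like this on every PR against the contract file catches drift before any consumer sees it.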
### 2. Schema Registry (Confluent / Karapace)
For event-driven architectures on Kafka, Schema Registry enforces contracts at the protocol level. Producers register an Avro, Protobuf, or JSON Schema. Consumers decode messages using the registered schema. Compatibility rules are enforced by the registry — a producer literally cannot publish a breaking schema change without updating the contract version and passing the compatibility check first.
```bash
# Register a schema version in Confluent Schema Registry
curl -X POST http://schema-registry:8081/subjects/orders-value/versions \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d '{"schema": "{\"type\":\"record\",\"name\":\"Order\",\"fields\":[{\"name\":\"order_id\",\"type\":\"string\"},{\"name\":\"amount_usd\",\"type\":\"double\"},{\"name\":\"status\",\"type\":\"string\"}]}"}'

# Check compatibility before publishing a new schema version
curl -X POST http://schema-registry:8081/compatibility/subjects/orders-value/versions/latest \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d '{"schema": "<new schema JSON here>"}'
# Returns: {"is_compatible": true}
```
Schema Registry is the strongest form of contract enforcement available. The trade-off is that it's tightly coupled to the Kafka ecosystem and adds infrastructure complexity.
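The core of what the registry checks can be approximated in a few lines. Under BACKWARD compatibility, a consumer on the new schema must be able to read data written with the old one. This toy sketch checks the two rules teams trip over most; real Avro resolution also handles type promotion, unions, and aliases.

```python
# Toy approximation of a BACKWARD compatibility check, in the spirit of
# what a schema registry does for Avro. The dict shape for field specs
# is an assumption for illustration, not a registry API.

def is_backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """Maps of field name -> {"type": ..., optionally "default": ...}."""
    for name, spec in new_fields.items():
        if name not in old_fields:
            # Old data never wrote this field: the reader needs a default.
            if "default" not in spec:
                return False
        elif spec["type"] != old_fields[name]["type"]:
            # Type changed (toy rule: no promotions allowed).
            return False
    return True  # Fields removed in the new schema are simply ignored.

old = {"order_id": {"type": "string"}, "amount_usd": {"type": "double"}}
ok  = {**old, "currency": {"type": "string", "default": "USD"}}
bad = {**old, "currency": {"type": "string"}}  # no default: incompatible

print(is_backward_compatible(old, ok))   # True
print(is_backward_compatible(old, bad))  # False
```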
### 3. dbt Contracts (dbt Core 1.5+)
For teams using dbt, the `contract: enforced: true` setting turns the schema YAML file into a runtime-enforced contract. If the model output doesn't match the declared columns and types, the dbt run fails loudly before the data reaches consumers.
```yaml
# dbt schema.yml — model with enforced contract
models:
  - name: orders
    config:
      contract:
        enforced: true
    columns:
      - name: order_id
        data_type: varchar
        constraints:
          - type: not_null
          - type: unique
        description: "UUID v4. Immutable."
      - name: amount_usd
        data_type: numeric
        constraints:
          - type: not_null
      - name: status
        data_type: varchar
        constraints:
          - type: not_null
```
The dbt approach is ideal for the transformation layer — it prevents a model refactor from silently breaking downstream consumers without any additional tooling.
## What "Breaking" Actually Means
Teams frequently under-specify this, which leads to disputes. A breaking change is any change that can cause a previously working consumer to fail or produce incorrect results:
| Change | Breaking? | Notes |
|---|---|---|
| Remove a column | ✅ Yes | Always breaking |
| Rename a column | ✅ Yes | Always breaking |
| Change type (e.g., INT → VARCHAR) | ✅ Yes | Always breaking |
| Tighten nullability (nullable → NOT NULL) | ✅ Yes | Rejects previously valid rows |
| Add a new NOT NULL column without default | ✅ Yes | Breaks INSERT statements |
| Change enum values | ✅ Yes | Breaks CASE/IF logic downstream |
| Add a new nullable column | ❌ No | Safe for most consumers |
| Loosen nullability (NOT NULL → nullable) | ❌ No | Safe |
| Add index or constraint with no type change | ❌ No | Transparent to consumers |
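The table above is mechanical enough to automate. A sketch of a classifier over two column maps (the `{"type": ..., "nullable": ...}` spec shape is an assumption for illustration, not a standard format):

```python
# Classify a schema diff as breaking or safe, following the table above.
# The column-spec shape is hypothetical; adapt it to your contract format.

def breaking_changes(old: dict, new: dict) -> list[str]:
    """Return descriptions of breaking changes between two column maps."""
    breaks = []
    for name, spec in old.items():
        if name not in new:
            breaks.append(f"removed column: {name}")       # always breaking
            continue
        if new[name]["type"] != spec["type"]:
            breaks.append(f"type change on {name}")        # always breaking
        if spec["nullable"] and not new[name]["nullable"]:
            breaks.append(f"tightened nullability on {name}")
    for name, spec in new.items():
        # New NOT NULL column (a default would make this safe; omitted here)
        if name not in old and not spec["nullable"]:
            breaks.append(f"new NOT NULL column: {name}")
    return breaks  # empty list means the change is additive/safe

old = {"order_id": {"type": "VARCHAR", "nullable": False}}
new = {"order_id": {"type": "VARCHAR", "nullable": False},
       "coupon":   {"type": "VARCHAR", "nullable": True}}  # nullable add: safe
print(breaking_changes(old, new))  # []
```

Wired into CI on the contract file, a check like this turns the "is this breaking?" dispute into a failing build.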
## The Producer/Consumer Protocol
A contract without a process is just documentation. Here's a minimal workflow that holds up in practice:
For producers (schema change protocol):
- Any change that could break consumers requires a contract version bump before deployment
- Breaking changes require advance notice — define a lead time (e.g., two sprints) in the contract
- Additive changes (new nullable field) require a minor version bump and a consumer notification
- The contract lives in the producer's repo; PRs against it trigger notifications to all registered consumers
For consumers:
- Register as a consumer in the contract file — this is how producers know who to notify
- Pin your pipeline configuration to a specific contract version
- Opt into version upgrade notifications via your preferred channel (Slack, Jira, email)
- Never treat undocumented fields as stable
## Common Mistakes
**Contracts as documentation only.** A contract that isn't checked by any automated process is just a comment. It will drift from reality. Wire the contract into CI/CD, schema registry compatibility checks, or dbt tests.

**Skipping semantic definitions.** Field types are the easy part. The hard part is agreeing what `event_time` means — generated timestamp, Kafka ingestion time, or warehouse landing time? Semantic misalignment causes silent wrong results that are far harder to debug than schema errors.

**No consumer registry.** If you don't know who depends on a dataset, you can't notify them of changes. A consumer list in the contract file is the minimum viable answer.

**No deprecation policy.** How long do you maintain v1.x after v2.0 ships? Define this before it becomes a negotiation under pressure.

**Treating contracts as a platform team problem.** Contracts work when every team — including application developers — sees schema stability as their responsibility. If only the data team cares, you're writing contracts into a void.
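The `event_time` ambiguity is easy to demonstrate: the same naive timestamp string denotes different instants depending on which clock the producer meant. A small illustration (the UTC-8 offset stands in for a hypothetical producer's local clock):

```python
from datetime import datetime, timezone, timedelta

# The same naive string, interpreted under two plausible readings of
# "event_time". Only a contract can say which one the producer meant.
raw = "2024-03-01 03:00:00"
naive = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")

as_utc     = naive.replace(tzinfo=timezone.utc)
as_pacific = naive.replace(tzinfo=timezone(timedelta(hours=-8)))

# Eight hours apart — enough to land events in the wrong daily partition.
print(as_pacific - as_utc)  # 8:00:00
```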
## Tooling Landscape
| Tool | Approach | Best For |
|---|---|---|
| Soda Core | YAML contracts + quality assertions | dbt/warehouse teams |
| Confluent Schema Registry | Protocol-level enforcement | Kafka/streaming teams |
| dbt contracts (1.5+) | Model-level enforcement | dbt transformation layer |
| OpenMetadata / DataHub | Catalog + contract metadata | Platform teams |
| Atlan / Collibra | Enterprise data governance | Larger organizations |
There's no universal winner. Pick based on where your data actually flows and what tooling your team already operates.
## Contracts and Exploration
When a contract is well-defined, ad-hoc exploration becomes much safer. You know the schema, the semantics, and the quality guarantees — so queries are predictable. Harbinger Explorer's natural language interface works well in this context: when you can describe what a dataset's fields actually mean, the AI generates SQL that reflects those semantics rather than guessing from column names alone.
## Conclusion
Data contracts shift the conversation from "who broke the pipeline" to "how do we evolve data safely." They require upfront discipline from producers, but they pay off in fewer incidents, faster debugging, and durable trust between teams. Start small: pick one critical dataset, write a YAML contract for it, wire it into CI, and register your consumers. The rest follows.
For runtime quality validation that checks contracts are being honored, see the Data Quality Testing guide.