# Data Contracts for Teams
Every data engineer has hit this wall: the upstream team changed a column type, dropped a field, or renamed a table — without telling anyone. Your pipeline failed silently at 3 AM, the dashboard showed zeros, and the business blamed data engineering. Data contracts exist to stop this from happening.
A data contract is a formal, versioned agreement between a data producer (the team that owns a table or event stream) and its consumers (the pipelines and applications that depend on it). Think of it as an API contract, but for data assets. The concept isn't new — service teams have used OpenAPI contracts for years — but its application to data pipelines is recent and still maturing.
## Why Teams Resist Contracts (and Why That's Wrong)
The usual objection is overhead. Teams worry about process, documentation, and slowing down development velocity. This reasoning is backwards. Undocumented schema changes cost more — in incident response time, debugging cycles, and eroded trust between teams. A contract forces the conversation before the breaking change lands, not after.
The real overhead isn't writing contracts. It's the lack of them.
## What a Data Contract Contains
A complete data contract specifies:
| Component | Description | Example |
|---|---|---|
| Schema | Field names, types, nullability | user_id: INT NOT NULL |
| Semantics | What each field actually means | "event_time is UTC wall-clock time, not server-local time" |
| SLA | Freshness and availability guarantees | "Updated within 15 min of event, 99.5% uptime" |
| Ownership | Who is responsible for the dataset | Team: Checkout Platform, Slack: #checkout-data |
| Versioning | How changes are communicated | Semver: breaking = major bump, additive = minor |
| Quality rules | Expectations consumers rely on | "amount > 0 always, currency is always ISO 4217" |
## Contract Formats in Practice
There is no single standard yet; three approaches dominate in practice.
### 1. YAML-Based Contracts (Open Data Contract Standard)
The Open Data Contract Standard (ODCS) defines a YAML schema for data contracts. It has been gaining traction among teams that want a lightweight, version-controlled approach without buying a platform.
```yaml
# ODCS-compatible data contract (simplified)
apiVersion: v2.3.0
kind: DataContract
uuid: "a8b2c3d4-1234-5678-abcd-ef0123456789"
datasetName: orders
version: "1.4.0"
status: active
description:
  purpose: "Order lifecycle events for analytics and downstream ML"
  usage: "Read-only. Do not rely on fields marked internal."
team: checkout-platform
owner: platform-data@company.com
schema:
  - name: orders
    physicalName: checkout.orders
    columns:
      - name: order_id
        logicalType: string
        physicalType: VARCHAR(36)
        required: true
        description: "UUID v4. Immutable after creation."
      - name: user_id
        logicalType: integer
        physicalType: BIGINT
        required: true
      - name: status
        logicalType: string
        physicalType: VARCHAR(20)
        required: true
        description: "Enum: pending, confirmed, shipped, delivered, cancelled"
      - name: amount_usd
        logicalType: number
        physicalType: DECIMAL(10,2)
        required: true
quality:
  - rule: "amount_usd >= 0"
    action: fail
sla:
  - property: freshness
    value: "15 minutes"
  - property: completeness
    value: "99.9%"
consumers:
  - team: analytics-platform
    contact: analytics@company.com
  - team: ml-platform
    contact: mlops@company.com
```
This file lives in the producer team's repository, versioned and reviewed like application code. Changes to it trigger notifications to registered consumers.
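To make that review step concrete, here is a minimal sketch of a CI-style check that validates a sample record against such a contract. It assumes the YAML has already been parsed; the dict shape and the `violations` helper are illustrative, not part of any ODCS tooling.

```python
# Hypothetical CI-style check: validate a record against a parsed contract.
# Column names mirror the example contract above; the structure of
# CONTRACT and the violations() helper are assumptions for illustration.

CONTRACT = {
    "columns": {
        "order_id":   {"type": str,   "required": True},
        "user_id":    {"type": int,   "required": True},
        "status":     {"type": str,   "required": True},
        "amount_usd": {"type": float, "required": True},
    },
    # Quality rules expressed as predicates over a record
    "quality": [lambda r: r["amount_usd"] >= 0],
}

def violations(record: dict, contract: dict) -> list[str]:
    """Return human-readable contract violations for one record."""
    problems = []
    for name, spec in contract["columns"].items():
        if name not in record:
            if spec["required"]:
                problems.append(f"missing required column: {name}")
            continue
        if not isinstance(record[name], spec["type"]):
            problems.append(f"wrong type for {name}: {type(record[name]).__name__}")
    for rule in contract["quality"]:
        if not rule(record):
            problems.append("quality rule failed")
    return problems

good = {"order_id": "a8b2", "user_id": 42, "status": "pending", "amount_usd": 19.99}
bad  = {"order_id": "a8b2", "user_id": "42", "status": "pending", "amount_usd": -5.0}

print(violations(good, CONTRACT))  # []
print(violations(bad, CONTRACT))   # type error on user_id + quality failure
```

Running a check like this on every PR against the contract file catches drift before any consumer sees it.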
### 2. Schema Registry (Confluent / Karapace)
For event-driven architectures on Kafka, Schema Registry enforces contracts at the protocol level. Producers register an Avro, Protobuf, or JSON Schema. Consumers decode messages using the registered schema. Compatibility rules are enforced by the registry — a producer literally cannot publish a breaking schema change without updating the contract version and passing the compatibility check first.
```bash
# Register a schema version in Confluent Schema Registry
curl -X POST http://schema-registry:8081/subjects/orders-value/versions \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d '{"schema": "{\"type\":\"record\",\"name\":\"Order\",\"fields\":[{\"name\":\"order_id\",\"type\":\"string\"},{\"name\":\"amount_usd\",\"type\":\"double\"},{\"name\":\"status\",\"type\":\"string\"}]}"}'

# Check compatibility before publishing a new schema version
curl -X POST http://schema-registry:8081/compatibility/subjects/orders-value/versions/latest \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d '{"schema": "<new schema JSON here>"}'
# Returns: {"is_compatible": true}
```
Schema Registry is the strongest form of contract enforcement available. The trade-off is that it's tightly coupled to the Kafka ecosystem and adds infrastructure complexity.
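The core of what the registry checks can be approximated in a few lines. Under BACKWARD compatibility, a consumer on the new schema must be able to read data written with the old one. This toy sketch checks the two rules teams trip over most; real Avro resolution also handles type promotion, unions, and aliases.

```python
# Toy approximation of a BACKWARD compatibility check, in the spirit of
# what a schema registry does for Avro. The dict shape for field specs
# is an assumption for illustration, not a registry API.

def is_backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """Maps of field name -> {"type": ..., optionally "default": ...}."""
    for name, spec in new_fields.items():
        if name not in old_fields:
            # Old data never wrote this field: the reader needs a default.
            if "default" not in spec:
                return False
        elif spec["type"] != old_fields[name]["type"]:
            # Type changed (toy rule: no promotions allowed).
            return False
    return True  # Fields removed in the new schema are simply ignored.

old = {"order_id": {"type": "string"}, "amount_usd": {"type": "double"}}
ok  = {**old, "currency": {"type": "string", "default": "USD"}}
bad = {**old, "currency": {"type": "string"}}  # no default: incompatible

print(is_backward_compatible(old, ok))   # True
print(is_backward_compatible(old, bad))  # False
```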
### 3. dbt Contracts (dbt Core 1.5+)
For teams using dbt, the `contract: enforced: true` setting turns the schema YAML file into a runtime-enforced contract. If the model output doesn't match the declared columns and types, the dbt run fails loudly before the data reaches consumers.
```yaml
# dbt schema.yml — model with enforced contract
models:
  - name: orders
    config:
      contract:
        enforced: true
    columns:
      - name: order_id
        data_type: varchar
        constraints:
          - type: not_null
          - type: unique
        description: "UUID v4. Immutable."
      - name: amount_usd
        data_type: numeric
        constraints:
          - type: not_null
      - name: status
        data_type: varchar
        constraints:
          - type: not_null
```
The dbt approach is ideal for the transformation layer — it prevents a model refactor from silently breaking downstream consumers without any additional tooling.
## What "Breaking" Actually Means
Teams frequently under-specify this, which leads to disputes. A breaking change is any change that can cause a previously working consumer to fail or produce incorrect results:
| Change | Breaking? | Notes |
|---|---|---|
| Remove a column | ✅ Yes | Always breaking |
| Rename a column | ✅ Yes | Always breaking |
| Change type (e.g., INT → VARCHAR) | ✅ Yes | Always breaking |
| Tighten nullability (nullable → NOT NULL) | ✅ Yes | Rejects previously valid rows |
| Add a new NOT NULL column without default | ✅ Yes | Breaks INSERT statements |
| Change enum values | ✅ Yes | Breaks CASE/IF logic downstream |
| Add a new nullable column | ❌ No | Safe for most consumers |
| Loosen nullability (NOT NULL → nullable) | ❌ No | Safe |
| Add index or constraint with no type change | ❌ No | Transparent to consumers |
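The table above is mechanical enough to automate. A sketch of a classifier over two column maps (the `{"type": ..., "nullable": ...}` spec shape is an assumption for illustration, not a standard format):

```python
# Classify a schema diff as breaking or safe, following the table above.
# The column-spec shape is hypothetical; adapt it to your contract format.

def breaking_changes(old: dict, new: dict) -> list[str]:
    """Return descriptions of breaking changes between two column maps."""
    breaks = []
    for name, spec in old.items():
        if name not in new:
            breaks.append(f"removed column: {name}")       # always breaking
            continue
        if new[name]["type"] != spec["type"]:
            breaks.append(f"type change on {name}")        # always breaking
        if spec["nullable"] and not new[name]["nullable"]:
            breaks.append(f"tightened nullability on {name}")
    for name, spec in new.items():
        # New NOT NULL column (a default would make this safe; omitted here)
        if name not in old and not spec["nullable"]:
            breaks.append(f"new NOT NULL column: {name}")
    return breaks  # empty list means the change is additive/safe

old = {"order_id": {"type": "VARCHAR", "nullable": False}}
new = {"order_id": {"type": "VARCHAR", "nullable": False},
       "coupon":   {"type": "VARCHAR", "nullable": True}}  # nullable add: safe
print(breaking_changes(old, new))  # []
```

Wired into CI on the contract file, a check like this turns the "is this breaking?" dispute into a failing build.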
## The Producer/Consumer Protocol
A contract without a process is just documentation. Here's a minimal workflow that holds up in practice:
For producers (schema change protocol):
- Any change that could break consumers requires a contract version bump before deployment
- Breaking changes require advance notice — define a lead time (e.g., two sprints) in the contract
- Additive changes (new nullable field) require a minor version bump and a consumer notification
- The contract lives in the producer's repo; PRs against it trigger notifications to all registered consumers
For consumers:
- Register as a consumer in the contract file — this is how producers know who to notify
- Pin your pipeline configuration to a specific contract version
- Opt into version upgrade notifications via your preferred channel (Slack, Jira, email)
- Never treat undocumented fields as stable
## Common Mistakes
**Contracts as documentation only.** A contract that isn't checked by any automated process is just a comment. It will drift from reality. Wire the contract into CI/CD, schema registry compatibility checks, or dbt tests.

**Skipping semantic definitions.** Field types are the easy part. The hard part is agreeing what `event_time` means — generated timestamp, Kafka ingestion time, or warehouse landing time? Semantic misalignment causes silent wrong results that are far harder to debug than schema errors.

**No consumer registry.** If you don't know who depends on a dataset, you can't notify them of changes. A consumer list in the contract file is the minimum viable answer.

**No deprecation policy.** How long do you maintain v1.x after v2.0 ships? Define this before it becomes a negotiation under pressure.

**Treating contracts as a platform team problem.** Contracts work when every team — including application developers — sees schema stability as their responsibility. If only the data team cares, you're writing contracts into a void.
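The `event_time` ambiguity is easy to demonstrate: the same naive timestamp string denotes different instants depending on which clock the producer meant. A small illustration (the UTC-8 offset stands in for a hypothetical producer's local clock):

```python
from datetime import datetime, timezone, timedelta

# The same naive string, interpreted under two plausible readings of
# "event_time". Only a contract can say which one the producer meant.
raw = "2024-03-01 03:00:00"
naive = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")

as_utc     = naive.replace(tzinfo=timezone.utc)
as_pacific = naive.replace(tzinfo=timezone(timedelta(hours=-8)))

# Eight hours apart — enough to land events in the wrong daily partition.
print(as_pacific - as_utc)  # 8:00:00
```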
## Tooling Landscape
| Tool | Approach | Best For |
|---|---|---|
| Soda Core | YAML contracts + quality assertions | dbt/warehouse teams |
| Confluent Schema Registry | Protocol-level enforcement | Kafka/streaming teams |
| dbt contracts (1.5+) | Model-level enforcement | dbt transformation layer |
| OpenMetadata / DataHub | Catalog + contract metadata | Platform teams |
| Atlan / Collibra | Enterprise data governance | Larger organizations |
There's no universal winner. Pick based on where your data actually flows and what tooling your team already operates.
## Contracts and Exploration
When a contract is well-defined, ad-hoc exploration becomes much safer. You know the schema, the semantics, and the quality guarantees — so queries are predictable. Harbinger Explorer's natural language interface works well in this context: when you can describe what a dataset's fields actually mean, the AI generates SQL that reflects those semantics rather than guessing from column names alone.
## Conclusion
Data contracts shift the conversation from "who broke the pipeline" to "how do we evolve data safely." They require upfront discipline from producers, but they pay off in fewer incidents, faster debugging, and durable trust between teams. Start small: pick one critical dataset, write a YAML contract for it, wire it into CI, and register your consumers. The rest follows.
For runtime quality validation that checks contracts are being honored, see the Data Quality Testing guide.