Data Contracts Need Relationship Context

Data Contracts Solve Half the Problem

Data contracts have become one of the most discussed concepts in modern data engineering. The idea is straightforward: producers of data define a contract that specifies the schema, data types, freshness guarantees, and quality expectations for their dataset. Consumers depend on that contract. Changes require negotiation.

This is a genuine improvement over the previous state of affairs, where schema changes propagated downstream without warning and broke dashboards, models, and pipelines. Contracts bring discipline and predictability to data handoffs.

But contracts, as typically implemented, cover individual datasets in isolation. They specify what a table promises about itself. They rarely specify what a table promises about its relationships to other tables.

What Contracts Cover Today

A typical data contract specifies:

Schema definition. Column names, data types, nullability constraints.

Freshness. How often the data is updated and the maximum acceptable lag.

Quality rules. Uniqueness of primary keys, acceptable null rates, value ranges.

Ownership. Which team produces the data and who to contact about changes.

Versioning. How breaking and non-breaking changes are handled.

These are all table-scoped. They describe properties of one dataset. This is useful and necessary, but it leaves a gap.
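To make "table-scoped" concrete, here is a minimal sketch of a contract expressed as data plus a checker. The names TableContract, ColumnSpec, and check_table are hypothetical and not taken from any contract framework; the point is only that every check looks at a single dataset.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ColumnSpec:
    name: str
    dtype: type
    nullable: bool = False


@dataclass(frozen=True)
class TableContract:
    table: str
    owner: str
    version: str
    columns: tuple[ColumnSpec, ...]
    primary_key: str
    max_lag: timedelta


def check_table(contract, rows, last_loaded_at, now):
    """Return a list of human-readable violations; empty means compliant."""
    violations = []
    for i, row in enumerate(rows):
        for col in contract.columns:
            value = row.get(col.name)
            if value is None:
                if not col.nullable:
                    violations.append(f"row {i}: {col.name} is null")
            elif not isinstance(value, col.dtype):
                violations.append(f"row {i}: {col.name} is not {col.dtype.__name__}")
    keys = [row.get(contract.primary_key) for row in rows]
    if len(keys) != len(set(keys)):
        violations.append(f"{contract.primary_key} is not unique")
    if now - last_loaded_at > contract.max_lag:
        violations.append("data is stale")
    return violations


orders_contract = TableContract(
    table="orders", owner="team-b", version="1.0.0",
    columns=(ColumnSpec("order_id", int), ColumnSpec("customer_id", int),
             ColumnSpec("amount", float)),
    primary_key="order_id",
    max_lag=timedelta(hours=24),
)

now = datetime(2024, 1, 2, tzinfo=timezone.utc)
rows = [
    {"order_id": 1, "customer_id": 10, "amount": 25.0},
    # Nothing here can tell whether customer 999 exists in customers.
    {"order_id": 2, "customer_id": 999, "amount": 40.0},
]
violations = check_table(orders_contract, rows, now - timedelta(hours=1), now)
```

The second row passes every check even if customer 999 exists nowhere, because the checker never sees the customers table.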

The Missing Piece

What contracts typically do not cover:

Referential relationships. If the orders table has a customer_id column, does the contract guarantee that every value exists in the customers table?

Cross-contract consistency. If two teams own two tables with a relationship between them, which contract governs the relationship?

Cardinality guarantees. If the contract says customer_id references customers, does it also guarantee the relationship is many-to-one? Or could it be many-to-many?

JOIN safety. Can consumers safely JOIN these two tables on these columns without producing duplicates or losing rows?

These are relationship-level properties. They exist at the intersection of two datasets, not within either one individually.
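As a rough illustration, the sketch below profiles a pair of key columns for orphans, cardinality, and the number of rows an inner JOIN would produce. The function name profile_relationship and the sample values are assumptions made for this example.

```python
from collections import Counter


def profile_relationship(child_keys, parent_keys):
    """Summarize how child_keys relate to parent_keys for an equi-join."""
    child_counts = Counter(k for k in child_keys if k is not None)
    parent_counts = Counter(k for k in parent_keys if k is not None)

    # Child values with no matching parent: these rows vanish in an inner join.
    orphans = sorted(k for k in child_counts if k not in parent_counts)
    child_unique = all(n == 1 for n in child_counts.values())
    parent_unique = all(n == 1 for n in parent_counts.values())

    if child_unique and parent_unique:
        cardinality = "one-to-one"
    elif parent_unique:
        cardinality = "many-to-one"
    elif child_unique:
        cardinality = "one-to-many"
    else:
        cardinality = "many-to-many"

    # Each child row matches every parent row that carries the same key.
    inner_join_rows = sum(n * parent_counts[k] for k, n in child_counts.items())
    return {
        "orphans": orphans,
        "cardinality": cardinality,
        "inner_join_rows": inner_join_rows,
        "child_rows": len(child_keys),
    }


orders_customer_ids = [10, 10, 11, 12]
customers_ids = [10, 11, 11]  # 11 is duplicated, 12 is missing

profile = profile_relationship(orders_customer_ids, customers_ids)
```

In this sample the JOIN returns four rows for four orders, so a row-count check looks healthy even though one order was duplicated and another was lost. That is why the check has to look at both sides of the relationship rather than at totals.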

When Contracts Break at JOIN Boundaries

The limitation becomes visible when a contract-compliant change in one table breaks queries that JOIN it with another table.

Scenario: Team A owns the customers table. Their contract specifies that customer_id is a unique, non-null primary key. Team B owns the orders table. Their contract specifies that customer_id is a non-null foreign key column.

Team A decides to implement soft deletes. Instead of removing rows, they add a deleted_at column and filter deleted records out of their own queries. Their contract still holds: customer_id is still unique and non-null. But now the orders table references customer records that are logically deleted.

Any downstream query that JOINs orders with customers without the same filter now includes deleted customers, and any query that applies the filter with an inner JOIN silently drops their orders. The contract for each table is intact. The relationship between them is broken.

Without a contract that governs the relationship, this class of issue goes undetected until someone notices wrong numbers in a dashboard.
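The scenario can be reproduced in a few lines with an in-memory SQLite database. The table layout, the active_customers view, and the amounts are invented for illustration; only the shape of the problem comes from the scenario above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, deleted_at TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        amount REAL
    );
    -- Customer 2 is soft-deleted but still referenced by order 101.
    INSERT INTO customers VALUES (1, NULL), (2, '2024-03-01');
    INSERT INTO orders VALUES (100, 1, 50.0), (101, 2, 70.0);
    CREATE VIEW active_customers AS
        SELECT * FROM customers WHERE deleted_at IS NULL;
""")


def scalar(sql):
    """Run a query that returns a single value."""
    return conn.execute(sql).fetchone()[0]


# Table-scoped checks from each contract still pass.
customers_pk_ok = scalar(
    "SELECT COUNT(*) = COUNT(DISTINCT customer_id) FROM customers") == 1
orders_fk_not_null = scalar(
    "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL") == 0

# Joining the raw table counts revenue from the deleted customer.
revenue_raw = scalar(
    "SELECT SUM(o.amount) FROM orders o JOIN customers c USING (customer_id)")

# Joining the filtered view silently drops that customer's order.
revenue_active = scalar(
    "SELECT SUM(o.amount) FROM orders o JOIN active_customers c USING (customer_id)")

# Relationship-level check: orders that point at no active customer.
orphaned = scalar("""
    SELECT COUNT(*) FROM orders o
    LEFT JOIN active_customers c ON c.customer_id = o.customer_id
    WHERE c.customer_id IS NULL
""")
```

Both table checks pass, yet the two revenue figures disagree (120.0 against 50.0) and neither query reports a problem. Only the last query, which spans both tables, surfaces the one order left without an active customer.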

Relationship Contracts: What They Would Look Like

A relationship contract would specify the properties of a relationship between two datasets:

Referential integrity. Every value in the child column exists in the parent column. This is the foreign key guarantee, formalized as a contract term.

Cardinality. The relationship is one-to-many, one-to-one, or many-to-many. This affects how consumers can safely aggregate data after JOINs: summing order amounts, for example, double-counts any order that matches more than one parent row.

Consistency scope. The relationship holds for all records, only active records, only records within a date range, or some other defined subset.

Change protocol. If either side needs to make a change that affects the relationship (like adding soft deletes), both parties are notified and the impact on the relationship is assessed before the change ships.

Validation. Automated tests verify the relationship properties on a regular cadence, just as current contracts validate individual table properties.

This is not theoretical. It is how foreign key constraints work in operational databases, formalized for the modern data stack, where analytical warehouses often accept such constraints as declarations without enforcing them, or do not support them at all.
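One possible shape for such a contract is sketched below. RelationshipContract, parent_scope, and validate are hypothetical names chosen for this example, and the change protocol is reduced to a list of owning teams. The consistency scope is modeled as a predicate over parent rows, which is one design choice among several.

```python
from dataclasses import dataclass
from typing import Callable

Row = dict


@dataclass(frozen=True)
class RelationshipContract:
    child_table: str
    child_column: str
    parent_table: str
    parent_column: str
    cardinality: str                     # "many-to-one" or "one-to-one"
    parent_scope: Callable[[Row], bool]  # which parent rows count as valid targets
    owners: tuple[str, ...]              # teams to notify before a change ships


def validate(contract, child_rows, parent_rows):
    """Return violations of the relationship terms; empty means the contract holds."""
    violations = []
    in_scope = [r[contract.parent_column] for r in parent_rows
                if contract.parent_scope(r)]
    if len(in_scope) != len(set(in_scope)):
        violations.append("parent key is not unique within scope")
    child_keys = [r[contract.child_column] for r in child_rows]
    missing = sorted(set(child_keys) - set(in_scope))
    if missing:
        violations.append(f"referential integrity: no in-scope parent for {missing}")
    if contract.cardinality == "one-to-one" and len(child_keys) != len(set(child_keys)):
        violations.append("child key repeats, so the relationship is not one-to-one")
    return violations


orders_to_customers = RelationshipContract(
    child_table="orders", child_column="customer_id",
    parent_table="customers", parent_column="customer_id",
    cardinality="many-to-one",
    parent_scope=lambda row: row["deleted_at"] is None,  # active records only
    owners=("team-a", "team-b"),
)

customers = [{"customer_id": 1, "deleted_at": None},
             {"customer_id": 2, "deleted_at": "2024-03-01"}]
orders = [{"order_id": 100, "customer_id": 1},
          {"order_id": 101, "customer_id": 2}]

violations = validate(orders_to_customers, orders, customers)
```

With the scope set to active records, the soft-deleted customer is no longer a valid target, so validation reports order 101's customer_id as a referential integrity violation instead of letting it pass unnoticed.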

Contract Enforcement With Discovery

The practical challenge with relationship contracts is knowing which relationships exist. You cannot write a contract for a relationship you do not know about.

This is where automated discovery becomes foundational. Before you can define relationship contracts, you need a complete inventory of relationships across your data platform. Many of these relationships are implicit, undocumented, and unknown to the teams that own the individual tables.

Discovery provides the inventory. For each relationship it finds, the results include the tables involved, the columns that connect them, the cardinality, and a confidence score. This is the raw material from which relationship contracts can be written.
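To give a feel for what a discovery result might contain, here is a deliberately naive sketch based on value containment: a column is a candidate child of another table's unique column if nearly all of its values appear there. The function discover_relationships, the min_confidence threshold, and the scoring rule are assumptions for illustration and do not describe how any particular discovery tool works.

```python
from itertools import permutations


def discover_relationships(tables, min_confidence=0.9):
    """tables maps table name -> {column name -> list of values}."""
    columns = [(t, c, vals) for t, cols in tables.items() for c, vals in cols.items()]
    found = []
    for (ct, cc, cvals), (pt, pc, pvals) in permutations(columns, 2):
        if ct == pt:
            continue  # only look across tables
        child = [v for v in cvals if v is not None]
        parent = [v for v in pvals if v is not None]
        if not child or len(parent) != len(set(parent)):
            continue  # a candidate parent column must look like a key
        parent_set = set(parent)
        # Confidence here is simply the share of child values found in the parent.
        confidence = sum(v in parent_set for v in child) / len(child)
        if confidence >= min_confidence:
            cardinality = "one-to-one" if len(child) == len(set(child)) else "many-to-one"
            found.append({"child": f"{ct}.{cc}", "parent": f"{pt}.{pc}",
                          "cardinality": cardinality, "confidence": confidence})
    return found


tables = {
    "customers": {"customer_id": [1, 2, 3], "name": ["a", "b", "c"]},
    "orders": {"order_id": [100, 101, 102, 103], "customer_id": [1, 1, 2, 3]},
}
found = discover_relationships(tables)
```

On this sample the only candidate is orders.customer_id referencing customers.customer_id, many-to-one, with a confidence of 1.0. Containment alone produces false positives on real data (small integer ranges overlap easily), so a practical approach would combine it with other signals such as column names and types.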

The workflow becomes:

Discover all relationships across your connected databases

Identify which relationships are critical (used in downstream queries, dashboards, or semantic models)

Define relationship contracts for the critical ones

Implement automated validation that checks relationship properties on every data load

Re-run discovery periodically to catch new relationships that need contracts

Data contracts were a step forward from no contracts. Relationship contracts are the next step. The relationships between your datasets are just as important as the datasets themselves, and they deserve the same level of formal specification and automated validation.