Why Data Lineage Tools Miss Relationships

Lineage Answers a Different Question

Data lineage tools answer the question "where did this data come from and where does it go?" They trace data from source systems through transformations to final outputs. This is valuable for debugging, compliance, and impact analysis.

But lineage does not answer a different, equally important question: "how do these entities relate to each other?" Knowing that the orders table flows from PostgreSQL through Fivetran to Snowflake through dbt to a Looker dashboard tells you the pipeline. It does not tell you that orders.customer_id references customers.id, or that the relationship is one-to-many, or that there are 47 other relationships in the schema that nobody documented.

These are fundamentally different questions. Confusing them leads teams to assume their lineage tool covers relationship documentation. It does not.

What Lineage Tracks vs What It Does Not

Lineage tools track data flow: source to transformation to destination. They know that column A in table X was derived from column B in table Y through a specific SQL transformation. They can tell you that if table Y changes, table X is affected.

What lineage does not track:

Which columns are foreign keys referencing other tables

The cardinality of relationships between entities (one-to-one, one-to-many, many-to-many)

Implicit relationships that exist in application logic but not in SQL transformations

Cross-database entity references that do not flow through a tracked pipeline

Relationships between tables in the same database that are used in ad-hoc queries but not in any documented pipeline

Lineage is about the journey of data. Relationships are about the structure of data.

The JOIN Blind Spot

Consider a simple analytics query:

sql

1SELECT c.name, SUM(o.amount) as total_spend
2FROM customers c
3JOIN orders o ON c.id = o.customer_id
4GROUP BY c.name

Lineage tools can trace the output columns back to their sources. They know that total_spend comes from orders.amount and name comes from customers.name. But the JOIN condition, the relationship between customers.id and orders.customer_id, is treated as a query detail, not as a documented entity relationship.

If an analyst writes this JOIN in a notebook, the lineage tool may not see it at all. Ad-hoc queries are often outside the scope of lineage tracking, which focuses on managed pipelines and models.

Column-Level Lineage Gets Closer But Still Falls Short

Column-level lineage is more granular. It tracks which specific columns feed into which outputs, including through JOINs and transformations. This gets closer to relationship documentation because it captures that orders.customer_id was used in a JOIN with customers.id.

But there are gaps. Column-level lineage records that a JOIN happened in a specific query or model. It does not declare that a relationship exists between orders and customers as a schema-level fact. The relationship is an observation from one query, not a documented truth about the data model.

Multiple queries might JOIN the same tables on different columns. Column-level lineage captures each instance separately. It does not synthesize them into a coherent entity-relationship model.

Entity Relationships Are Not Data Flow

The conceptual difference matters. Data lineage is a directed graph: data flows from A to B to C. Entity relationships are an undirected graph: table A relates to table B, regardless of which direction data flows.

An orders table does not "flow" to a customers table. They are related. The relationship has properties (which columns connect them, the cardinality, whether it is enforced) that are independent of any specific data pipeline.

This distinction explains why lineage tools do not capture relationships well. They are designed to model flow, not structure. Adding relationship documentation to a lineage tool is possible but awkward, like using a road map to describe building architecture.

Practical Consequences

The gap between lineage and relationship documentation creates three concrete problems.

Semantic layer gaps. Semantic layer tools like dbt MetricFlow need entity relationships to generate JOINs. Lineage can show which tables a metric depends on, but it cannot provide the relationship definitions the semantic engine needs. You still have to define those manually.

Governance blind spots. Data governance requires understanding both how data flows and how data relates. If you know that customer data flows from source to warehouse but do not know which tables in the warehouse reference customer entities, your access controls and retention policies have holes.

AI agent limitations. AI agents that query databases need relationship context to generate correct SQL. Lineage tells the agent which tables exist and how they were built. Relationships tell the agent how to JOIN them. Without both, the agent can find data but cannot combine it correctly.

Lineage Plus Relationship Discovery

The answer is not to replace lineage with relationship discovery. Both solve real problems. The answer is to recognize that they are complementary and invest in both.

Lineage tells you the pipeline: where data comes from, how it transforms, where it goes. Relationship discovery tells you the model: how entities connect, what the cardinality is, which relationships are explicit and which are implicit.

Together, they provide the complete metadata picture that data governance, semantic layers, and AI agents need. Separately, each leaves gaps that the other fills.