
Confidence Scoring for Database Relationships

Squish Team
November 22, 2025
8 min read

Not All Matches Are Equal

When you analyze a database schema for relationships, you find candidates everywhere. Any column named user_id is a potential foreign key to a users table. Any column with a matching data type could reference another table. The challenge is not finding candidates. It is knowing which ones are real.

A naive approach treats all name-matching columns as relationships. This produces dozens of false positives. A status column in the orders table matches a status column in the users table by name and type, but they are unrelated. A country_code column appears in five tables, but only two of them reference the same countries lookup table.

Confidence scoring solves this by combining multiple signals into a single number that represents how likely a candidate relationship is to be real.

The Signals

Four types of signals contribute to a confidence score, each capturing a different aspect of relationship evidence.

Name Match: 92% · Type Compatible: INT:INT · Cardinality: 1:N

Name matching is the most intuitive signal. A column named customer_id in the orders table almost certainly references the customers table. The match is not always this clean. Sometimes the column is cust_id, customerID, or buyer_id. Name matching algorithms handle variations by normalizing names, stripping common suffixes, and comparing against known patterns from the schema.

A strong name match scores high. A weak or partial match scores lower but does not eliminate the candidate. Name matching alone has too many false positives (coincidental naming) and false negatives (unconventional naming) to be reliable by itself.
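As a rough sketch of how such a matcher might work: the suffix list, partial-match score, and trailing-"s" singularization below are illustrative assumptions, not a fixed algorithm.

```python
import re

# Hypothetical normalization rules; real matchers use richer suffix lists
# and pluralization handling.
SUFFIXES = ("_id", "_key", "_fk", "_code")

def normalize(name: str) -> str:
    """Lowercase, split camelCase at case boundaries, strip common FK suffixes."""
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()
    for suffix in SUFFIXES:
        if name.endswith(suffix) and name != suffix:
            name = name[: -len(suffix)]
            break
    return name.strip("_")

def name_match_score(column: str, table: str) -> float:
    """Score in [0, 1]: 1.0 for an exact normalized match, 0.6 for partial."""
    col = normalize(column)
    tbl = normalize(table).rstrip("s")  # crude singularization: customers -> customer
    if col == tbl:
        return 1.0
    if col and (col in tbl or tbl in col):
        return 0.6
    return 0.0
```

Under these rules, customer_id, customerID, and cust_id all map toward the customers table, while status scores zero against unrelated tables.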

Type compatibility filters out physically impossible relationships. An INTEGER column cannot be a foreign key to a VARCHAR primary key. A TIMESTAMP column is not referencing a BOOLEAN. Type checking is a binary filter for obvious mismatches and a scoring factor for compatible-but-not-identical types. An INT referencing a BIGINT is more likely than an INT referencing a FLOAT.
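A minimal sketch of that two-level check, with type families and score values as illustrative assumptions:

```python
# Hypothetical type families; a real implementation would cover the full
# type system of the target database.
TYPE_FAMILIES = {
    "int": "integer", "integer": "integer", "bigint": "integer", "smallint": "integer",
    "float": "numeric", "double": "numeric", "decimal": "numeric", "numeric": "numeric",
    "varchar": "string", "char": "string", "text": "string",
}

def type_score(child_type: str, parent_type: str) -> float:
    """Binary filter for impossible pairs, graded score for compatible ones."""
    a, b = child_type.lower(), parent_type.lower()
    if a == b:
        return 1.0                        # identical: INT -> INT
    fa, fb = TYPE_FAMILIES.get(a), TYPE_FAMILIES.get(b)
    if fa and fa == fb:
        return 0.8                        # same family: INT -> BIGINT
    if {fa, fb} == {"integer", "numeric"}:
        return 0.3                        # castable but unlikely: INT -> FLOAT
    return 0.0                            # impossible: INT -> VARCHAR, filtered out
```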

Cardinality analysis examines the distribution of values. A true foreign key column has a cardinality pattern that matches its role. In a one-to-many relationship, the child column (like order.customer_id) has many rows with the same value, while the parent column (customer.id) has all unique values. Cardinality analysis checks whether the actual distribution matches the expected pattern for a foreign key.

If a candidate child column has all unique values and the parent column also has all unique values, that suggests a one-to-one relationship or no relationship at all. The pattern provides evidence either way.
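One way to sketch this check, taking distinct and total row counts (e.g., from COUNT(DISTINCT col) and COUNT(*) queries) for both sides; the thresholds and scores here are assumptions:

```python
def cardinality_score(child_distinct: int, child_rows: int,
                      parent_distinct: int, parent_rows: int) -> float:
    """Score how well the value distributions match a 1:N foreign-key pattern."""
    if child_rows == 0 or parent_rows == 0:
        return 0.0
    parent_unique = parent_distinct / parent_rows  # ~1.0 for a true primary key
    child_ratio = child_distinct / child_rows      # < 1.0 for a repeating child column
    if parent_unique < 0.99:
        return 0.0   # parent side is not unique, so it cannot be the referenced key
    if child_ratio < 0.9:
        return 1.0   # classic many-to-one: many child rows share parent values
    return 0.4       # both sides unique: 1:1 relationship or no relationship
```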

Value overlap is the strongest signal when available. If 98% of values in orders.customer_id also exist in customers.id, the columns are almost certainly related. If only 5% overlap, they probably are not, despite matching names.

Value overlap requires querying actual data: sampling values from the candidate child column and checking how many of them exist in the candidate parent column. This is more invasive than metadata-only analysis but provides the highest-quality signal.
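A sketch of such a sampled overlap check, demonstrated with sqlite3 purely so it runs; the table and column names are placeholders, and a production version would quote identifiers and use proper sampling rather than LIMIT.

```python
import sqlite3

def value_overlap(conn, child_table, child_col, parent_table, parent_col,
                  sample_size=10_000):
    """Fraction of sampled child values that also exist in the parent column.

    Caution: identifiers are interpolated directly, so they must come from
    trusted schema metadata, never from user input.
    """
    sql = f"""
        WITH sample AS (
            SELECT {child_col} AS v FROM {child_table}
            WHERE {child_col} IS NOT NULL
            LIMIT {sample_size}
        )
        SELECT AVG(EXISTS (SELECT 1 FROM {parent_table} p
                           WHERE p.{parent_col} = sample.v))
        FROM sample
    """
    row = conn.execute(sql).fetchone()
    return row[0] or 0.0
```

Because EXISTS evaluates to 0 or 1 per sampled row, averaging it yields the overlap fraction directly.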

How Signals Combine Into a Score

Each signal produces a number between 0 and 1. These are combined using a weighted formula where stronger signals carry more weight.

Name match might contribute 30% of the score, type compatibility 15%, cardinality 25%, and value overlap 30%. The weights reflect the relative reliability of each signal: value overlap and cardinality are more reliable than name matching alone.

The combined score falls on a 0-100 scale. Scores above 90 indicate high confidence: multiple signals agree strongly. Scores between 60 and 90 indicate medium confidence: some signals are strong but others are ambiguous. Scores below 60 indicate low confidence: the match is speculative.
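The weighting and tiering above can be sketched directly; the weights and cutoffs are taken from this article's example numbers, not from any fixed standard.

```python
# Example weights from the article: name 30%, type 15%, cardinality 25%,
# value overlap 30%.
WEIGHTS = {"name": 0.30, "type": 0.15, "cardinality": 0.25, "overlap": 0.30}

def confidence(signals: dict) -> float:
    """Combine per-signal scores (each in [0, 1]) into a 0-100 confidence."""
    score = sum(WEIGHTS[k] * signals.get(k, 0.0) for k in WEIGHTS)
    return round(score * 100, 1)

def tier(score: float) -> str:
    """Map a 0-100 confidence score onto the article's three tiers."""
    if score > 90:
        return "high"
    if score >= 60:
        return "medium"
    return "low"
```

For example, a candidate with a perfect name match, identical types, a clean 1:N cardinality pattern, and 99% value overlap scores 99.7 and lands in the high tier.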

Real-World Scoring Examples

High confidence (95+). Column named user_id, INTEGER type, in a table called orders. Parent table users has a primary key id of type INTEGER. Cardinality shows many-to-one pattern. Value overlap is 99%. Every signal agrees. This is almost certainly a real relationship.

Medium confidence (70-85). Column named ref_code of type VARCHAR in a table called transactions. A table called reference_codes has a primary key code of type VARCHAR. Name match is partial. Types are compatible. Cardinality pattern is plausible. Value overlap is 78%. Probably a real relationship, but needs human verification.

Low confidence (40-60). Column named status of type INTEGER in a table called orders. A table called statuses has a primary key id of type INTEGER. Name match is generic. Types are compatible. But cardinality shows only 5 distinct values in the child column, which could be an enum rather than a foreign key. Value overlap is 100% but with only 5 values, that is expected by chance. This might be a relationship or it might be coincidental.

False Positives and How to Handle Them

False positives are unavoidable in automated discovery. The goal is not to eliminate them but to rank them low enough that they do not pollute the high-confidence results.

Common sources of false positives:

  • Generic column names like status, type, name that appear in many tables
  • Columns that share a name and type by coincidence, like amount in two unrelated tables
  • Lookup tables with very few rows, where value overlap is high by chance
  • Denormalized columns that were copied from a parent table but are not true references

The confidence score handles most of these by weighing multiple signals. A status column might match by name, but its cardinality pattern (very few distinct values) and context (a generic name with no clear parent table) pull the score down.

For the remaining false positives, human review is the final filter. The scoring system makes this practical by ranking results. A human reviewing the top 50 results can quickly confirm or reject candidates, starting with the highest-confidence matches that are most likely correct.

Using Scores in Practice

Confidence scores enable graduated workflows. High-confidence relationships can be automatically added to documentation and dbt YAML. Medium-confidence relationships go to a review queue. Low-confidence relationships are logged for reference but not surfaced as recommendations.

This tiered approach means that automated discovery produces immediate value (the high-confidence relationships) without requiring a human to review every candidate. The review effort focuses on the medium tier, where human judgment adds the most value.
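A minimal sketch of that routing step; the thresholds mirror the tiers described earlier, and the bucket names are illustrative assumptions.

```python
def route(candidates: list) -> dict:
    """Partition scored candidates into auto-apply / review / log buckets.

    Each candidate is a dict with at least a "score" key (0-100).
    """
    buckets = {"auto_apply": [], "review_queue": [], "log_only": []}
    for c in sorted(candidates, key=lambda c: c["score"], reverse=True):
        if c["score"] > 90:
            buckets["auto_apply"].append(c)     # e.g., write to docs / dbt YAML
        elif c["score"] >= 60:
            buckets["review_queue"].append(c)   # surface for human verification
        else:
            buckets["log_only"].append(c)       # keep for reference only
    return buckets
```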

Scores also provide a feedback mechanism. When a human confirms or rejects a candidate, that feedback can adjust the scoring model for future runs. Over time, the system learns which patterns are real relationships and which are coincidental in your specific schema.

confidence scoring · relationship discovery · false positives · database analysis · statistical analysis · data quality

Ready to discover your database relationships?

Stop spending weeks manually mapping relationships. Squish discovers them in 60 seconds with 95%+ accuracy.