dbt Best Practices That Actually Matter in Production
Most dbt best practices guides read like a style guide written by someone who has never had a pipeline break at 3am. This is not that. These are the patterns that survive contact with production workloads, messy source data, and teams that grow faster than their documentation.
The DAG Is the Product
Your dbt DAG is not an implementation detail. It is the most important artifact your data team produces. Treat it that way.
A well-structured DAG has clear layers. Sources feed staging models. Staging models feed intermediate models. Intermediate models feed marts. If your DAG looks like a plate of spaghetti, your data warehouse does too.
The most common DAG mistake is skipping layers. Someone needs a quick report, so they write a mart model that reads directly from a source. It works. Then another model depends on it. Then someone renames a column in the source, and two downstream models break instead of one.
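The fix is mechanical: the mart refs a staging model instead of reading the source, so a source rename is absorbed in exactly one place. A minimal sketch (the model and column names are illustrative, borrowed from the staging example below):

```sql
-- models/marts/fct_quick_report.sql
-- Reads through the staging layer via ref(), never from the source directly.
SELECT
    payment_id,
    payment_amount,
    payment_created_at
FROM {{ ref('stg_stripe__payments') }}
WHERE payment_status = 'completed'
```

If the source renames a column, only the staging model changes; this mart keeps working untouched.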
Staging Models Are Non-Negotiable
Every source table gets a staging model. No exceptions. The staging model does three things: renames columns to your convention, casts data types explicitly, and filters out obviously bad rows.
Here is what a staging model looks like in practice:
```sql
SELECT
    id AS payment_id,
    user_id,
    CAST(amount AS numeric(10, 2)) AS payment_amount,
    created_at AS payment_created_at,
    status AS payment_status
FROM {{ source('stripe', 'payments') }}
WHERE id IS NOT NULL
```

The staging layer is your firewall between messy source data and clean analytics. When a source schema changes, you fix it in one place. Every downstream model is insulated.
Some teams skip staging because it feels like boilerplate. Then they end up with fifteen models that all cast the same column differently. The staging layer is not overhead. It is the thing that prevents you from spending your Friday debugging a type mismatch.
Naming Conventions That Scale
Nobody wants to talk about naming conventions, but bad names are the reason your new hire spends three days figuring out which model to use. A good convention lets any engineer look at a model name and immediately know what it does and where it sits in the DAG.
The convention that works:
```yaml
models:
  - name: stg_stripe__payments
  - name: stg_salesforce__accounts
  - name: int_payments_pivoted_to_orders
  - name: fct_orders
  - name: dim_customers
```

The pattern is {layer}_{source}__{entity} for staging, int_{description} for intermediate, and {type}_{entity} for marts, where type is fct for facts or dim for dimensions.
Double underscores separate the source system from the entity name. Single underscores separate words within a name. This is not arbitrary. When you have 200 models, being able to grep for stg_stripe__ to find all staging models from Stripe is the difference between a quick fix and a twenty-minute search.
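As a toy illustration of why the prefix convention pays off (the model names are hypothetical, standing in for files in a real project), a single anchored grep isolates one source system:

```shell
# With a consistent {layer}_{source}__{entity} convention, one grep
# finds every staging model for a single source system.
printf '%s\n' \
  stg_stripe__payments \
  stg_stripe__invoices \
  stg_salesforce__accounts \
  fct_orders \
| grep '^stg_stripe__'
```

In a real project you would run the same pattern against the models directory, or use dbt's own selection syntax.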
Testing Is Not Optional
The built-in dbt tests catch real problems. At minimum, every primary key gets unique and not_null. Every foreign key gets relationships. Every enum column gets accepted_values.
```yaml
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'completed', 'refunded', 'cancelled']
```

The relationships test is the one teams skip most often, and it is the one that catches the most insidious bugs. An orphaned foreign key does not cause a query error. It causes a silent wrong result when you JOIN and lose rows.
Beyond schema tests, add data quality tests for business logic. If revenue should never be negative, test for it. If order dates should never be in the future, test for it. These tests are cheap to write and expensive to not have.
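In dbt, these business-logic checks are singular tests: a SQL file in the tests/ directory that selects the rows violating the rule, and fails if any come back. A sketch, assuming fct_orders has an amount column (the file and column names are illustrative):

```sql
-- tests/assert_no_negative_revenue.sql
-- A dbt singular test: dbt runs this query and the test FAILS
-- if it returns any rows. Each row returned is a violation.
SELECT
    order_id,
    amount
FROM {{ ref('fct_orders') }}
WHERE amount < 0
```

Because the test is just a SELECT, the failure output doubles as a debugging query: run it by hand and you are looking at the offending rows.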
Incremental Models Need Guardrails
Incremental models are powerful and dangerous. They process only new rows, which makes large tables manageable. But they also introduce a class of bugs that full-refresh models cannot have: rows that were updated after their initial load and never picked up again, leaving stale data in the target table.
The safe pattern for incremental models:
```sql
{{ config(
    materialized='incremental',
    unique_key='order_id',
    on_schema_change='append_new_columns'
) }}

SELECT
    order_id,
    customer_id,
    amount,
    status,
    updated_at
FROM {{ ref('stg_orders') }}

{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
```

Always use a unique_key. Without it, updated rows create duplicates instead of being merged. Always use a reliable timestamp column for the incremental filter. updated_at is safer than created_at because it catches row updates, not just inserts.
Run a full refresh periodically. Weekly or monthly, depending on your data volume. This catches any rows that slipped through the incremental logic and resets the state to a known-good baseline.
ref() Is a Contract
Every {{ ref() }} call in dbt creates a dependency in the DAG. This is not just a convenience for running models in the right order. It is a contract that says "this model depends on that model, and if that model changes, this one needs attention."
Treat ref() accordingly. Never ref a model you do not actually depend on. Never bypass ref by hardcoding table names. And keep {{ source() }} confined to staging models -- if you see it in a mart, something went wrong.
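The difference is easy to see side by side (the table and schema names are illustrative):

```sql
-- Anti-pattern: a hardcoded table name. It compiles and runs, but dbt
-- cannot see the dependency, so lineage and build order are both wrong.
-- SELECT * FROM analytics.stg_orders

-- Correct: ref() registers the edge in the DAG and resolves the
-- schema for whatever target environment you are running against.
SELECT * FROM {{ ref('stg_orders') }}
```

The hardcoded version also breaks the moment you run against a dev schema, because the literal name points at production.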
The discipline of routing all dependencies through ref() pays off when you need to answer questions like "what would break if I changed this column?" or "which models does this data flow through?" Those questions are trivial to answer in a project that uses ref() consistently. They are nearly impossible in one that does not.
What Squish Adds to dbt
Squish discovers the relationships that dbt needs you to define manually. When you connect your database, Squish identifies every relationship, including the implicit ones that your ORM created but never constrained. Export the results and you have a starting point for your dbt YAML that includes relationship tests you would not have known to write.
Squish discovers what exists, dbt tests that it stays correct, and your staging layer keeps it clean. Together they cover the implicit foreign keys your ORM scattered across the schema without telling anyone.