How Semantic Layers Work on the Backend
Semantic layers are marketed as a business-user feature, but they are fundamentally a backend engineering problem. A semantic layer takes a high-level question like "what was revenue last quarter" and generates the correct SQL to answer it. Understanding how that translation works is essential for building one that actually produces trustworthy results.
The Translation Layer
A semantic layer is a compiler. It takes input in one language (business questions or metric definitions) and produces output in another (SQL). The quality of the output depends entirely on the quality of the intermediate representation: the semantic model.
The semantic model defines three things:

- Entities: the nouns in your data (orders, customers) and the keys that connect them
- Measures: the numeric facts you aggregate, like SUM(amount) as revenue
- Dimensions: the attributes you group and filter by, including time
When a consumer asks "revenue by quarter," the semantic layer looks up the definition of "revenue" (a measure), determines which table it comes from, identifies the time dimension, and generates a GROUP BY query with the correct aggregation.
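That lookup-and-generate flow can be sketched in a few lines of Python. This is a toy illustration, not any real engine's API; `MODEL` and `compile_query` are hypothetical names.

```python
# Toy sketch of metric-to-SQL compilation. The model and function names
# are illustrative, not any real semantic engine's API.

MODEL = {
    "measures": {"revenue": {"table": "fct_orders", "agg": "SUM", "expr": "amount"}},
    "dimensions": {"order_date": {"table": "fct_orders", "type": "time"}},
}

def compile_query(measure: str, time_dim: str, grain: str) -> str:
    m = MODEL["measures"][measure]           # resolve the measure definition
    assert time_dim in MODEL["dimensions"]   # resolve the time dimension
    bucket = f"DATE_TRUNC('{grain}', {time_dim})"
    return (
        f"SELECT {bucket} AS {time_dim}__{grain}, "
        f"{m['agg']}({m['expr']}) AS {measure} "
        f"FROM {m['table']} "
        f"GROUP BY {bucket}"
    )

print(compile_query("revenue", "order_date", "quarter"))
```

Everything interesting happens in the lookups: the SQL itself is mechanical once the measure and dimension definitions resolve.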
Architecture of a Semantic Layer
If you peel back the marketing diagrams, the actual architecture is straightforward.
Consumers sit at the top -- BI tools, notebooks, AI agents, custom applications. They send queries in whatever high-level format they prefer. Looker sends LookML. An AI agent sends a natural language question that gets parsed into a structured metric request.
The semantic engine sits in the middle. MetricFlow, Cortex, Genie -- pick your flavor. It takes the structured request, resolves it against the semantic model, and spits out SQL.
The data warehouse sits at the bottom and has no idea any of this is happening. It receives SQL, runs it, returns rows. The simplicity of this bottom layer is the whole point.
So the interesting engineering is all in the middle. Here is how two major implementations handle it.
How MetricFlow Generates SQL
MetricFlow is the engine behind the dbt Semantic Layer. When you define a metric in dbt, MetricFlow handles the SQL generation.
A metric definition looks like this:
```yaml
semantic_models:
  - name: orders
    defaults:
      agg_time_dimension: order_date
    entities:
      - name: order_id
        type: primary
      - name: customer_id
        type: foreign
    measures:
      - name: revenue
        agg: sum
        expr: amount
      - name: order_count
        agg: count
        expr: order_id
    dimensions:
      - name: order_date
        type: time
      - name: status
        type: categorical
```

When you query revenue grouped by quarter, MetricFlow:
1. Resolves the metric to its underlying measure: revenue (orders)
2. Applies the measure's aggregation (SUM(amount))
3. Identifies the time dimension (order_date) and applies the grain (quarter)

The generated SQL:

```sql
SELECT
  DATE_TRUNC('quarter', order_date) AS metric_time__quarter,
  SUM(amount) AS revenue
FROM analytics.production.fct_orders
GROUP BY DATE_TRUNC('quarter', order_date)
ORDER BY metric_time__quarter
```

The power shows when queries span multiple entities. If you ask for revenue by customer_region, MetricFlow traces the foreign key from orders to customers, generates the JOIN, and includes the region dimension:
```sql
SELECT
  c.region AS customer__region,
  SUM(o.amount) AS revenue
FROM analytics.production.fct_orders o
JOIN analytics.production.dim_customers c
  ON o.customer_id = c.customer_id
GROUP BY c.region
```

MetricFlow needs to know the relationship between orders and customers to generate that JOIN. This is where the entity definitions and their types (primary, foreign) matter. Miss a relationship and MetricFlow cannot traverse from one entity to another.
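Entity traversal reduces to a lookup in a relationship graph. The following is an illustrative toy, not MetricFlow's internals; `ENTITIES` and `join_clause` are made-up names.

```python
# Illustrative sketch of entity-based JOIN resolution. The data structures
# are hypothetical, not MetricFlow internals.

ENTITIES = {
    # entity -> (table, {foreign_entity: join_key})
    "orders": ("fct_orders", {"customers": "customer_id"}),
    "customers": ("dim_customers", {}),
}

def join_clause(from_entity: str, to_entity: str) -> str:
    from_table, fks = ENTITIES[from_entity]
    key = fks.get(to_entity)
    if key is None:
        # A missing entry means no declared path: the engine cannot invent one
        raise ValueError(f"no declared path from {from_entity} to {to_entity}")
    to_table, _ = ENTITIES[to_entity]
    return f"FROM {from_table} JOIN {to_table} ON {from_table}.{key} = {to_table}.{key}"

print(join_clause("orders", "customers"))
```

When the relationship is absent from the graph, the function fails rather than guessing, which mirrors the failure mode described above: an undeclared entity relationship is simply untraversable.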
How Cortex Analyst Differs
Snowflake Cortex Analyst takes a different approach. Instead of a programmatic metric API, it accepts natural language queries and uses an LLM to generate SQL based on a semantic model file.
The semantic model is a YAML file that describes your tables, but with business-friendly descriptions:
```yaml
tables:
  - name: fact_orders
    description: "All completed customer orders with revenue data"
    columns:
      - name: amount
        description: "Order total in USD after discounts"
      - name: order_date
        description: "Date the order was placed"
      - name: customer_id
        description: "References dim_customers"
    measures:
      - name: total_revenue
        expr: "SUM(amount)"
        description: "Total revenue in USD"
```

When a user asks "what was total revenue last quarter," Cortex:

1. Passes the question and the semantic model to an LLM
2. The LLM generates SQL, grounded by the table and column descriptions
3. Snowflake executes the SQL and returns the results
The tradeoff is flexibility versus predictability. Users can phrase questions however they want, which is great for adoption. But the same question might generate slightly different SQL on different runs. The results should be equivalent, but "should be" makes some data engineers nervous, and reasonably so.
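The LLM-in-the-loop pattern reduces to: assemble a prompt from the semantic model, let the model write SQL, execute it. Here is a purely illustrative sketch of the prompt-assembly step; nothing below is Snowflake's actual API or prompt format.

```python
# Purely illustrative sketch of grounding LLM SQL generation in a semantic
# model's descriptions. Not Snowflake's API or prompt format.

SEMANTIC_MODEL_YAML = """\
tables:
  - name: fact_orders
    description: "All completed customer orders with revenue data"
    measures:
      - name: total_revenue
        expr: "SUM(amount)"
"""

def build_prompt(question: str) -> str:
    # The descriptions in the YAML are the model's only knowledge of the
    # schema, so their quality bounds the quality of the generated SQL.
    return (
        "You translate business questions into SQL.\n"
        f"Semantic model:\n{SEMANTIC_MODEL_YAML}\n"
        f"Question: {question}\nSQL:"
    )

prompt = build_prompt("what was total revenue last quarter")
```

This is also where the nondeterminism enters: the structured half of the pipeline is fixed, but the SQL text itself comes out of a model call.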
What the Semantic Layer Needs From Your Schema
Despite the architectural differences, MetricFlow and Cortex need the same raw ingredients from your schema:
Table and column inventory
Every table, every column, every data type. This is the boring but essential baseline. You cannot query what the semantic layer does not know about, and a missing table is a silent gap that nobody notices until the CFO asks a question that returns no results.
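Building that inventory is mostly a query against the database's own metadata. A minimal sketch using SQLite's pragmas; on a warehouse you would read information_schema.columns instead:

```python
import sqlite3

# Minimal table/column inventory via SQLite's metadata pragmas.
# The example table is made up; a warehouse would use information_schema.

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE fct_orders (order_id INTEGER PRIMARY KEY, amount REAL, order_date TEXT)"
)

inventory = {
    table: [(col, dtype) for _, col, dtype, *_ in con.execute(f"PRAGMA table_info({table})")]
    for (table,) in con.execute("SELECT name FROM sqlite_master WHERE type='table'")
}
print(inventory)
```

The mechanics are trivial; the discipline is keeping the inventory complete as the schema changes.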
Relationships between entities
This one trips people up the most. JOINs are the hardest part of SQL generation. The semantic layer needs to know that customer_id in the orders table references customer_id in the customers table, and whether that is many-to-one or many-to-many. Get it wrong and your aggregations silently double-count.
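The double-counting failure is easy to reproduce. In this made-up example, a dimension table with duplicate rows per customer fans out the join and silently inflates revenue:

```python
# Demonstration of fan-out double-counting when a supposed many-to-one
# join is actually many-to-many. All data is made up.

orders = [  # one row per order
    {"order_id": 1, "customer_id": "c1", "amount": 100},
    {"order_id": 2, "customer_id": "c1", "amount": 50},
]
# A customer dimension that accidentally has two rows for the same
# customer (say, one per address): the join fans out.
customers = [
    {"customer_id": "c1", "region": "EMEA", "address": "a"},
    {"customer_id": "c1", "region": "EMEA", "address": "b"},
]

# Naive inner join (dict merge needs Python 3.9+)
joined = [o | c for o in orders for c in customers
          if o["customer_id"] == c["customer_id"]]

true_revenue = sum(o["amount"] for o in orders)    # 150
joined_revenue = sum(r["amount"] for r in joined)  # 300: each order matched twice
print(true_revenue, joined_revenue)
```

Nothing errors, nothing warns; the number is just double what it should be, which is exactly why declared cardinality matters.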
Business definitions
Which column is "revenue"? What filter defines "active customer"? No amount of schema metadata answers these questions. This is the part that requires humans sitting in a room until everyone agrees. The semantic layer just gives you a place to write down the answer.
Time dimensions
Nearly every metric query involves time, and time is surprisingly tricky. The semantic layer needs to know which columns are temporal, what grains make sense, and how your fiscal calendar maps to the regular one. A "quarterly revenue" number is wrong if the tool thinks Q1 starts in January but your finance team starts it in February.
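The fiscal mapping itself is small, which is part of why it is so easy to get silently wrong. A sketch assuming a February fiscal year start, as in the example above; `fiscal_quarter` is a hypothetical helper, not any tool's API:

```python
from datetime import date

# Hypothetical helper: map a calendar date to a fiscal quarter when the
# fiscal year starts in an arbitrary month (February here, per the example).

def fiscal_quarter(d: date, fiscal_year_start_month: int = 2) -> int:
    # Whole months elapsed since the fiscal year began, modulo 12
    offset = (d.month - fiscal_year_start_month) % 12
    return offset // 3 + 1

print(fiscal_quarter(date(2024, 1, 15)))  # January falls in Q4 of a Feb-start year
print(fiscal_quarter(date(2024, 2, 1)))   # February opens Q1
```

A tool that assumes a January start would label both dates Q1, and every "quarterly revenue" number downstream would be confidently wrong.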
The Metadata Bootstrapping Problem
Building a semantic model from scratch is tedious. For a database with 100 tables, you need to document every entity, every relationship, every measure, and every dimension. Most teams start enthusiastically and run out of energy around table 30.
Squish exists to skip that mechanical first phase. Connect your database and it discovers:

- Every table and column, with data types
- Explicit foreign key relationships declared in the schema
- Implicit relationships that were never declared as foreign keys
Export the results as a semantic model YAML and you have a first draft that covers the structural metadata. You still need to add business definitions (what "revenue" means, what "active" means), but you skip the mechanical work of mapping tables, columns, and relationships by hand.
Relationship Discovery Is the Foundation
The most critical piece of metadata for a semantic layer is the relationship graph. Without it, the semantic engine cannot generate JOINs, which means it cannot answer any question that spans multiple tables. And most interesting business questions span multiple tables.
Explicit foreign keys cover some relationships. But as we have written about before, production databases typically have 3-5x more implicit relationships than explicit ones. A semantic layer built only on explicit foreign keys has blind spots wherever implicit relationships exist.
Automated discovery finds both. The semantic layer gets a complete relationship graph, which means it can generate JOINs for every valid entity traversal, not just the ones someone remembered to document.
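As an illustration of what automated discovery involves (a deliberately simplified heuristic, not Squish's actual algorithm): a candidate join is plausible when the column name points at the table and nearly every child value resolves to a parent key.

```python
# Simplified relationship-discovery heuristic, for illustration only:
# combine a naming hint with value-overlap evidence.

def overlap(child_values: set, parent_keys: set) -> float:
    """Fraction of child values that resolve to a parent key."""
    if not child_values:
        return 0.0
    return len(child_values & parent_keys) / len(child_values)

def looks_like_fk(child_col: str, parent_table: str,
                  child_values: set, parent_keys: set) -> bool:
    # Naive singularization: "customers" -> "customer_id"
    name_hint = child_col == f"{parent_table.rstrip('s')}_id"
    return name_hint and overlap(child_values, parent_keys) > 0.95

print(looks_like_fk("customer_id", "customers",
                    {"c1", "c2", "c3"}, {"c1", "c2", "c3", "c4"}))
```

Real discovery has to handle messier naming, sampled value sets, and cardinality checks, but the core signal is the same: undeclared relationships leave statistical fingerprints in the data.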
Getting Started
If you are building a semantic layer:

1. Inventory your tables, columns, and relationships first; JOIN generation depends on them
2. Automate the structural metadata instead of documenting it by hand
3. Spend your human effort on business definitions and time semantics
The structural metadata is grunt work. The business definitions are where your team actually adds value. Automate the grunt work.