How Semantic Layers Work on the Backend
Semantic layers are marketed as a business-user feature, but they are fundamentally a backend engineering problem. A semantic layer takes a high-level question like "what was revenue last quarter" and generates the correct SQL to answer it. Understanding how that translation works is essential for building one that actually produces trustworthy results.
The Translation Layer
A semantic layer is a compiler. It takes input in one language (business questions or metric definitions) and produces output in another (SQL). The quality of the output depends entirely on the quality of the intermediate representation: the semantic model.
The semantic model defines three things:

- Entities: the nouns in your data (orders, customers) and the keys that connect them
- Measures: the numeric facts you aggregate, like SUM(amount) as revenue
- Dimensions: the attributes you group and filter by, including time
When a consumer asks "revenue by quarter," the semantic layer looks up the definition of "revenue" (a measure), determines which table it comes from, identifies the time dimension, and generates a GROUP BY query with the correct aggregation.
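That lookup-and-generate flow can be sketched in a few lines of Python. This is a toy illustration, not any real engine's API; `MODEL` and `compile_query` are hypothetical names.

```python
# Toy sketch of metric-to-SQL compilation. The model and function names
# are illustrative, not any real semantic engine's API.

MODEL = {
    "measures": {"revenue": {"table": "fct_orders", "agg": "SUM", "expr": "amount"}},
    "dimensions": {"order_date": {"table": "fct_orders", "type": "time"}},
}

def compile_query(measure: str, time_dim: str, grain: str) -> str:
    m = MODEL["measures"][measure]           # resolve the measure definition
    assert time_dim in MODEL["dimensions"]   # resolve the time dimension
    bucket = f"DATE_TRUNC('{grain}', {time_dim})"
    return (
        f"SELECT {bucket} AS {time_dim}__{grain}, "
        f"{m['agg']}({m['expr']}) AS {measure} "
        f"FROM {m['table']} "
        f"GROUP BY {bucket}"
    )

print(compile_query("revenue", "order_date", "quarter"))
```

Everything interesting happens in the lookups: the SQL itself is mechanical once the measure and dimension definitions resolve.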
Architecture of a Semantic Layer
If you peel back the marketing diagrams, the actual architecture is straightforward.
Consumers sit at the top -- BI tools, notebooks, AI agents, custom applications. They send queries in whatever high-level format they prefer. Looker sends LookML. An AI agent sends a natural language question that gets parsed into a structured metric request.
The semantic engine sits in the middle. MetricFlow, Cortex, Genie -- pick your flavor. It takes the structured request, resolves it against the semantic model, and spits out SQL.
The data warehouse sits at the bottom and has no idea any of this is happening. It receives SQL, runs it, returns rows. The simplicity of this bottom layer is the whole point.
So the interesting engineering is all in the middle. Here is how two major implementations handle it.
How MetricFlow Generates SQL
MetricFlow is the engine behind the dbt Semantic Layer. When you define a metric in dbt, MetricFlow handles the SQL generation.
A metric definition looks like this:
```yaml
semantic_models:
  - name: orders
    defaults:
      agg_time_dimension: order_date
    entities:
      - name: order_id
        type: primary
      - name: customer_id
        type: foreign
    measures:
      - name: revenue
        agg: sum
        expr: amount
      - name: order_count
        agg: count
        expr: order_id
    dimensions:
      - name: order_date
        type: time
      - name: status
        type: categorical
```

When you query revenue grouped by quarter, MetricFlow:
1. Resolves the metric to its underlying measure: revenue (orders)
2. Applies the measure's aggregation (SUM(amount))
3. Identifies the time dimension (order_date) and applies the grain (quarter)

The generated SQL:

```sql
SELECT
  DATE_TRUNC('quarter', order_date) AS metric_time__quarter,
  SUM(amount) AS revenue
FROM analytics.production.fct_orders
GROUP BY DATE_TRUNC('quarter', order_date)
ORDER BY metric_time__quarter
```

The power shows when queries span multiple entities. If you ask for revenue by customer_region, MetricFlow traces the foreign key from orders to customers, generates the JOIN, and includes the region dimension:
```sql
SELECT
  c.region AS customer__region,
  SUM(o.amount) AS revenue
FROM analytics.production.fct_orders o
JOIN analytics.production.dim_customers c
  ON o.customer_id = c.customer_id
GROUP BY c.region
```

MetricFlow needs to know the relationship between orders and customers to generate that JOIN. This is where the entity definitions and their types (primary, foreign) matter. Miss a relationship and MetricFlow cannot traverse from one entity to another.
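Entity traversal reduces to a lookup in a relationship graph. The following is an illustrative toy, not MetricFlow's internals; `ENTITIES` and `join_clause` are made-up names.

```python
# Illustrative sketch of entity-based JOIN resolution. The data structures
# are hypothetical, not MetricFlow internals.

ENTITIES = {
    # entity -> (table, {foreign_entity: join_key})
    "orders": ("fct_orders", {"customers": "customer_id"}),
    "customers": ("dim_customers", {}),
}

def join_clause(from_entity: str, to_entity: str) -> str:
    from_table, fks = ENTITIES[from_entity]
    key = fks.get(to_entity)
    if key is None:
        # A missing entry means no declared path: the engine cannot invent one
        raise ValueError(f"no declared path from {from_entity} to {to_entity}")
    to_table, _ = ENTITIES[to_entity]
    return f"FROM {from_table} JOIN {to_table} ON {from_table}.{key} = {to_table}.{key}"

print(join_clause("orders", "customers"))
```

When the relationship is absent from the graph, the function fails rather than guessing, which mirrors the failure mode described above: an undeclared entity relationship is simply untraversable.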
How Cortex Analyst Differs
Snowflake Cortex Analyst takes a different approach. Instead of a programmatic metric API, it accepts natural language queries and uses an LLM to generate SQL based on a semantic model file.
The semantic model is a YAML file that describes your tables, but with business-friendly descriptions:
```yaml
tables:
  - name: fact_orders
    description: "All completed customer orders with revenue data"
    columns:
      - name: amount
        description: "Order total in USD after discounts"
      - name: order_date
        description: "Date the order was placed"
      - name: customer_id
        description: "References dim_customers"
    measures:
      - name: total_revenue
        expr: "SUM(amount)"
        description: "Total revenue in USD"
```

When a user asks "what was total revenue last quarter," Cortex:

1. Passes the question and the semantic model to an LLM
2. The LLM generates SQL, grounded by the table and column descriptions
3. Snowflake executes the SQL and returns the results
The tradeoff is flexibility versus predictability. Users can phrase questions however they want, which is great for adoption. But the same question might generate slightly different SQL on different runs. The results should be equivalent, but "should be" makes some data engineers nervous, and reasonably so.
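The LLM-in-the-loop pattern reduces to: assemble a prompt from the semantic model, let the model write SQL, execute it. Here is a purely illustrative sketch of the prompt-assembly step; nothing below is Snowflake's actual API or prompt format.

```python
# Purely illustrative sketch of grounding LLM SQL generation in a semantic
# model's descriptions. Not Snowflake's API or prompt format.

SEMANTIC_MODEL_YAML = """\
tables:
  - name: fact_orders
    description: "All completed customer orders with revenue data"
    measures:
      - name: total_revenue
        expr: "SUM(amount)"
"""

def build_prompt(question: str) -> str:
    # The descriptions in the YAML are the model's only knowledge of the
    # schema, so their quality bounds the quality of the generated SQL.
    return (
        "You translate business questions into SQL.\n"
        f"Semantic model:\n{SEMANTIC_MODEL_YAML}\n"
        f"Question: {question}\nSQL:"
    )

prompt = build_prompt("what was total revenue last quarter")
```

This is also where the nondeterminism enters: the structured half of the pipeline is fixed, but the SQL text itself comes out of a model call.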
What the Semantic Layer Needs From Your Schema
Despite the architectural differences, MetricFlow and Cortex need the same raw ingredients from your schema:
Table and column inventory
Every table, every column, every data type. This is the boring but essential baseline. You cannot query what the semantic layer does not know about, and a missing table is a silent gap that nobody notices until the CFO asks a question that returns no results.
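Building that inventory is mostly a query against the database's own metadata. A minimal sketch using SQLite's pragmas; on a warehouse you would read information_schema.columns instead:

```python
import sqlite3

# Minimal table/column inventory via SQLite's metadata pragmas.
# The example table is made up; a warehouse would use information_schema.

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE fct_orders (order_id INTEGER PRIMARY KEY, amount REAL, order_date TEXT)"
)

inventory = {
    table: [(col, dtype) for _, col, dtype, *_ in con.execute(f"PRAGMA table_info({table})")]
    for (table,) in con.execute("SELECT name FROM sqlite_master WHERE type='table'")
}
print(inventory)
```

The mechanics are trivial; the discipline is keeping the inventory complete as the schema changes.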
Relationships between entities
This one trips people up the most. JOINs are the hardest part of SQL generation. The semantic layer needs to know that customer_id in the orders table references customer_id in the customers table, and whether that is many-to-one or many-to-many. Get it wrong and your aggregations silently double-count.
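The double-counting failure is easy to reproduce. In this made-up example, a dimension table with duplicate rows per customer fans out the join and silently inflates revenue:

```python
# Demonstration of fan-out double-counting when a supposed many-to-one
# join is actually many-to-many. All data is made up.

orders = [  # one row per order
    {"order_id": 1, "customer_id": "c1", "amount": 100},
    {"order_id": 2, "customer_id": "c1", "amount": 50},
]
# A customer dimension that accidentally has two rows for the same
# customer (say, one per address): the join fans out.
customers = [
    {"customer_id": "c1", "region": "EMEA", "address": "a"},
    {"customer_id": "c1", "region": "EMEA", "address": "b"},
]

# Naive inner join (dict merge needs Python 3.9+)
joined = [o | c for o in orders for c in customers
          if o["customer_id"] == c["customer_id"]]

true_revenue = sum(o["amount"] for o in orders)    # 150
joined_revenue = sum(r["amount"] for r in joined)  # 300: each order matched twice
print(true_revenue, joined_revenue)
```

Nothing errors, nothing warns; the number is just double what it should be, which is exactly why declared cardinality matters.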
Business definitions
Which column is "revenue"? What filter defines "active customer"? No amount of schema metadata answers these questions. This is the part that requires humans sitting in a room until everyone agrees. The semantic layer just gives you a place to write down the answer.
Time dimensions
Nearly every metric query involves time, and time is surprisingly tricky. The semantic layer needs to know which columns are temporal, what grains make sense, and how your fiscal calendar maps to the regular one. A "quarterly revenue" number is wrong if the tool thinks Q1 starts in January but your finance team starts it in February.
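The fiscal mapping itself is small, which is part of why it is so easy to get silently wrong. A sketch assuming a February fiscal year start, as in the example above; `fiscal_quarter` is a hypothetical helper, not any tool's API:

```python
from datetime import date

# Hypothetical helper: map a calendar date to a fiscal quarter when the
# fiscal year starts in an arbitrary month (February here, per the example).

def fiscal_quarter(d: date, fiscal_year_start_month: int = 2) -> int:
    # Whole months elapsed since the fiscal year began, modulo 12
    offset = (d.month - fiscal_year_start_month) % 12
    return offset // 3 + 1

print(fiscal_quarter(date(2024, 1, 15)))  # January falls in Q4 of a Feb-start year
print(fiscal_quarter(date(2024, 2, 1)))   # February opens Q1
```

A tool that assumes a January start would label both dates Q1, and every "quarterly revenue" number downstream would be confidently wrong.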
The Metadata Bootstrapping Problem
Building a semantic model from scratch is tedious. For a database with 100 tables, you need to document every entity, every relationship, every measure, and every dimension. Most teams start enthusiastically and run out of energy around table 30.
Squish exists to skip that mechanical first phase. Connect your database and it discovers:

- Every table and column, with data types
- Explicit foreign key relationships declared in the schema
- Implicit relationships that were never declared as foreign keys
Export the results as a semantic model YAML and you have a first draft that covers the structural metadata. You still need to add business definitions (what "revenue" means, what "active" means), but you skip the mechanical work of mapping tables, columns, and relationships by hand.
Relationship Discovery Is the Foundation
The most critical piece of metadata for a semantic layer is the relationship graph. Without it, the semantic engine cannot generate JOINs, which means it cannot answer any question that spans multiple tables. And most interesting business questions span multiple tables.
Explicit foreign keys cover some relationships. But as we have written about before, production databases typically have 3-5x more implicit relationships than explicit ones. A semantic layer built only on explicit foreign keys has blind spots wherever implicit relationships exist.
Automated discovery finds both. The semantic layer gets a complete relationship graph, which means it can generate JOINs for every valid entity traversal, not just the ones someone remembered to document.
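As an illustration of what automated discovery involves (a deliberately simplified heuristic, not Squish's actual algorithm): a candidate join is plausible when the column name points at the table and nearly every child value resolves to a parent key.

```python
# Simplified relationship-discovery heuristic, for illustration only:
# combine a naming hint with value-overlap evidence.

def overlap(child_values: set, parent_keys: set) -> float:
    """Fraction of child values that resolve to a parent key."""
    if not child_values:
        return 0.0
    return len(child_values & parent_keys) / len(child_values)

def looks_like_fk(child_col: str, parent_table: str,
                  child_values: set, parent_keys: set) -> bool:
    # Naive singularization: "customers" -> "customer_id"
    name_hint = child_col == f"{parent_table.rstrip('s')}_id"
    return name_hint and overlap(child_values, parent_keys) > 0.95

print(looks_like_fk("customer_id", "customers",
                    {"c1", "c2", "c3"}, {"c1", "c2", "c3", "c4"}))
```

Real discovery has to handle messier naming, sampled value sets, and cardinality checks, but the core signal is the same: undeclared relationships leave statistical fingerprints in the data.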
Getting Started
If you are building a semantic layer:

1. Inventory your tables, columns, and relationships first; JOIN generation depends on them
2. Automate the structural metadata instead of documenting it by hand
3. Spend your human effort on business definitions and time semantics
The structural metadata is grunt work. The business definitions are where your team actually adds value. Automate the grunt work.