Skip to content

Relational Mental Model

SQL is a declarative language over relations. You describe the result; the engine chooses a physical plan. Staff-level SQL requires being fluent in both layers.

The most important question:

One row represents what?

That is the grain. A primary key or natural key is only meaningful relative to a grain.

TablePossible grainCommon trap
ordersone row per orderJoining to items multiplies order rows.
order_itemsone row per order-product lineSumming order total after joining to items double counts.
eventsone row per emitted eventCounting users requires distinct or prior user-grain collapse.
feature_snapshotsone row per entity-feature-timePivoting features can duplicate entity rows if snapshot time is not fixed.

SQL query results are usually bags: duplicates are preserved unless you remove them.

SELECT customer_id
FROM orders;

This returns one row per order, not one row per customer. Use DISTINCT only when that is the intended semantic operation, not as a bandage over a bad join.

If order_id -> customer_id, each order_id has exactly one customer_id. Optimizers may exploit these dependencies when constraints are declared, and humans should use them to reason about joins.

Interview phrase:

I want to establish whether this is a many-to-one dimensional join or whether it can fan out. If it can fan out, I will pre-aggregate or deduplicate before joining.

Join shapeResult risk
one-to-oneLow, but still check missing rows.
many-to-oneUsually safe for enrichment.
one-to-manySafe only if row multiplication is intended.
many-to-manyHigh risk; often needs bridge-table logic or pre-aggregation.

This query still has order grain even though it selects only customer_id:

SELECT customer_id
FROM orders
WHERE status = 'paid';

To change the grain, use grouping, distinct, window filtering, or a semijoin.

SELECT customer_id
FROM orders
WHERE status = 'paid'
GROUP BY customer_id;

Use CTEs to name semantic transitions.

WITH paid_orders AS (
	SELECT order_id, customer_id, order_ts, total_amount
	FROM orders
	WHERE status = 'paid'
),
customer_revenue AS (
	SELECT customer_id, SUM(total_amount) AS revenue
	FROM paid_orders
	GROUP BY customer_id
)
SELECT customer_id, revenue
FROM customer_revenue;

This is not only readability. It exposes the grain at each step.

1. orders has one row per order and order_items has one row per order item. What happens if you sum orders.total_amount after joining to order_items?

You double count orders with multiple items because the order row is repeated once per item. Either aggregate from orders alone or pre-aggregate item-level revenue separately.

SELECT SUM(total_amount) AS revenue
FROM orders
WHERE status = 'paid';
2. Why is DISTINCT a dangerous fix after a bad join?

It may hide row multiplication without restoring the intended measure. If duplicated rows have different metric values, DISTINCT on projected columns can silently drop real data or keep amplified data. Fix the join cardinality or aggregate to the correct grain before joining.

3. How do you prove a dimension table is safe to join many-to-one?

Check uniqueness on the join key.

SELECT product_id, COUNT(*) AS row_count
FROM products
GROUP BY product_id
HAVING COUNT(*) > 1;

No rows means product_id is unique in products for the current data. In production, enforce it with a constraint or data quality test.