Relational Mental Model

SQL is a declarative language over relations. You describe the result; the engine chooses a physical plan. Staff-level SQL requires being fluent in both layers.

Relation, Grain, Key

The most important question:

One row represents what?

That is the grain. A primary key or natural key is only meaningful relative to a grain.

Table	Possible grain	Common trap
`orders`	one row per order	Joining to items multiplies order rows.
`order_items`	one row per order-product line	Summing order total after joining to items double counts.
`events`	one row per emitted event	Counting users requires distinct or prior user-grain collapse.
`feature_snapshots`	one row per entity-feature-time	Pivoting features can duplicate entity rows if snapshot time is not fixed.

Bags, Not Sets

SQL query results are usually bags: duplicates are preserved unless you remove them.

SELECT customer_id
FROM orders;

This returns one row per order, not one row per customer. Use DISTINCT only when that is the intended semantic operation, not as a bandage over a bad join.

Functional Dependencies

If order_id -> customer_id, each order_id has exactly one customer_id. Optimizers may exploit these dependencies when constraints are declared, and humans should use them to reason about joins.

Interview phrase:

I want to establish whether this is a many-to-one dimensional join or whether it can fan out. If it can fan out, I will pre-aggregate or deduplicate before joining.

Join Cardinality

Join shape	Result risk
one-to-one	Low, but still check missing rows.
many-to-one	Usually safe for enrichment.
one-to-many	Safe only if row multiplication is intended.
many-to-many	High risk; often needs bridge-table logic or pre-aggregation.

Projection Does Not Change Grain

This query still has order grain even though it selects only customer_id:

SELECT customer_id
FROM orders
WHERE status = 'paid';

To change the grain, use grouping, distinct, window filtering, or a semijoin.

SELECT customer_id
FROM orders
WHERE status = 'paid'
GROUP BY customer_id;

CTEs As Contracts

Use CTEs to name semantic transitions.

WITH paid_orders AS (
	SELECT order_id, customer_id, order_ts, total_amount
	FROM orders
	WHERE status = 'paid'
),
customer_revenue AS (
	SELECT customer_id, SUM(total_amount) AS revenue
	FROM paid_orders
	GROUP BY customer_id
)
SELECT customer_id, revenue
FROM customer_revenue;

This is not only readability. It exposes the grain at each step.

Practice

1. orders has one row per order and order_items has one row per order item. What happens if you sum orders.total_amount after joining to order_items?

You double count orders with multiple items because the order row is repeated once per item. Either aggregate from orders alone or pre-aggregate item-level revenue separately.

SELECT SUM(total_amount) AS revenue
FROM orders
WHERE status = 'paid';

2. Why is DISTINCT a dangerous fix after a bad join?

It may hide row multiplication without restoring the intended measure. If duplicated rows have different metric values, DISTINCT on projected columns can silently drop real data or keep amplified data. Fix the join cardinality or aggregate to the correct grain before joining.

3. How do you prove a dimension table is safe to join many-to-one?

Check uniqueness on the join key.

SELECT product_id, COUNT(*) AS row_count
FROM products
GROUP BY product_id
HAVING COUNT(*) > 1;

No rows means product_id is unique in products for the current data. In production, enforce it with a constraint or data quality test.