Skip to content

Rewrite Playbook

Optimization is usually about reducing rows, columns, shuffles, sorts, and ambiguity as early as semantics allow.

Bad when order_items has many rows per order and you only need product totals:

SELECT
	p.category,
	SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders o
JOIN order_items oi
	ON oi.order_id = o.order_id
JOIN products p
	ON p.product_id = oi.product_id
WHERE o.status = 'paid'
GROUP BY p.category;

This is not automatically wrong, but if orders has many columns or filters are complex, isolate the small paid order keyset first.

WITH paid_orders AS (
	SELECT order_id
	FROM orders
	WHERE status = 'paid'
),
product_revenue AS (
	SELECT
		oi.product_id,
		SUM(oi.quantity * oi.unit_price) AS revenue
	FROM order_items oi
	JOIN paid_orders po
		ON po.order_id = oi.order_id
	GROUP BY oi.product_id
)
SELECT p.category, SUM(pr.revenue) AS revenue
FROM product_revenue pr
JOIN products p
	ON p.product_id = pr.product_id
GROUP BY p.category;

Bad if it multiplies customers by paid orders:

SELECT DISTINCT c.customer_id
FROM customers c
JOIN orders o
	ON o.customer_id = c.customer_id
WHERE o.status = 'paid';

Better:

SELECT c.customer_id
FROM customers c
WHERE EXISTS (
	SELECT 1
	FROM orders o
	WHERE o.customer_id = c.customer_id
		AND o.status = 'paid'
);

3. Replace OR With UNION ALL When Selectivity Differs

Section titled “3. Replace OR With UNION ALL When Selectivity Differs”

Sometimes:

WHERE country = 'US' OR acquisition_channel = 'paid_search'

prevents clean access paths. Consider:

SELECT customer_id
FROM customers
WHERE country = 'US'

UNION

SELECT customer_id
FROM customers
WHERE acquisition_channel = 'paid_search';

Use UNION to deduplicate. Use UNION ALL only when overlap is impossible or duplicates are intended.

4. Move Filters Into The Earliest Safe CTE

Section titled “4. Move Filters Into The Earliest Safe CTE”
WITH recent_paid_orders AS (
	SELECT order_id, customer_id, total_amount
	FROM orders
	WHERE status = 'paid'
		AND order_ts >= TIMESTAMP '2026-01-01 00:00:00'
)
SELECT customer_id, SUM(total_amount) AS revenue
FROM recent_paid_orders
GROUP BY customer_id;

The key word is safe. Do not move a filter across an outer join if it changes preserved rows.

Bad:

SELECT
	(SELECT COUNT(*) FROM orders WHERE status = 'paid') AS paid_orders,
	(SELECT COUNT(*) FROM orders WHERE status = 'cancelled') AS cancelled_orders;

Better:

SELECT
	SUM(CASE WHEN status = 'paid' THEN 1 ELSE 0 END) AS paid_orders,
	SUM(CASE WHEN status = 'cancelled' THEN 1 ELSE 0 END) AS cancelled_orders
FROM orders;

6. Use Window Filtering Instead Of Self-Joins

Section titled “6. Use Window Filtering Instead Of Self-Joins”

Find latest order per customer.

WITH ranked AS (
	SELECT
		o.*,
		ROW_NUMBER() OVER (
			PARTITION BY customer_id
			ORDER BY order_ts DESC, order_id DESC
		) AS rn
	FROM orders o
)
SELECT *
FROM ranked
WHERE rn = 1;

This is usually clearer than a self-join on max timestamp, and it handles ties explicitly.

7. Materialize Expensive Reused Intermediates

Section titled “7. Materialize Expensive Reused Intermediates”

If many downstream queries recompute daily user activity, make it a modeled table:

daily_user_activity(user_id, activity_date, event_count, first_event_ts, last_event_ts)

This shifts cost from repeated ad hoc queries to a tested pipeline with freshness and quality checks.

In column stores and distributed systems, selecting fewer columns can reduce I/O and network.

WITH required_events AS (
	SELECT user_id, event_ts, event_name
	FROM events
	WHERE event_date >= DATE '2026-01-01'
)
SELECT ...

If one customer_id or user_id dominates the data, normal hash partitioning can create a hot task.

Possible mitigations:

  • pre-aggregate by finer keys
  • isolate hot keys
  • salt skewed joins
  • broadcast small side
  • repartition on a better key
  • change the data model

If correctness depends on fuzzy matching, recursive graph traversal, or streaming state with exactly-once semantics, SQL may still be part of the solution, but not the entire system. Say that clearly.

1. Rewrite a JOIN + DISTINCT that only checks whether a customer has paid.
SELECT c.customer_id
FROM customers c
WHERE EXISTS (
	SELECT 1
	FROM orders o
	WHERE o.customer_id = c.customer_id
		AND o.status = 'paid'
);

This expresses a semijoin and avoids row multiplication.

2. A dashboard has five metrics from the same events table and currently scans it five times. What rewrite do you propose?

Use one filtered scan and conditional aggregation.

WITH base AS (
	SELECT event_name, user_id
	FROM events
	WHERE event_date = DATE '2026-06-01'
)
SELECT
	COUNT(*) AS total_events,
	COUNT(DISTINCT user_id) AS active_users,
	SUM(CASE WHEN event_name = 'signup' THEN 1 ELSE 0 END) AS signups,
	SUM(CASE WHEN event_name = 'purchase' THEN 1 ELSE 0 END) AS purchases,
	SUM(CASE WHEN event_name = 'refund' THEN 1 ELSE 0 END) AS refunds
FROM base;
3. When is pre-aggregation before join invalid?

When downstream logic needs row-level detail that would be lost, or when the aggregation grain does not preserve the join semantics. Example: pre-aggregating order items by product before joining to order-level customer attributes loses customer allocation.