Optimization Problems

For each prompt, say the likely bottleneck before opening the answer.

1. WHERE DATE(order_ts) = DATE '2026-06-01' scans all partitions.

Rewrite as a half-open range.

WHERE order_ts >= TIMESTAMP '2026-06-01 00:00:00'
	AND order_ts < TIMESTAMP '2026-06-02 00:00:00'

Expected improvement: partition pruning, index range scan, or file metadata skipping on order_ts.

2. Query uses JOIN + DISTINCT to find customers with orders.

Use a semijoin.

SELECT c.customer_id
FROM customers c
WHERE EXISTS (
	SELECT 1
	FROM orders o
	WHERE o.customer_id = c.customer_id
);

Expected improvement: no row multiplication and clearer existence semantics.

3. A revenue query joins orders to items and sums orders.total_amount.

Do not sum order totals after joining to item grain. Aggregate at order grain or use item revenue.

SELECT SUM(total_amount) AS revenue
FROM orders
WHERE status = 'paid';

If category is required, use item-level revenue:

SELECT p.category, SUM(oi.quantity * oi.unit_price) AS revenue
FROM order_items oi
JOIN products p
	ON p.product_id = oi.product_id
JOIN orders o
	ON o.order_id = oi.order_id
WHERE o.status = 'paid'
GROUP BY p.category;

4. Dashboard scans events separately for signups, purchases, refunds, and active users.

Use one filtered base and conditional aggregation.

WITH base AS (
	SELECT user_id, event_name
	FROM events
	WHERE event_date = DATE '2026-06-01'
)
SELECT
	COUNT(DISTINCT user_id) AS active_users,
	SUM(CASE WHEN event_name = 'signup' THEN 1 ELSE 0 END) AS signups,
	SUM(CASE WHEN event_name = 'purchase' THEN 1 ELSE 0 END) AS purchases,
	SUM(CASE WHEN event_name = 'refund' THEN 1 ELSE 0 END) AS refunds
FROM base;

Expected improvement: fewer scans and less repeated I/O.

5. Query joins two huge tables then filters by event date from one side.

Push the filter before the join.

WITH recent_events AS (
	SELECT user_id, event_ts, event_name
	FROM events
	WHERE event_date >= DATE '2026-06-01'
		AND event_date < DATE '2026-07-01'
)
SELECT ...
FROM recent_events e
JOIN predictions p
	ON p.entity_id = e.user_id;

Expected improvement: smaller join input and less shuffle/network.

6. NOT IN query returns zero rows unexpectedly.

The subquery may contain NULL. Use NOT EXISTS.

SELECT c.customer_id
FROM customers c
WHERE NOT EXISTS (
	SELECT 1
	FROM banned_customers b
	WHERE b.customer_id = c.customer_id
);

7. Spark query has one task running for 40 minutes after all others finish.

Likely skew. Inspect top keys and shuffle partition sizes.

SELECT join_key, COUNT(*) AS rows
FROM large_table
GROUP BY join_key
ORDER BY rows DESC
FETCH FIRST 20 ROWS ONLY;

Mitigations: isolate hot keys, pre-aggregate, salt skewed keys, broadcast small side, or change partitioning.

8. COUNT(DISTINCT user_id) over massive event data is too slow.

Options depend on accuracy requirements:

exact: pre-aggregate distinct user-day or user-month tables
approximate: engine-specific approximate distinct sketches
serving: materialized aggregate table
semantic: check if user grain can be established upstream

SQL exact pre-aggregation:

WITH user_days AS (
	SELECT DISTINCT event_date, user_id
	FROM events
)
SELECT event_date, COUNT(*) AS active_users
FROM user_days
GROUP BY event_date;

9. Latest row per key is implemented with a self-join to max timestamp and returns duplicates.

Use ROW_NUMBER with deterministic tie breakers.

WITH ranked AS (
	SELECT
		t.*,
		ROW_NUMBER() OVER (
			PARTITION BY natural_key
			ORDER BY updated_at DESC, source_sequence DESC
		) AS rn
	FROM source_table t
)
SELECT *
FROM ranked
WHERE rn = 1;

10. A left join query is missing rows from the left table.

Check for right-table filters in WHERE.

Wrong:

FROM customers c
LEFT JOIN orders o
	ON o.customer_id = c.customer_id
WHERE o.status = 'paid'

Right when preserving customers:

FROM customers c
LEFT JOIN orders o
	ON o.customer_id = c.customer_id
	AND o.status = 'paid'

11. Query does ORDER BY created_ts DESC over a huge table just to get the latest 100 rows.

Potential fixes:

index or clustering on created_ts
partition prune by recent date if business allows
maintain a latest-records table
avoid sorting full history when only recent partitions can contain latest rows

Example:

SELECT *
FROM events
WHERE event_date >= DATE '2026-06-24'
ORDER BY event_ts DESC
FETCH FIRST 100 ROWS ONLY;

Only valid if you can prove the latest 100 must be in that date range.

12. A CTE is referenced three times and the query is slow.

Some engines inline CTEs; some materialize them; some choose. If the CTE is expensive and reused, consider a temporary table/materialized intermediate, or rewrite to aggregate once.

Staff answer:

I would inspect the plan to see whether the CTE is scanned/recomputed multiple times. If yes, materialize the intermediate with the right grain, columns, and indexes/layout for downstream joins.

13. A query uses SELECT * in a distributed columnar engine.

Project only needed columns.

SELECT user_id, event_ts, event_name
FROM events
WHERE event_date = DATE '2026-06-01';

Expected improvement: less column I/O, less network, less memory, less spill.

14. A join to a dimension table unexpectedly multiplies rows.

Check dimension key uniqueness.

SELECT product_id, COUNT(*) AS rows
FROM products
GROUP BY product_id
HAVING COUNT(*) > 1;

If duplicates are valid because it is type 2, join with the validity interval. If duplicates are invalid, fix upstream data and add a uniqueness test.

15. The optimizer picks a nested-loop join between two large tables.

Likely causes: stale stats, missing join predicates, bad selectivity estimates, disabled hash join, or predicates that hide cardinality. Verify row estimates vs actuals, update stats, check for accidental cross join, and consider rewriting filters/joins to expose equality predicates.