Final 24 Hours

Use this as a compact rehearsal sheet.

Default Opening For Any SQL Problem

Say this before writing:

I’ll first define the output grain and time boundary, then build the query in layers. I’ll watch for join fanout, NULL behavior, and whether filters can be pushed before expensive joins or shuffles.

Query Correctness Checklist

What is one output row?
Which rows must be preserved?
Is this event time or processing time?
Does any join fan out?
Are NULLs included or excluded?
Are duplicates meaningful?
Are ties deterministic?
Does an outer join accidentally become inner?
Does the query need a date spine?
Is current-state dimension data leaking into history?

Performance Checklist

Are filters sargable?
Are date filters half-open ranges?
Are only needed columns selected?
Are partitions/files pruned?
Is a join used only for existence?
Can a large input be pre-aggregated?
Are stats stale?
Is there a global sort/window/distinct?
Is a distributed query shuffling huge data?
Is there key skew?
Is COUNT(DISTINCT) the real bottleneck?

Staff-Level Close

End performance answers with:

I would verify the rewrite with EXPLAIN, compare estimated and actual row counts at each stage, and add a test or modeled table if this query expresses a recurring business contract.

End ML dataset answers with:

I would enforce point-in-time joins, define label windows explicitly, test for feature availability after prediction time, and use a time-based validation split to reduce leakage.

End distributed answers with:

I would identify whether the dominant cost is scan, shuffle, skew, spill, or write, because each has a different fix.

High-Yield Patterns To Rehearse

Latest row per key with ROW_NUMBER.
Existence with EXISTS.
Anti join with NOT EXISTS.
Conditional aggregation.
Top N per group.
Gaps and islands.
Sessionization.
Point-in-time feature join.
SCD type 2 interval join.
Revenue by dimension without double counting.
Funnel with ordered steps.
Retention cohort.
Confusion matrix and calibration buckets.
Query rewrite from non-sargable date filter.
Distributed skew diagnosis.

Five Questions You Should Ask Interviewers

What is the expected output grain?
Should rows with no activity appear as zero?
Which timestamp defines the business event?
Are duplicate events possible?
Is exactness required, or are approximate/sketched metrics acceptable?

Red-Flag Smells

Smell	Likely issue
`SELECT DISTINCT` after many joins	hidden fanout
`NOT IN (subquery)`	NULL trap
`WHERE right_table.col = ...` after left join	outer join turned inner
`DATE(timestamp_col)` in predicate	non-sargable filter
current dimension join for historical data	temporal leakage
random train/test split for time data	future leakage
unbounded `COUNT(DISTINCT)` on raw events	expensive global aggregation
one distributed task runs forever	skew
dashboard over raw `SELECT *` events	missing curated mart

Mini Mock

Prompt: "Write SQL for weekly active users and explain how you would make it production-grade."

Start with correctness:

WITH user_days AS (
	SELECT DISTINCT user_id, event_date
	FROM events
),
weekly AS (
	SELECT
		d.week_start_date,
		u.user_id
	FROM week_spine d
	JOIN user_days u
		ON u.event_date >= d.week_start_date
		AND u.event_date < d.week_start_date + INTERVAL '7' DAY
)
SELECT week_start_date, COUNT(DISTINCT user_id) AS wau
FROM weekly
GROUP BY week_start_date;

Production-grade answer:

define user identity rules
exclude bots/test traffic if needed
use event time
handle late data with a watermark/backfill
pre-aggregate user-day activity
partition by event date
test row counts and duplicate event rates
publish a metric contract