Skip to content

Final 24 Hours

Use this as a compact rehearsal sheet.

Say this before writing:

I’ll first define the output grain and time boundary, then build the query in layers. I’ll watch for join fanout, NULL behavior, and whether filters can be pushed before expensive joins or shuffles.

  • What is one output row?
  • Which rows must be preserved?
  • Is this event time or processing time?
  • Does any join fan out?
  • Are NULLs included or excluded?
  • Are duplicates meaningful?
  • Are ties deterministic?
  • Does an outer join accidentally become inner?
  • Does the query need a date spine?
  • Is current-state dimension data leaking into history?
  • Are filters sargable?
  • Are date filters half-open ranges?
  • Are only needed columns selected?
  • Are partitions/files pruned?
  • Is a join used only for existence?
  • Can a large input be pre-aggregated?
  • Are stats stale?
  • Is there a global sort/window/distinct?
  • Is a distributed query shuffling huge data?
  • Is there key skew?
  • Is COUNT(DISTINCT) the real bottleneck?

End performance answers with:

I would verify the rewrite with EXPLAIN, compare estimated and actual row counts at each stage, and add a test or modeled table if this query expresses a recurring business contract.

End ML dataset answers with:

I would enforce point-in-time joins, define label windows explicitly, test for feature availability after prediction time, and use a time-based validation split to reduce leakage.

End distributed answers with:

I would identify whether the dominant cost is scan, shuffle, skew, spill, or write, because each has a different fix.

  1. Latest row per key with ROW_NUMBER.
  2. Existence with EXISTS.
  3. Anti join with NOT EXISTS.
  4. Conditional aggregation.
  5. Top N per group.
  6. Gaps and islands.
  7. Sessionization.
  8. Point-in-time feature join.
  9. SCD type 2 interval join.
  10. Revenue by dimension without double counting.
  11. Funnel with ordered steps.
  12. Retention cohort.
  13. Confusion matrix and calibration buckets.
  14. Query rewrite from non-sargable date filter.
  15. Distributed skew diagnosis.

Five Questions You Should Ask Interviewers

Section titled “Five Questions You Should Ask Interviewers”
  • What is the expected output grain?
  • Should rows with no activity appear as zero?
  • Which timestamp defines the business event?
  • Are duplicate events possible?
  • Is exactness required, or are approximate/sketched metrics acceptable?
SmellLikely issue
SELECT DISTINCT after many joinshidden fanout
NOT IN (subquery)NULL trap
WHERE right_table.col = ... after left joinouter join turned inner
DATE(timestamp_col) in predicatenon-sargable filter
current dimension join for historical datatemporal leakage
random train/test split for time datafuture leakage
unbounded COUNT(DISTINCT) on raw eventsexpensive global aggregation
one distributed task runs foreverskew
dashboard over raw SELECT * eventsmissing curated mart
Prompt: "Write SQL for weekly active users and explain how you would make it production-grade."

Start with correctness:

WITH user_days AS (
	SELECT DISTINCT user_id, event_date
	FROM events
),
weekly AS (
	SELECT
		d.week_start_date,
		u.user_id
	FROM week_spine d
	JOIN user_days u
		ON u.event_date >= d.week_start_date
		AND u.event_date < d.week_start_date + INTERVAL '7' DAY
)
SELECT week_start_date, COUNT(DISTINCT user_id) AS wau
FROM weekly
GROUP BY week_start_date;

Production-grade answer:

  • define user identity rules
  • exclude bots/test traffic if needed
  • use event time
  • handle late data with a watermark/backfill
  • pre-aggregate user-day activity
  • partition by event date
  • test row counts and duplicate event rates
  • publish a metric contract