First 90 minutes
Read Query Order/NULLs/Joins, Grouping/Windowing, and the Beginner plus Intermediate problem banks. Speak the answers before opening each reveal.
Target: staff data engineer / staff ML engineer interviews where SQL is a major signal.
This is not a syntax crib sheet. The goal is to move from “I can write queries” to “I can reason about correctness, performance, distributed execution, feature pipelines, and data product tradeoffs under interview pressure.”
Your default message:
I write SQL as a specification of data intent, then validate whether the physical plan, data model, and operational constraints can actually support that intent at production scale.
At staff level, interviewers are not only testing whether the query returns the right rows. They are testing whether you catch grain mismatches, duplicate amplification, NULL traps, temporal leakage, skew, shuffle cost, stale stats, bad partitioning, and the difference between a query that works once and a data contract that can run every day.
First 90 minutes
Read Query Order/NULLs/Joins, Grouping/Windowing, and the Beginner plus Intermediate problem banks. Speak the answers before opening each reveal.
Optimization loop
Read Optimizer/Indexes/EXPLAIN and Rewrite Playbook, then practice the Optimization Problems page until you can explain plan shape without running it.
Volume reps
Use the topic drill banks for 260+ revealable SQL prompts across joins, windows, temporal analytics, ML datasets, distributed execution, data quality, and staff scenarios.
Staff loop
Use Distributed SQL, Feature/Label SQL, and Staff Scenarios to practice architecture-grade answers with SQL examples.
Final polish
Use Final 24 Hours and Glossary to tighten vocabulary, answer shape, and failure-mode checklists.
Most examples use this conceptual schema. Types are intentionally generic.
customers(customer_id, signup_ts, country, acquisition_channel)
orders(order_id, customer_id, order_ts, status, currency, total_amount)
order_items(order_id, product_id, quantity, unit_price)
products(product_id, category, created_ts, is_active)
events(event_id, user_id, event_ts, event_name, session_id, properties)
sessions(session_id, user_id, started_at, ended_at, device_type)
payments(payment_id, order_id, payment_ts, amount, status)
experiments(user_id, experiment_name, variant, assigned_ts)
predictions(entity_id, prediction_ts, score, model_version)
labels(entity_id, label_ts, label_value)
feature_snapshots(entity_id, snapshot_ts, feature_name, feature_value)
| Invariant | Why it matters |
|---|---|
| Grain before join | Most wrong SQL comes from multiplying rows before aggregation. |
| Time before label | ML features must be computed from information available before prediction time. |
| Filter before shuffle | In distributed systems, reducing bytes early usually dominates clever syntax. |
| Semantics before tuning | A fast wrong query is worse than a slow visible one. |
| Plan before guess | Use EXPLAIN and row counts to verify optimizer assumptions. |
Senior candidates optimize syntax. Staff candidates optimize the contract between SQL semantics, physical execution, data model, and downstream consumers.