"Can SQL Do This?" — Escaping Subquery Hell with Window Functions

Escaping Subquery Hell

"Get me month-over-month growth rate", "Top 3 products in each category", "Cohort retention analysis"...

Ever received these requests and ended up nesting subqueries inside subqueries, inside more subqueries, until you created a query even you couldn't understand? Master Window functions properly, and complex analytical queries become clean and fast.

1. What Are Window Functions? — 1 Minute Explanation

Subquery Approach vs Window Function Approach

"Show me each employee's salary alongside their department average salary"

With subquery:

SELECT
    e.name,
    e.department,
    e.salary,
    (SELECT AVG(salary) FROM employees e2
     WHERE e2.department = e.department) as dept_avg
FROM employees e;

The subquery executes for every row. 10,000 rows means 10,000 executions.

With Window function:

SELECT
    name,
    department,
    salary,
    AVG(salary) OVER (PARTITION BY department) as dept_avg
FROM employees;

Calculated once. Same result, much faster.

Key Difference

2. OVER() Clause Breakdown

The core of Window functions is the OVER() clause. It has three parts:

SUM(revenue) OVER (
    PARTITION BY category    -- Within which group?
    ORDER BY date            -- In what order?
    ROWS BETWEEN             -- Over what range?
        UNBOUNDED PRECEDING
        AND CURRENT ROW
)

PARTITION BY — Dividing Groups

-- Each product's share of total and category revenue
SELECT
    product_name,
    revenue,
    revenue * 100.0 / SUM(revenue) OVER () as total_share,           -- vs Total
    revenue * 100.0 / SUM(revenue) OVER (PARTITION BY category) as category_share  -- vs Category
FROM products;

Result:

ORDER BY — Setting the Order

Adding ORDER BY enables cumulative calculations:

-- Daily cumulative revenue
SELECT
    date,
    daily_revenue,
    SUM(daily_revenue) OVER (ORDER BY date) as cumulative_revenue
FROM daily_sales;

Result:

3. LAG/LEAD — The Core of Month-over-Month Calculations

Basic Usage

LAG(column, N, default)  -- Value N rows back (default 1)
LEAD(column, N, default) -- Value N rows forward (default 1)

Month-over-Month Growth Rate (MoM)

WITH monthly_revenue AS (
    SELECT
        strftime('%Y-%m', order_date) as month,
        SUM(total_amount) as revenue
    FROM orders
    WHERE status = 'completed'
    GROUP BY strftime('%Y-%m', order_date)
)
SELECT
    month,
    revenue,
    LAG(revenue) OVER (ORDER BY month) as prev_month,
    revenue - LAG(revenue) OVER (ORDER BY month) as diff,
    ROUND((revenue - LAG(revenue) OVER (ORDER BY month)) * 100.0 /
          LAG(revenue) OVER (ORDER BY month), 1) as growth_pct
FROM monthly_revenue
ORDER BY month;

Result:

Week-over-Week Same Day (WoW)

SELECT
    date,
    revenue,
    LAG(revenue, 7) OVER (ORDER BY date) as same_day_last_week,
    ROUND((revenue - LAG(revenue, 7) OVER (ORDER BY date)) * 100.0 /
          LAG(revenue, 7) OVER (ORDER BY date), 1) as wow_growth
FROM daily_sales
WHERE date >= DATE('now', '-14 days');

4. The Ranking Trio — When to Use Which?

ROW_NUMBER vs RANK vs DENSE_RANK

SELECT
    product_name,
    revenue,
    ROW_NUMBER() OVER (ORDER BY revenue DESC) as row_num,
    RANK() OVER (ORDER BY revenue DESC) as rank,
    DENSE_RANK() OVER (ORDER BY revenue DESC) as dense_rank
FROM products;

Result (with ties):

Differences:

ROW_NUMBER: Always sequential numbers (ignores ties)
RANK: Ties get same rank, next rank skipped (1,2,2,4)
DENSE_RANK: Ties get same rank, next rank not skipped (1,2,2,3)

When to Use Which?

ROW_NUMBER — When you need exactly N items

-- Top 3 products per category (exactly 3)
WITH ranked AS (
    SELECT
        category,
        product_name,
        revenue,
        ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) as rn
    FROM products
)
SELECT * FROM ranked WHERE rn <= 3;

RANK — When you want to include all tied entries

-- Top 10, but include all if 10th place has ties
WITH ranked AS (
    SELECT product_name, revenue,
           RANK() OVER (ORDER BY revenue DESC) as rank
    FROM products
)
SELECT * FROM ranked WHERE rank <= 10;
-- Returns 12 rows if 3 items are tied for 10th

DENSE_RANK — When you want no gaps in ranking numbers

-- 1st, 2nd, 3rd... without skipping
SELECT product_name, revenue,
       DENSE_RANK() OVER (ORDER BY revenue DESC) as rank
FROM products;

5. Moving Averages and Cumulative Totals

7-Day Moving Average

SELECT
    date,
    daily_revenue,
    ROUND(AVG(daily_revenue) OVER (
        ORDER BY date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ), 0) as ma7
FROM daily_sales;

Result:

Monthly Cumulative Revenue (Reset Each Month)

SELECT
    date,
    daily_revenue,
    SUM(daily_revenue) OVER (
        PARTITION BY strftime('%Y-%m', date)  -- Reset monthly
        ORDER BY date
    ) as mtd_revenue  -- Month-to-Date
FROM daily_sales;

Result:

6. ROWS vs RANGE — Subtle but Important Difference

Different Behavior with Duplicate Dates

-- Data: Multiple orders on same date
-- | date       | amount |
-- |------------|--------|
-- | 2024-01-01 | 100    |
-- | 2024-01-01 | 150    |  ← Same date
-- | 2024-01-02 | 200    |

-- ROWS: By physical row order
SELECT date, amount,
       SUM(amount) OVER (ORDER BY date ROWS UNBOUNDED PRECEDING) as rows_sum
FROM orders;
-- Result: 100, 250, 450

-- RANGE: Groups same values together
SELECT date, amount,
       SUM(amount) OVER (ORDER BY date RANGE UNBOUNDED PRECEDING) as range_sum
FROM orders;
-- Result: 250, 250, 450  ← Same dates get same total!

Pro Tip: In most cases, ROWS is more intuitive. Use RANGE only when you need "same values processed together."

7. Practical: Cohort Retention Analysis

Group customers by their first purchase month and analyze how many repurchased in subsequent months:

WITH customer_cohort AS (
    -- Customer's first purchase month
    SELECT
        customer_id,
        strftime('%Y-%m', MIN(order_date)) as cohort_month
    FROM orders WHERE status = 'completed'
    GROUP BY customer_id
),
monthly_activity AS (
    -- Customer activity by month
    SELECT DISTINCT
        o.customer_id,
        cc.cohort_month,
        strftime('%Y-%m', o.order_date) as activity_month
    FROM orders o
    JOIN customer_cohort cc ON o.customer_id = cc.customer_id
    WHERE o.status = 'completed'
)
SELECT
    cohort_month,
    COUNT(DISTINCT customer_id) as M0,
    COUNT(DISTINCT CASE WHEN activity_month =
        strftime('%Y-%m', DATE(cohort_month || '-01', '+1 month')) THEN customer_id END) as M1,
    COUNT(DISTINCT CASE WHEN activity_month =
        strftime('%Y-%m', DATE(cohort_month || '-01', '+2 month')) THEN customer_id END) as M2,
    COUNT(DISTINCT CASE WHEN activity_month =
        strftime('%Y-%m', DATE(cohort_month || '-01', '+3 month')) THEN customer_id END) as M3
FROM monthly_activity
GROUP BY cohort_month
ORDER BY cohort_month;

Result:

8. Organizing Complex Queries with CTEs

Subquery Hell

SELECT * FROM (
    SELECT * FROM (
        SELECT customer_id, SUM(amount) as total
        FROM orders GROUP BY customer_id
    ) WHERE total > 100000
) WHERE customer_id IN (SELECT customer_id FROM vip_list);

Hard to read, right?

Organized with CTEs

WITH customer_totals AS (
    SELECT customer_id, SUM(amount) as total
    FROM orders
    GROUP BY customer_id
),
high_value AS (
    SELECT * FROM customer_totals
    WHERE total > 100000
)
SELECT hv.*
FROM high_value hv
WHERE hv.customer_id IN (SELECT customer_id FROM vip_list);

Naming each step makes it easier to read and debug.

9. Practical Example: Daily KPI Dashboard

WITH daily_metrics AS (
    SELECT
        DATE(order_date) as date,
        SUM(total_amount) as revenue,
        COUNT(*) as orders,
        COUNT(DISTINCT customer_id) as customers
    FROM orders
    WHERE status = 'completed'
      AND order_date >= DATE('now', '-30 days')
    GROUP BY DATE(order_date)
)
SELECT
    date,
    revenue,
    orders,
    customers,

    -- Day-over-day
    LAG(revenue) OVER (ORDER BY date) as prev_day_revenue,
    ROUND((revenue - LAG(revenue) OVER (ORDER BY date)) * 100.0 /
          LAG(revenue) OVER (ORDER BY date), 1) as dod_growth,

    -- Week-over-week (same day)
    LAG(revenue, 7) OVER (ORDER BY date) as week_ago_revenue,
    ROUND((revenue - LAG(revenue, 7) OVER (ORDER BY date)) * 100.0 /
          LAG(revenue, 7) OVER (ORDER BY date), 1) as wow_growth,

    -- 7-day moving average
    ROUND(AVG(revenue) OVER (
        ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ), 0) as ma7

FROM daily_metrics
ORDER BY date DESC
LIMIT 14;

Result:

10. Performance Tips: When Window Functions Are Slow

Problem: Multiple Different ORDER BYs Are Slow

-- ❌ Different sort each time → Slow
SELECT
    SUM(revenue) OVER (ORDER BY date) as sum1,
    AVG(revenue) OVER (ORDER BY category) as avg1,  -- Different sort
    LAG(revenue) OVER (ORDER BY product_id) as lag1 -- Another different sort
FROM sales;

Solution: Reuse the Same OVER Clause

-- ✅ Same sort shared → Fast
SELECT
    SUM(revenue) OVER w as running_sum,
    AVG(revenue) OVER w as running_avg,
    LAG(revenue) OVER w as prev_revenue
FROM sales
WINDOW w AS (ORDER BY date);  -- Named Window

Use Indexes

-- Index on PARTITION BY + ORDER BY columns
CREATE INDEX idx_sales_cat_date ON sales(category, date);

-- This query becomes faster
SELECT
    category,
    date,
    SUM(revenue) OVER (PARTITION BY category ORDER BY date) as cumsum
FROM sales;

Key Takeaways: 3 Principles of Window Functions

Aggregate While Preserving Rows: GROUP BY reduces rows, Window Functions preserve them
OVER() Is the Core: PARTITION BY for groups, ORDER BY for order, ROWS/RANGE for scope
LAG/LEAD for Comparisons: Month-over-month, week-over-week time series comparisons use LAG/LEAD

Remember these three things, and you can confidently answer "Yes" to "Can SQL do this?"

Appendix: Window Function Quick Reference

"Can SQL Do This?" — Escaping Subquery Hell with Window Functions

Escaping Subquery Hell

"Get me month-over-month growth rate", "Top 3 products in each category", "Cohort retention analysis"...

1. What Are Window Functions? — 1 Minute Explanation

Subquery Approach vs Window Function Approach

"Show me each employee's salary alongside their department average salary"

With subquery:

SELECT
    e.name,
    e.department,
    e.salary,
    (SELECT AVG(salary) FROM employees e2
     WHERE e2.department = e.department) as dept_avg
FROM employees e;

The subquery executes for every row. 10,000 rows means 10,000 executions.

With Window function:

SELECT
    name,
    department,
    salary,
    AVG(salary) OVER (PARTITION BY department) as dept_avg
FROM employees;

Calculated once. Same result, much faster.

Key Difference

2. OVER() Clause Breakdown

The core of Window functions is the OVER() clause. It has three parts:

SUM(revenue) OVER (
    PARTITION BY category    -- Within which group?
    ORDER BY date            -- In what order?
    ROWS BETWEEN             -- Over what range?
        UNBOUNDED PRECEDING
        AND CURRENT ROW
)

PARTITION BY — Dividing Groups

-- Each product's share of total and category revenue
SELECT
    product_name,
    revenue,
    revenue * 100.0 / SUM(revenue) OVER () as total_share,           -- vs Total
    revenue * 100.0 / SUM(revenue) OVER (PARTITION BY category) as category_share  -- vs Category
FROM products;

Result:

ORDER BY — Setting the Order

Adding ORDER BY enables cumulative calculations:

-- Daily cumulative revenue
SELECT
    date,
    daily_revenue,
    SUM(daily_revenue) OVER (ORDER BY date) as cumulative_revenue
FROM daily_sales;

Result:

3. LAG/LEAD — The Core of Month-over-Month Calculations

Basic Usage

LAG(column, N, default)  -- Value N rows back (default 1)
LEAD(column, N, default) -- Value N rows forward (default 1)

Month-over-Month Growth Rate (MoM)

WITH monthly_revenue AS (
    SELECT
        strftime('%Y-%m', order_date) as month,
        SUM(total_amount) as revenue
    FROM orders
    WHERE status = 'completed'
    GROUP BY strftime('%Y-%m', order_date)
)
SELECT
    month,
    revenue,
    LAG(revenue) OVER (ORDER BY month) as prev_month,
    revenue - LAG(revenue) OVER (ORDER BY month) as diff,
    ROUND((revenue - LAG(revenue) OVER (ORDER BY month)) * 100.0 /
          LAG(revenue) OVER (ORDER BY month), 1) as growth_pct
FROM monthly_revenue
ORDER BY month;

Result:

Week-over-Week Same Day (WoW)

SELECT
    date,
    revenue,
    LAG(revenue, 7) OVER (ORDER BY date) as same_day_last_week,
    ROUND((revenue - LAG(revenue, 7) OVER (ORDER BY date)) * 100.0 /
          LAG(revenue, 7) OVER (ORDER BY date), 1) as wow_growth
FROM daily_sales
WHERE date >= DATE('now', '-14 days');

4. The Ranking Trio — When to Use Which?

ROW_NUMBER vs RANK vs DENSE_RANK

SELECT
    product_name,
    revenue,
    ROW_NUMBER() OVER (ORDER BY revenue DESC) as row_num,
    RANK() OVER (ORDER BY revenue DESC) as rank,
    DENSE_RANK() OVER (ORDER BY revenue DESC) as dense_rank
FROM products;

Result (with ties):

Differences:

ROW_NUMBER: Always sequential numbers (ignores ties)
RANK: Ties get same rank, next rank skipped (1,2,2,4)
DENSE_RANK: Ties get same rank, next rank not skipped (1,2,2,3)

When to Use Which?

ROW_NUMBER — When you need exactly N items

-- Top 3 products per category (exactly 3)
WITH ranked AS (
    SELECT
        category,
        product_name,
        revenue,
        ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) as rn
    FROM products
)
SELECT * FROM ranked WHERE rn <= 3;

RANK — When you want to include all tied entries

-- Top 10, but include all if 10th place has ties
WITH ranked AS (
    SELECT product_name, revenue,
           RANK() OVER (ORDER BY revenue DESC) as rank
    FROM products
)
SELECT * FROM ranked WHERE rank <= 10;
-- Returns 12 rows if 3 items are tied for 10th

DENSE_RANK — When you want no gaps in ranking numbers

-- 1st, 2nd, 3rd... without skipping
SELECT product_name, revenue,
       DENSE_RANK() OVER (ORDER BY revenue DESC) as rank
FROM products;

5. Moving Averages and Cumulative Totals

7-Day Moving Average

SELECT
    date,
    daily_revenue,
    ROUND(AVG(daily_revenue) OVER (
        ORDER BY date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ), 0) as ma7
FROM daily_sales;

Result:

Monthly Cumulative Revenue (Reset Each Month)

SELECT
    date,
    daily_revenue,
    SUM(daily_revenue) OVER (
        PARTITION BY strftime('%Y-%m', date)  -- Reset monthly
        ORDER BY date
    ) as mtd_revenue  -- Month-to-Date
FROM daily_sales;

Result:

6. ROWS vs RANGE — Subtle but Important Difference

Different Behavior with Duplicate Dates

-- Data: Multiple orders on same date
-- | date       | amount |
-- |------------|--------|
-- | 2024-01-01 | 100    |
-- | 2024-01-01 | 150    |  ← Same date
-- | 2024-01-02 | 200    |

-- ROWS: By physical row order
SELECT date, amount,
       SUM(amount) OVER (ORDER BY date ROWS UNBOUNDED PRECEDING) as rows_sum
FROM orders;
-- Result: 100, 250, 450

-- RANGE: Groups same values together
SELECT date, amount,
       SUM(amount) OVER (ORDER BY date RANGE UNBOUNDED PRECEDING) as range_sum
FROM orders;
-- Result: 250, 250, 450  ← Same dates get same total!

Pro Tip: In most cases, ROWS is more intuitive. Use RANGE only when you need "same values processed together."

7. Practical: Cohort Retention Analysis

Group customers by their first purchase month and analyze how many repurchased in subsequent months:

WITH customer_cohort AS (
    -- Customer's first purchase month
    SELECT
        customer_id,
        strftime('%Y-%m', MIN(order_date)) as cohort_month
    FROM orders WHERE status = 'completed'
    GROUP BY customer_id
),
monthly_activity AS (
    -- Customer activity by month
    SELECT DISTINCT
        o.customer_id,
        cc.cohort_month,
        strftime('%Y-%m', o.order_date) as activity_month
    FROM orders o
    JOIN customer_cohort cc ON o.customer_id = cc.customer_id
    WHERE o.status = 'completed'
)
SELECT
    cohort_month,
    COUNT(DISTINCT customer_id) as M0,
    COUNT(DISTINCT CASE WHEN activity_month =
        strftime('%Y-%m', DATE(cohort_month || '-01', '+1 month')) THEN customer_id END) as M1,
    COUNT(DISTINCT CASE WHEN activity_month =
        strftime('%Y-%m', DATE(cohort_month || '-01', '+2 month')) THEN customer_id END) as M2,
    COUNT(DISTINCT CASE WHEN activity_month =
        strftime('%Y-%m', DATE(cohort_month || '-01', '+3 month')) THEN customer_id END) as M3
FROM monthly_activity
GROUP BY cohort_month
ORDER BY cohort_month;

Result:

8. Organizing Complex Queries with CTEs

Subquery Hell

SELECT * FROM (
    SELECT * FROM (
        SELECT customer_id, SUM(amount) as total
        FROM orders GROUP BY customer_id
    ) WHERE total > 100000
) WHERE customer_id IN (SELECT customer_id FROM vip_list);

Hard to read, right?

Organized with CTEs

WITH customer_totals AS (
    SELECT customer_id, SUM(amount) as total
    FROM orders
    GROUP BY customer_id
),
high_value AS (
    SELECT * FROM customer_totals
    WHERE total > 100000
)
SELECT hv.*
FROM high_value hv
WHERE hv.customer_id IN (SELECT customer_id FROM vip_list);

Naming each step makes it easier to read and debug.

9. Practical Example: Daily KPI Dashboard

WITH daily_metrics AS (
    SELECT
        DATE(order_date) as date,
        SUM(total_amount) as revenue,
        COUNT(*) as orders,
        COUNT(DISTINCT customer_id) as customers
    FROM orders
    WHERE status = 'completed'
      AND order_date >= DATE('now', '-30 days')
    GROUP BY DATE(order_date)
)
SELECT
    date,
    revenue,
    orders,
    customers,

    -- Day-over-day
    LAG(revenue) OVER (ORDER BY date) as prev_day_revenue,
    ROUND((revenue - LAG(revenue) OVER (ORDER BY date)) * 100.0 /
          LAG(revenue) OVER (ORDER BY date), 1) as dod_growth,

    -- Week-over-week (same day)
    LAG(revenue, 7) OVER (ORDER BY date) as week_ago_revenue,
    ROUND((revenue - LAG(revenue, 7) OVER (ORDER BY date)) * 100.0 /
          LAG(revenue, 7) OVER (ORDER BY date), 1) as wow_growth,

    -- 7-day moving average
    ROUND(AVG(revenue) OVER (
        ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ), 0) as ma7

FROM daily_metrics
ORDER BY date DESC
LIMIT 14;

Result:

10. Performance Tips: When Window Functions Are Slow

Problem: Multiple Different ORDER BYs Are Slow

-- ❌ Different sort each time → Slow
SELECT
    SUM(revenue) OVER (ORDER BY date) as sum1,
    AVG(revenue) OVER (ORDER BY category) as avg1,  -- Different sort
    LAG(revenue) OVER (ORDER BY product_id) as lag1 -- Another different sort
FROM sales;

Solution: Reuse the Same OVER Clause

-- ✅ Same sort shared → Fast
SELECT
    SUM(revenue) OVER w as running_sum,
    AVG(revenue) OVER w as running_avg,
    LAG(revenue) OVER w as prev_revenue
FROM sales
WINDOW w AS (ORDER BY date);  -- Named Window

Use Indexes

-- Index on PARTITION BY + ORDER BY columns
CREATE INDEX idx_sales_cat_date ON sales(category, date);

-- This query becomes faster
SELECT
    category,
    date,
    SUM(revenue) OVER (PARTITION BY category ORDER BY date) as cumsum
FROM sales;

Key Takeaways: 3 Principles of Window Functions

Aggregate While Preserving Rows: GROUP BY reduces rows, Window Functions preserve them
OVER() Is the Core: PARTITION BY for groups, ORDER BY for order, ROWS/RANGE for scope
LAG/LEAD for Comparisons: Month-over-month, week-over-week time series comparisons use LAG/LEAD

Remember these three things, and you can confidently answer "Yes" to "Can SQL do this?"