Data & Analytics🇰🇷 한국어

"Can SQL Do This?" — Escaping Subquery Hell with Window Functions

LAG, LEAD, RANK for month-over-month, rankings, and running totals

"Can SQL Do This?" — Escaping Subquery Hell with Window Functions

"Can SQL Do This?" — Escaping Subquery Hell with Window Functions

Escaping Subquery Hell

"Get me month-over-month growth rate", "Top 3 products in each category", "Cohort retention analysis"...

Ever received these requests and ended up nesting subqueries inside subqueries, inside more subqueries, until you created a query even you couldn't understand? Master Window functions properly, and complex analytical queries become clean and fast.

1. What Are Window Functions? — 1 Minute Explanation

Subquery Approach vs Window Function Approach

"Show me each employee's salary alongside their department average salary"

With subquery:

sql
SELECT
    e.name,
    e.department,
    e.salary,
    (SELECT AVG(salary) FROM employees e2
     WHERE e2.department = e.department) as dept_avg
FROM employees e;

The subquery executes for every row. 10,000 rows means 10,000 executions.

With Window function:

sql
SELECT
    name,
    department,
    salary,
    AVG(salary) OVER (PARTITION BY department) as dept_avg
FROM employees;

Calculated once. Same result, much faster.

Key Difference

GROUP BYWindow Function
Result rowsReduced to group countOriginal rows preserved
Original columnsCan't use if not in GROUP BYAll columns available
Aggregate values1 per groupAdded to every row

2. OVER() Clause Breakdown

The core of Window functions is the OVER() clause. It has three parts:

sql
SUM(revenue) OVER (
    PARTITION BY category    -- Within which group?
    ORDER BY date            -- In what order?
    ROWS BETWEEN             -- Over what range?
        UNBOUNDED PRECEDING
        AND CURRENT ROW
)

PARTITION BY — Dividing Groups

sql
-- Each product's share of total and category revenue
SELECT
    product_name,
    revenue,
    revenue * 100.0 / SUM(revenue) OVER () as total_share,           -- vs Total
    revenue * 100.0 / SUM(revenue) OVER (PARTITION BY category) as category_share  -- vs Category
FROM products;

Result:

productrevenuetotal_sharecategory_share
iPhone500,00025.0%45.5%
Galaxy400,00020.0%36.4%
Pixel200,00010.0%18.1%
Nike Shoes300,00015.0%42.9%
Adidas Shoes400,00020.0%57.1%

ORDER BY — Setting the Order

Adding ORDER BY enables cumulative calculations:

sql
-- Daily cumulative revenue
SELECT
    date,
    daily_revenue,
    SUM(daily_revenue) OVER (ORDER BY date) as cumulative_revenue
FROM daily_sales;

Result:

datedaily_revenuecumulative_revenue
2024-01-01100,000100,000
2024-01-02150,000250,000
2024-01-03120,000370,000
2024-01-04180,000550,000

3. LAG/LEAD — The Core of Month-over-Month Calculations

Basic Usage

sql
LAG(column, N, default)  -- Value N rows back (default 1)
LEAD(column, N, default) -- Value N rows forward (default 1)

Month-over-Month Growth Rate (MoM)

sql
WITH monthly_revenue AS (
    SELECT
        strftime('%Y-%m', order_date) as month,
        SUM(total_amount) as revenue
    FROM orders
    WHERE status = 'completed'
    GROUP BY strftime('%Y-%m', order_date)
)
SELECT
    month,
    revenue,
    LAG(revenue) OVER (ORDER BY month) as prev_month,
    revenue - LAG(revenue) OVER (ORDER BY month) as diff,
    ROUND((revenue - LAG(revenue) OVER (ORDER BY month)) * 100.0 /
          LAG(revenue) OVER (ORDER BY month), 1) as growth_pct
FROM monthly_revenue
ORDER BY month;

Result:

monthrevenueprev_monthdiffgrowth_pct
2024-0152,000,000NULLNULLNULL
2024-0259,000,00052,000,0007,000,00013.5%
2024-0354,000,00059,000,000-5,000,000-8.5%
2024-0461,000,00054,000,0007,000,00013.0%

Week-over-Week Same Day (WoW)

sql
SELECT
    date,
    revenue,
    LAG(revenue, 7) OVER (ORDER BY date) as same_day_last_week,
    ROUND((revenue - LAG(revenue, 7) OVER (ORDER BY date)) * 100.0 /
          LAG(revenue, 7) OVER (ORDER BY date), 1) as wow_growth
FROM daily_sales
WHERE date >= DATE('now', '-14 days');

4. The Ranking Trio — When to Use Which?

ROW_NUMBER vs RANK vs DENSE_RANK

sql
SELECT
    product_name,
    revenue,
    ROW_NUMBER() OVER (ORDER BY revenue DESC) as row_num,
    RANK() OVER (ORDER BY revenue DESC) as rank,
    DENSE_RANK() OVER (ORDER BY revenue DESC) as dense_rank
FROM products;

Result (with ties):

productrevenueROW_NUMBERRANKDENSE_RANK
A1,000,000111
B900,000222
C900,000322
D800,000443

Differences:

  • ROW_NUMBER: Always sequential numbers (ignores ties)
  • RANK: Ties get same rank, next rank skipped (1,2,2,4)
  • DENSE_RANK: Ties get same rank, next rank not skipped (1,2,2,3)

When to Use Which?

ROW_NUMBER — When you need exactly N items

sql
-- Top 3 products per category (exactly 3)
WITH ranked AS (
    SELECT
        category,
        product_name,
        revenue,
        ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) as rn
    FROM products
)
SELECT * FROM ranked WHERE rn <= 3;

RANK — When you want to include all tied entries

sql
-- Top 10, but include all if 10th place has ties
WITH ranked AS (
    SELECT product_name, revenue,
           RANK() OVER (ORDER BY revenue DESC) as rank
    FROM products
)
SELECT * FROM ranked WHERE rank <= 10;
-- Returns 12 rows if 3 items are tied for 10th

DENSE_RANK — When you want no gaps in ranking numbers

sql
-- 1st, 2nd, 3rd... without skipping
SELECT product_name, revenue,
       DENSE_RANK() OVER (ORDER BY revenue DESC) as rank
FROM products;

5. Moving Averages and Cumulative Totals

7-Day Moving Average

sql
SELECT
    date,
    daily_revenue,
    ROUND(AVG(daily_revenue) OVER (
        ORDER BY date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ), 0) as ma7
FROM daily_sales;

Result:

datedaily_revenuema7
2024-01-01100,000100,000
2024-01-02150,000125,000
2024-01-03120,000123,333
.........
2024-01-07140,000130,000
2024-01-08160,000135,714

Monthly Cumulative Revenue (Reset Each Month)

sql
SELECT
    date,
    daily_revenue,
    SUM(daily_revenue) OVER (
        PARTITION BY strftime('%Y-%m', date)  -- Reset monthly
        ORDER BY date
    ) as mtd_revenue  -- Month-to-Date
FROM daily_sales;

Result:

datedaily_revenuemtd_revenue
2024-01-01100,000100,000
2024-01-02150,000250,000
2024-01-31120,0004,500,000
2024-02-01140,000140,000 ← Reset!
2024-02-02160,000300,000

6. ROWS vs RANGE — Subtle but Important Difference

Different Behavior with Duplicate Dates

sql
-- Data: Multiple orders on same date
-- | date       | amount |
-- |------------|--------|
-- | 2024-01-01 | 100    |
-- | 2024-01-01 | 150    |  ← Same date
-- | 2024-01-02 | 200    |

-- ROWS: By physical row order
SELECT date, amount,
       SUM(amount) OVER (ORDER BY date ROWS UNBOUNDED PRECEDING) as rows_sum
FROM orders;
-- Result: 100, 250, 450

-- RANGE: Groups same values together
SELECT date, amount,
       SUM(amount) OVER (ORDER BY date RANGE UNBOUNDED PRECEDING) as range_sum
FROM orders;
-- Result: 250, 250, 450  ← Same dates get same total!

Pro Tip: In most cases, ROWS is more intuitive. Use RANGE only when you need "same values processed together."

7. Practical: Cohort Retention Analysis

Group customers by their first purchase month and analyze how many repurchased in subsequent months:

sql
WITH customer_cohort AS (
    -- Customer's first purchase month
    SELECT
        customer_id,
        strftime('%Y-%m', MIN(order_date)) as cohort_month
    FROM orders WHERE status = 'completed'
    GROUP BY customer_id
),
monthly_activity AS (
    -- Customer activity by month
    SELECT DISTINCT
        o.customer_id,
        cc.cohort_month,
        strftime('%Y-%m', o.order_date) as activity_month
    FROM orders o
    JOIN customer_cohort cc ON o.customer_id = cc.customer_id
    WHERE o.status = 'completed'
)
SELECT
    cohort_month,
    COUNT(DISTINCT customer_id) as M0,
    COUNT(DISTINCT CASE WHEN activity_month =
        strftime('%Y-%m', DATE(cohort_month || '-01', '+1 month')) THEN customer_id END) as M1,
    COUNT(DISTINCT CASE WHEN activity_month =
        strftime('%Y-%m', DATE(cohort_month || '-01', '+2 month')) THEN customer_id END) as M2,
    COUNT(DISTINCT CASE WHEN activity_month =
        strftime('%Y-%m', DATE(cohort_month || '-01', '+3 month')) THEN customer_id END) as M3
FROM monthly_activity
GROUP BY cohort_month
ORDER BY cohort_month;

Result:

cohortM0M1M2M3
2024-0123489 (38%)67 (29%)52 (22%)
2024-02267102 (38%)78 (29%)-
2024-03298115 (39%)--

8. Organizing Complex Queries with CTEs

Subquery Hell

sql
SELECT * FROM (
    SELECT * FROM (
        SELECT customer_id, SUM(amount) as total
        FROM orders GROUP BY customer_id
    ) WHERE total > 100000
) WHERE customer_id IN (SELECT customer_id FROM vip_list);

Hard to read, right?

Organized with CTEs

sql
WITH customer_totals AS (
    SELECT customer_id, SUM(amount) as total
    FROM orders
    GROUP BY customer_id
),
high_value AS (
    SELECT * FROM customer_totals
    WHERE total > 100000
)
SELECT hv.*
FROM high_value hv
WHERE hv.customer_id IN (SELECT customer_id FROM vip_list);

Naming each step makes it easier to read and debug.

9. Practical Example: Daily KPI Dashboard

sql
WITH daily_metrics AS (
    SELECT
        DATE(order_date) as date,
        SUM(total_amount) as revenue,
        COUNT(*) as orders,
        COUNT(DISTINCT customer_id) as customers
    FROM orders
    WHERE status = 'completed'
      AND order_date >= DATE('now', '-30 days')
    GROUP BY DATE(order_date)
)
SELECT
    date,
    revenue,
    orders,
    customers,

    -- Day-over-day
    LAG(revenue) OVER (ORDER BY date) as prev_day_revenue,
    ROUND((revenue - LAG(revenue) OVER (ORDER BY date)) * 100.0 /
          LAG(revenue) OVER (ORDER BY date), 1) as dod_growth,

    -- Week-over-week (same day)
    LAG(revenue, 7) OVER (ORDER BY date) as week_ago_revenue,
    ROUND((revenue - LAG(revenue, 7) OVER (ORDER BY date)) * 100.0 /
          LAG(revenue, 7) OVER (ORDER BY date), 1) as wow_growth,

    -- 7-day moving average
    ROUND(AVG(revenue) OVER (
        ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ), 0) as ma7

FROM daily_metrics
ORDER BY date DESC
LIMIT 14;

Result:

daterevenuedodwowma7
06-152,340,000+5.2%+12.3%2,156,000
06-142,225,000-3.1%+8.7%2,098,000
06-132,296,000+8.4%+15.2%2,045,000

10. Performance Tips: When Window Functions Are Slow

Problem: Multiple Different ORDER BYs Are Slow

sql
-- ❌ Different sort each time → Slow
SELECT
    SUM(revenue) OVER (ORDER BY date) as sum1,
    AVG(revenue) OVER (ORDER BY category) as avg1,  -- Different sort
    LAG(revenue) OVER (ORDER BY product_id) as lag1 -- Another different sort
FROM sales;

Solution: Reuse the Same OVER Clause

sql
-- ✅ Same sort shared → Fast
SELECT
    SUM(revenue) OVER w as running_sum,
    AVG(revenue) OVER w as running_avg,
    LAG(revenue) OVER w as prev_revenue
FROM sales
WINDOW w AS (ORDER BY date);  -- Named Window

Use Indexes

sql
-- Index on PARTITION BY + ORDER BY columns
CREATE INDEX idx_sales_cat_date ON sales(category, date);

-- This query becomes faster
SELECT
    category,
    date,
    SUM(revenue) OVER (PARTITION BY category ORDER BY date) as cumsum
FROM sales;

Key Takeaways: 3 Principles of Window Functions

  1. Aggregate While Preserving Rows: GROUP BY reduces rows, Window Functions preserve them
  2. OVER() Is the Core: PARTITION BY for groups, ORDER BY for order, ROWS/RANGE for scope
  3. LAG/LEAD for Comparisons: Month-over-month, week-over-week time series comparisons use LAG/LEAD

Remember these three things, and you can confidently answer "Yes" to "Can SQL do this?"

Appendix: Window Function Quick Reference

FunctionPurposeExample Use
ROW_NUMBER()Sequential numbersTOP N, deduplication
RANK()Ranking (skips on ties)1,2,2,4
DENSE_RANK()Ranking (no skipping)1,2,2,3
LAG(col, n)Value n rows backMonth-over-month
LEAD(col, n)Value n rows forwardNext month forecast comparison
SUM() OVERCumulative/moving sumRunning total
AVG() OVERMoving average7-day MA
FIRST_VALUE()First value in partitionCategory leader
NTILE(n)Divide into n groupsTop 25% customers

Stay Updated

Follow us for the latest posts and tutorials

Subscribe to Newsletter

Related Posts