코호트 분석 SQL로 직접 구현하기: 리텐션의 모든 것
SOTAAZ·

코호트 분석 SQL로 직접 구현하기: 리텐션의 모든 것
GA4 없이도 코호트 분석 가능합니다. 가입월별 리텐션, N일 리텐션 SQL로 직접 구현.
TL;DR
- 코호트(Cohort): 특정 시점에 같은 행동을 한 사용자 그룹
- 리텐션: 코호트가 시간이 지나도 서비스를 계속 사용하는 비율
- SQL로 구현: Window 함수와 CTE로 GA4 없이 직접 분석 가능
- 핵심 인사이트: 어떤 코호트가 잔존율이 높은지, 언제 이탈이 심한지 파악
1. 코호트 분석이란?
왜 코호트 분석이 필요한가?
잘못된 분석:
이번 달 DAU: 10,000명 → "좋아!"올바른 분석:
1월 가입자 중 3개월 후 활성 사용자: 20%
2월 가입자 중 3개월 후 활성 사용자: 35%
→ "2월에 뭔가 바뀌었구나!"코호트 분석은 시간에 따른 사용자 행동 변화를 추적합니다.
코호트의 종류
2. 기본 테이블 구조
-- 사용자 테이블
CREATE TABLE users (
user_id INT PRIMARY KEY,
created_at TIMESTAMP, -- 가입일
signup_channel VARCHAR(50)
);
-- 활동 로그 테이블
CREATE TABLE user_activities (
id INT PRIMARY KEY,
user_id INT,
activity_type VARCHAR(50),
created_at TIMESTAMP
);
-- 주문 테이블
CREATE TABLE orders (
order_id INT PRIMARY KEY,
user_id INT,
amount DECIMAL(10, 2),
created_at TIMESTAMP
);3. 월별 가입 코호트 리텐션
가장 기본적인 코호트 분석
WITH user_cohorts AS (
-- 각 사용자의 가입월 (코호트)
SELECT
user_id,
DATE_TRUNC('month', created_at) as cohort_month
FROM users
),
user_activities_monthly AS (
-- 각 사용자의 월별 활동 여부
SELECT DISTINCT
user_id,
DATE_TRUNC('month', created_at) as activity_month
FROM user_activities
),
cohort_data AS (
-- 코호트와 활동 데이터 결합
SELECT
uc.cohort_month,
ua.activity_month,
-- 가입 후 몇 개월 차인지
DATE_PART('year', ua.activity_month) * 12 + DATE_PART('month', ua.activity_month)
- DATE_PART('year', uc.cohort_month) * 12 - DATE_PART('month', uc.cohort_month)
as months_since_signup,
COUNT(DISTINCT uc.user_id) as active_users
FROM user_cohorts uc
JOIN user_activities_monthly ua ON uc.user_id = ua.user_id
WHERE ua.activity_month >= uc.cohort_month
GROUP BY uc.cohort_month, ua.activity_month
),
cohort_sizes AS (
-- 각 코호트의 총 사용자 수
SELECT
cohort_month,
COUNT(*) as cohort_size
FROM user_cohorts
GROUP BY cohort_month
)
SELECT
cd.cohort_month,
cs.cohort_size,
cd.months_since_signup,
cd.active_users,
ROUND(cd.active_users * 100.0 / cs.cohort_size, 2) as retention_rate
FROM cohort_data cd
JOIN cohort_sizes cs ON cd.cohort_month = cs.cohort_month
ORDER BY cd.cohort_month, cd.months_since_signup;결과:
4. 피벗 테이블 형태로 보기
코호트 리텐션 매트릭스
WITH user_cohorts AS (
SELECT
user_id,
DATE_TRUNC('month', created_at) as cohort_month
FROM users
),
user_activities_monthly AS (
SELECT DISTINCT
user_id,
DATE_TRUNC('month', created_at) as activity_month
FROM user_activities
),
retention_data AS (
SELECT
uc.cohort_month,
DATE_PART('year', ua.activity_month) * 12 + DATE_PART('month', ua.activity_month)
- DATE_PART('year', uc.cohort_month) * 12 - DATE_PART('month', uc.cohort_month)
as month_number,
COUNT(DISTINCT uc.user_id) as users
FROM user_cohorts uc
LEFT JOIN user_activities_monthly ua
ON uc.user_id = ua.user_id
AND ua.activity_month >= uc.cohort_month
GROUP BY uc.cohort_month, month_number
),
cohort_sizes AS (
SELECT cohort_month, COUNT(*) as size
FROM user_cohorts
GROUP BY cohort_month
)
SELECT
TO_CHAR(r.cohort_month, 'YYYY-MM') as cohort,
cs.size as cohort_size,
MAX(CASE WHEN month_number = 0 THEN ROUND(users * 100.0 / cs.size, 1) END) as "M0",
MAX(CASE WHEN month_number = 1 THEN ROUND(users * 100.0 / cs.size, 1) END) as "M1",
MAX(CASE WHEN month_number = 2 THEN ROUND(users * 100.0 / cs.size, 1) END) as "M2",
MAX(CASE WHEN month_number = 3 THEN ROUND(users * 100.0 / cs.size, 1) END) as "M3",
MAX(CASE WHEN month_number = 4 THEN ROUND(users * 100.0 / cs.size, 1) END) as "M4",
MAX(CASE WHEN month_number = 5 THEN ROUND(users * 100.0 / cs.size, 1) END) as "M5",
MAX(CASE WHEN month_number = 6 THEN ROUND(users * 100.0 / cs.size, 1) END) as "M6"
FROM retention_data r
JOIN cohort_sizes cs ON r.cohort_month = cs.cohort_month
GROUP BY r.cohort_month, cs.size
ORDER BY r.cohort_month;결과 (리텐션 매트릭스):
5. N일 리텐션 (Day 1, Day 7, Day 30)
모바일 앱에서 많이 사용하는 지표
WITH user_first_activity AS (
-- 각 사용자의 첫 활동일
SELECT
user_id,
DATE(MIN(created_at)) as first_activity_date
FROM user_activities
GROUP BY user_id
),
daily_activities AS (
-- 사용자별 활동 날짜
SELECT DISTINCT
user_id,
DATE(created_at) as activity_date
FROM user_activities
),
retention_flags AS (
SELECT
ufa.user_id,
ufa.first_activity_date,
-- Day N 활동 여부 체크
MAX(CASE WHEN da.activity_date = ufa.first_activity_date + INTERVAL '1 day'
THEN 1 ELSE 0 END) as day_1,
MAX(CASE WHEN da.activity_date = ufa.first_activity_date + INTERVAL '7 days'
THEN 1 ELSE 0 END) as day_7,
MAX(CASE WHEN da.activity_date = ufa.first_activity_date + INTERVAL '14 days'
THEN 1 ELSE 0 END) as day_14,
MAX(CASE WHEN da.activity_date = ufa.first_activity_date + INTERVAL '30 days'
THEN 1 ELSE 0 END) as day_30
FROM user_first_activity ufa
LEFT JOIN daily_activities da ON ufa.user_id = da.user_id
GROUP BY ufa.user_id, ufa.first_activity_date
)
SELECT
DATE_TRUNC('week', first_activity_date) as cohort_week,
COUNT(*) as cohort_size,
ROUND(SUM(day_1) * 100.0 / COUNT(*), 2) as day_1_retention,
ROUND(SUM(day_7) * 100.0 / COUNT(*), 2) as day_7_retention,
ROUND(SUM(day_14) * 100.0 / COUNT(*), 2) as day_14_retention,
ROUND(SUM(day_30) * 100.0 / COUNT(*), 2) as day_30_retention
FROM retention_flags
GROUP BY DATE_TRUNC('week', first_activity_date)
ORDER BY cohort_week;결과:
6. 구매 코호트: 재구매율 분석
첫 구매 후 재구매 패턴
WITH first_purchase AS (
-- 각 고객의 첫 구매
SELECT
user_id,
DATE_TRUNC('month', MIN(created_at)) as first_purchase_month
FROM orders
GROUP BY user_id
),
monthly_purchases AS (
-- 고객별 월별 구매 여부
SELECT DISTINCT
user_id,
DATE_TRUNC('month', created_at) as purchase_month
FROM orders
),
repurchase_data AS (
SELECT
fp.first_purchase_month as cohort,
DATE_PART('year', mp.purchase_month) * 12 + DATE_PART('month', mp.purchase_month)
- DATE_PART('year', fp.first_purchase_month) * 12 - DATE_PART('month', fp.first_purchase_month)
as months_since_first,
COUNT(DISTINCT fp.user_id) as customers
FROM first_purchase fp
JOIN monthly_purchases mp ON fp.user_id = mp.user_id
WHERE mp.purchase_month >= fp.first_purchase_month
GROUP BY fp.first_purchase_month, months_since_first
),
cohort_sizes AS (
SELECT first_purchase_month, COUNT(*) as size
FROM first_purchase
GROUP BY first_purchase_month
)
SELECT
TO_CHAR(rd.cohort, 'YYYY-MM') as cohort,
cs.size,
rd.months_since_first,
rd.customers,
ROUND(rd.customers * 100.0 / cs.size, 2) as repurchase_rate
FROM repurchase_data rd
JOIN cohort_sizes cs ON rd.cohort = cs.first_purchase_month
WHERE rd.months_since_first <= 6
ORDER BY rd.cohort, rd.months_since_first;7. 채널별 코호트 비교
어떤 채널의 사용자가 더 오래 남는가?
WITH user_cohorts AS (
SELECT
user_id,
DATE_TRUNC('month', created_at) as cohort_month,
signup_channel
FROM users
),
user_activities_monthly AS (
SELECT DISTINCT
user_id,
DATE_TRUNC('month', created_at) as activity_month
FROM user_activities
),
channel_retention AS (
SELECT
uc.signup_channel,
DATE_PART('year', ua.activity_month) * 12 + DATE_PART('month', ua.activity_month)
- DATE_PART('year', uc.cohort_month) * 12 - DATE_PART('month', uc.cohort_month)
as month_number,
COUNT(DISTINCT uc.user_id) as active_users,
COUNT(DISTINCT CASE WHEN ua.activity_month = uc.cohort_month
THEN uc.user_id END) as cohort_size
FROM user_cohorts uc
LEFT JOIN user_activities_monthly ua
ON uc.user_id = ua.user_id
AND ua.activity_month >= uc.cohort_month
GROUP BY uc.signup_channel, month_number
)
SELECT
signup_channel,
month_number,
active_users,
ROUND(active_users * 100.0 / NULLIF(MAX(cohort_size) OVER (PARTITION BY signup_channel), 0), 2) as retention_rate
FROM channel_retention
WHERE month_number BETWEEN 0 AND 6
ORDER BY signup_channel, month_number;결과:
→ Organic 유입이 유료 광고보다 리텐션이 2배!
8. 롤링 리텐션 vs 클래식 리텐션
클래식 리텐션
"정확히 Day 7에 접속한 사용자"
롤링 리텐션 (더 관대함)
"Day 7 이후 언제든 접속한 사용자"
WITH user_first_activity AS (
SELECT
user_id,
DATE(MIN(created_at)) as first_date
FROM user_activities
GROUP BY user_id
),
daily_activities AS (
SELECT DISTINCT user_id, DATE(created_at) as activity_date
FROM user_activities
),
retention_comparison AS (
SELECT
ufa.user_id,
ufa.first_date,
-- 클래식: 정확히 Day 7
MAX(CASE WHEN da.activity_date = ufa.first_date + 7 THEN 1 ELSE 0 END) as classic_d7,
-- 롤링: Day 7 이후 아무 때나
MAX(CASE WHEN da.activity_date >= ufa.first_date + 7 THEN 1 ELSE 0 END) as rolling_d7
FROM user_first_activity ufa
LEFT JOIN daily_activities da ON ufa.user_id = da.user_id
GROUP BY ufa.user_id, ufa.first_date
)
SELECT
DATE_TRUNC('week', first_date) as cohort_week,
COUNT(*) as users,
ROUND(SUM(classic_d7) * 100.0 / COUNT(*), 2) as classic_d7_retention,
ROUND(SUM(rolling_d7) * 100.0 / COUNT(*), 2) as rolling_d7_retention
FROM retention_comparison
GROUP BY DATE_TRUNC('week', first_date)
ORDER BY cohort_week;9. LTV by 코호트
코호트별 고객 생애 가치
WITH user_cohorts AS (
SELECT
user_id,
DATE_TRUNC('month', created_at) as cohort_month
FROM users
),
user_revenue AS (
SELECT
user_id,
SUM(amount) as total_revenue,
COUNT(*) as order_count
FROM orders
GROUP BY user_id
)
SELECT
TO_CHAR(uc.cohort_month, 'YYYY-MM') as cohort,
COUNT(DISTINCT uc.user_id) as users,
COUNT(DISTINCT ur.user_id) as paying_users,
ROUND(COUNT(DISTINCT ur.user_id) * 100.0 / COUNT(DISTINCT uc.user_id), 2) as conversion_rate,
ROUND(AVG(ur.total_revenue), 2) as avg_revenue_per_paying_user,
ROUND(SUM(ur.total_revenue) / COUNT(DISTINCT uc.user_id), 2) as revenue_per_user
FROM user_cohorts uc
LEFT JOIN user_revenue ur ON uc.user_id = ur.user_id
GROUP BY uc.cohort_month
ORDER BY uc.cohort_month;10. 실전 팁
1. 코호트 크기 필터링
작은 코호트는 통계적으로 불안정합니다.
HAVING COUNT(*) >= 100 -- 최소 100명 이상2. 시각화를 위한 데이터 포맷
-- 히트맵 시각화용 long format
SELECT
cohort_month,
month_number,
retention_rate
FROM retention_data
ORDER BY cohort_month, month_number;3. 이탈률 = 100 - 리텐션
SELECT
cohort,
month_number,
retention_rate,
100 - retention_rate as churn_rate,
LAG(retention_rate) OVER (PARTITION BY cohort ORDER BY month_number) - retention_rate as monthly_churn
FROM retention_matrix;결론
코호트 분석 핵심:
- 시간에 따른 사용자 행동을 추적
- 평균이 아닌 그룹별 트렌드 파악
- 제품 개선의 효과를 정량적으로 측정
References
- Amplitude - Retention Analysis Guide
- Mixpanel - Cohort Analysis Best Practices
- Mode Analytics - SQL for Cohort Analysis