"We Need Python for This" — Handling Pivot, JSON, UTM, RFM All in SQL

"We Need Python for This" — Handling Pivot, JSON, UTM, RFM All in SQL
Python 100 Lines vs SQL One Query
"We need to build ETL in Python" — every time I hear this, I wonder: Set up Airflow, manage dependencies, configure schedulers... Do we really need all that?
80% of data processing tasks in real work can be done with SQL alone.
This article covers patterns for handling Pivot, JSON parsing, UTM extraction, and RFM segmentation — tasks commonly thought to "need Python" — with just SQL.
1. SQL vs Python: When to Choose What
When SQL is Better
- Aggregation: DB engine parallel processing beats pandas
- Pivot/Unpivot: Native support in BigQuery, Snowflake
- JSON parsing: Built-in functions like JSON_EXTRACT, JSON_VALUE
- Regex extraction: REGEXP_EXTRACT for UTM, etc.
- Large joins: Distributed processing, broadcast joins
When You Need Python
- External APIs: REST APIs, web scraping
- ML models: Predictions, clustering
- Complex NLP: Tokenization, morphological analysis
- Stateful processing: Cumulative calculations dependent on previous rows
Cost Comparison
2. Pivot — Transform Rows to Report Format
CASE WHEN Method (Universal)
SELECT
category,
COALESCE(SUM(CASE WHEN MONTH(order_date) = 1 THEN revenue END), 0) AS jan,
COALESCE(SUM(CASE WHEN MONTH(order_date) = 2 THEN revenue END), 0) AS feb,
COALESCE(SUM(CASE WHEN MONTH(order_date) = 3 THEN revenue END), 0) AS mar
FROM orders
GROUP BY category;Key: Use COALESCE for NULL defense. NULL + number = NULL can distort aggregations.
BigQuery PIVOT Operator
SELECT * FROM (
SELECT category, EXTRACT(MONTH FROM order_date) AS month, revenue
FROM orders
WHERE order_date >= '2024-01-01' AND order_date < '2024-04-01'
)
PIVOT (
SUM(revenue)
FOR month IN (1 AS jan, 2 AS feb, 3 AS mar)
)
ORDER BY category;Native PIVOT provides 10-30% performance improvement over CASE WHEN.
3. JSON Parsing — Handling Web Event Logs
Basic Parsing
SELECT
user_id,
JSON_EXTRACT_SCALAR(event_params, '$.page_url') AS page_url,
JSON_EXTRACT_SCALAR(event_params, '$.campaign') AS campaign,
JSON_EXTRACT_SCALAR(event_params, '$.device') AS device
FROM events
WHERE DATE(event_timestamp) = '2024-01-15';Performance: Parse Once with CTE
WITH parsed AS (
SELECT *, JSON_EXTRACT_SCALAR(event_params, '$.page_url') AS page_url
FROM events
)
SELECT * FROM parsed WHERE page_url LIKE '%/products%';Nested JSON Arrays
WITH order_items AS (
SELECT order_id, item_json
FROM orders, UNNEST(JSON_EXTRACT_ARRAY(items, '$')) AS item_json
)
SELECT
order_id,
JSON_EXTRACT_SCALAR(item_json, '$.product_id') AS product_id,
CAST(JSON_EXTRACT_SCALAR(item_json, '$.qty') AS INT64) AS quantity
FROM order_items;4. UTM Extraction — Automate Channel Analysis
Regex Extraction
WITH url_normalized AS (
SELECT session_id,
LOWER(REGEXP_REPLACE(
REGEXP_REPLACE(landing_page, r'#.*$', ''),
r'%20', ' '
)) AS clean_url
FROM sessions
)
SELECT
session_id,
REGEXP_EXTRACT(clean_url, r'utm_source=([^&]+)') AS utm_source,
REGEXP_EXTRACT(clean_url, r'utm_medium=([^&]+)') AS utm_medium,
REGEXP_EXTRACT(clean_url, r'utm_campaign=([^&]+)') AS utm_campaign
FROM url_normalized;Channel Grouping
CASE
WHEN utm_medium IN ('cpc', 'ppc', 'paidsearch') THEN 'Paid Search'
WHEN utm_medium IN ('email', 'newsletter') THEN 'Email'
WHEN utm_source IN ('facebook', 'instagram') AND utm_medium = 'cpc' THEN 'Paid Social'
WHEN utm_source IS NULL THEN 'Direct'
ELSE 'Other'
END AS channel_group5. RFM Segmentation — Auto Customer Classification
What is RFM
- R (Recency): When was the last purchase?
- F (Frequency): How often?
- M (Monetary): How much?
RFM in SQL
WITH customer_metrics AS (
SELECT
customer_id,
DATE_DIFF(CURRENT_DATE(), MAX(order_date), DAY) AS recency_days,
COUNT(DISTINCT order_id) AS frequency,
SUM(order_amount) AS monetary
FROM orders
WHERE order_status = 'completed'
AND order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 365 DAY)
GROUP BY customer_id
),
rfm_scores AS (
SELECT
customer_id,
recency_days, frequency, monetary,
NTILE(5) OVER (ORDER BY recency_days DESC) AS r_score,
NTILE(5) OVER (ORDER BY frequency ASC) AS f_score,
NTILE(5) OVER (ORDER BY monetary ASC) AS m_score
FROM customer_metrics
)
SELECT
customer_id,
CONCAT(r_score, f_score, m_score) AS rfm_score,
CASE
WHEN r_score >= 4 AND f_score >= 4 AND m_score >= 4 THEN 'Champions'
WHEN f_score >= 4 AND m_score >= 3 THEN 'Loyal Customers'
WHEN r_score >= 4 AND f_score <= 2 THEN 'New Customers'
WHEN r_score <= 2 AND f_score >= 4 THEN 'At Risk'
WHEN r_score = 1 AND f_score <= 2 THEN 'Lost'
ELSE 'Others'
END AS segment
FROM rfm_scores;6. Production ETL — Build Data Marts with Scheduled Queries
ELT Paradigm
ETL became ELT (Extract-Load-Transform). In BigQuery, Snowflake, load raw data first, then transform with SQL.
Layered Architecture
- Raw: Store as-is
- Clean: NULL handling, JSON parsing, UTM extraction
- Aggregate: Daily/weekly aggregations
- Mart: Final tables per business domain
7. Common Mistakes
Missing NULL Handling
SELECT
SUM(COALESCE(revenue, 0)) AS total,
COUNTIF(revenue IS NULL) AS null_count
FROM orders;Ignoring Timezones
SELECT DATE(event_timestamp, 'Asia/Seoul') AS date FROM events;Conclusion
Before saying "we need Python", check if SQL can do it first.
A well-written SQL query beats 100 lines of Python:
- Easier to maintain
- Better performance
- Lower cost