Crisis at DataFlow Analytics! Their e-commerce reporting system is grinding to a halt. Daily sales reports that should take minutes are taking hours. The CEO needs real-time dashboards, but queries are timing out. Your mission: Transform their perfectly normalized database into a high-performance reporting powerhouse - even if it means breaking some sacred normalization rules!
By the end of this lesson, you'll be able to:

- Spot reporting queries that are drowning in joins
- Design a denormalized star schema with fact and dimension tables
- Build aggregate tables and materialized views for fast dashboards
- Keep denormalized data in sync with refresh procedures and validation checks
- Weigh the performance gains against the storage and consistency costs

Here's the query that's bringing DataFlow to its knees:

```sql
-- Current normalized query: 15+ minute execution time!
SELECT
c.customer_name,
p.product_name,
cat.category_name,
SUM(oi.quantity * oi.unit_price) as total_sales,
COUNT(DISTINCT o.order_id) as order_count
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON oi.product_id = p.product_id
JOIN categories cat ON p.category_id = cat.category_id
JOIN addresses a ON c.address_id = a.address_id
JOIN cities ct ON a.city_id = ct.city_id
JOIN countries cn ON ct.country_id = cn.country_id
WHERE o.order_date >= '2024-01-01'
GROUP BY c.customer_id, p.product_id, cat.category_id;
```

**The Problem:** Seven joins across normalized tables, scanning millions of rows for every report!
Transform your normalized snowflake into a star:

```sql
-- Step 1: Create a denormalized fact table
CREATE TABLE fact_sales (
sale_id BIGINT PRIMARY KEY,
-- Foreign keys to dimension tables
date_key INT,
customer_key INT,
product_key INT,
-- Denormalized attributes for fast filtering
order_date DATE,
customer_name VARCHAR(100),
customer_country VARCHAR(50),
product_name VARCHAR(200),
category_name VARCHAR(50),
-- Measures
quantity INT,
unit_price DECIMAL(10,2),
total_amount DECIMAL(10,2)
);

-- Indexes for common queries (PostgreSQL creates indexes separately, not inline)
CREATE INDEX idx_date ON fact_sales (order_date);
CREATE INDEX idx_customer ON fact_sales (customer_key, order_date);
CREATE INDEX idx_product ON fact_sales (product_key, order_date);
-- Now the same report runs in seconds!
SELECT
customer_name,
product_name,
category_name,
SUM(total_amount) as total_sales,
COUNT(*) as transaction_count
FROM fact_sales
WHERE order_date >= '2024-01-01'
GROUP BY customer_name, product_name, category_name;
```
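The fact table carries surrogate keys (`date_key`, `customer_key`, `product_key`), but the dimension tables themselves never appear in this lesson. A minimal sketch of what they might look like, with hypothetical names and columns chosen for illustration:

```sql
-- Hypothetical dimension tables behind the surrogate keys in fact_sales
CREATE TABLE dim_date (
    date_key     INT PRIMARY KEY,   -- e.g. 20240115
    full_date    DATE NOT NULL,
    year         INT,
    quarter      INT,
    month        INT,
    day_of_week  INT
);

CREATE TABLE dim_customer (
    customer_key  INT PRIMARY KEY,  -- surrogate key for the warehouse
    customer_id   INT,              -- natural key from the OLTP system
    customer_name VARCHAR(100),
    country       VARCHAR(50)
);

CREATE TABLE dim_product (
    product_key   INT PRIMARY KEY,
    product_id    INT,
    product_name  VARCHAR(200),
    category_name VARCHAR(50)
);
```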
Pre-computed summary tables take the idea even further:

```sql
-- Daily sales summary (updated nightly)
CREATE TABLE daily_sales_summary (
summary_date DATE,
customer_key INT,
product_category VARCHAR(50),
total_sales DECIMAL(12,2),
order_count INT,
unique_products INT,
PRIMARY KEY (summary_date, customer_key, product_category)
);
-- Monthly rollup for executive dashboards
CREATE TABLE monthly_sales_rollup (
year_month VARCHAR(7),
category_name VARCHAR(50),
total_revenue DECIMAL(15,2),
total_orders INT,
avg_order_value DECIMAL(10,2),
top_customer VARCHAR(100),
top_customer_revenue DECIMAL(12,2)
);
```
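The monthly rollup never gets a load query in the lesson; here is one possible way to rebuild the current month from `fact_sales`. The `DISTINCT ON` trick for picking `top_customer` is an illustrative choice, not something the lesson prescribes:

```sql
-- Remove the current month so the rebuild is repeatable
DELETE FROM monthly_sales_rollup
WHERE year_month = TO_CHAR(CURRENT_DATE, 'YYYY-MM');

-- Rebuild it from the fact table
WITH per_customer AS (
    SELECT
        TO_CHAR(order_date, 'YYYY-MM') AS year_month,
        category_name,
        customer_name,
        SUM(total_amount)       AS customer_revenue,
        COUNT(DISTINCT sale_id) AS customer_rows   -- fact rows, mirroring daily_sales_summary
    FROM fact_sales
    WHERE order_date >= date_trunc('month', CURRENT_DATE)
    GROUP BY 1, 2, 3
)
INSERT INTO monthly_sales_rollup
SELECT DISTINCT ON (year_month, category_name)
    year_month,
    category_name,
    SUM(customer_revenue) OVER w AS total_revenue,
    SUM(customer_rows)    OVER w AS total_orders,
    ROUND(SUM(customer_revenue) OVER w
          / NULLIF(SUM(customer_rows) OVER w, 0), 2) AS avg_order_value,
    customer_name    AS top_customer,        -- highest-revenue customer per category
    customer_revenue AS top_customer_revenue
FROM per_customer
WINDOW w AS (PARTITION BY year_month, category_name)
ORDER BY year_month, category_name, customer_revenue DESC;
```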
```sql
-- PostgreSQL materialized view for real-time-ish reporting
CREATE MATERIALIZED VIEW customer_360 AS
SELECT
c.customer_id,
c.customer_name,
c.email,
-- Denormalized location
CONCAT(ct.city_name, ', ', cn.country_name) as location,
-- Pre-calculated metrics
COUNT(DISTINCT o.order_id) as lifetime_orders,
SUM(o.total_amount) as lifetime_value,
AVG(o.total_amount) as avg_order_value,
MAX(o.order_date) as last_order_date,
-- Segmentation helpers
CASE
WHEN SUM(o.total_amount) > 10000 THEN 'VIP'
WHEN SUM(o.total_amount) > 1000 THEN 'Regular'
ELSE 'Occasional'
END as customer_segment
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
LEFT JOIN addresses a ON c.address_id = a.address_id
LEFT JOIN cities ct ON a.city_id = ct.city_id
LEFT JOIN countries cn ON ct.country_id = cn.country_id
GROUP BY c.customer_id, c.customer_name, c.email, ct.city_name, cn.country_name;
-- Refresh strategy: CONCURRENTLY requires a unique index on the materialized view
CREATE UNIQUE INDEX idx_customer_360_id ON customer_360 (customer_id);
CREATE INDEX idx_customer_360_segment ON customer_360 (customer_segment);
REFRESH MATERIALIZED VIEW CONCURRENTLY customer_360;
```
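With the heavy lifting done at refresh time, dashboard queries against customer_360 become single-table scans. An illustrative example:

```sql
-- Example: high-value customers who have gone quiet (win-back list)
SELECT customer_name, email, lifetime_value, last_order_date
FROM customer_360
WHERE customer_segment = 'VIP'
  AND last_order_date < CURRENT_DATE - 30
ORDER BY lifetime_value DESC
LIMIT 50;
```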
Before denormalizing anything, measure. Run the suspect report through the planner:

```sql
-- Query complexity check
EXPLAIN ANALYZE
SELECT /* your reporting query */;
-- Look for these red flags:
-- ✗ Multiple nested loops
-- ✗ Full table scans on large tables
-- ✗ Join cardinality explosions
-- ✗ Execution time > 1 second for dashboards
```
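If the pg_stat_statements extension is enabled (it ships with PostgreSQL but is off by default, so treat this as an assumption about your setup), you can also hunt for the worst offenders across the whole workload. Column names below assume PostgreSQL 13 or newer:

```sql
-- Top 10 statements by average execution time
SELECT
    LEFT(query, 80)                            AS query_snippet,
    calls,
    ROUND(mean_exec_time::NUMERIC, 1)          AS avg_ms,
    ROUND(total_exec_time::NUMERIC / 1000, 1)  AS total_seconds
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```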
```
Is this a reporting/analytics workload?
├─ YES → Is query performance critical?
│ ├─ YES → Are there many joins (>3)?
│ │ ├─ YES → Consider denormalization!
│ │ └─ NO → Try indexing first
│ └─ NO → Keep it normalized
└─ NO → Is it OLTP?
└─ YES → Stay normalized!
```
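The "try indexing first" branch deserves a concrete example before you commit to denormalizing. Covering indexes can often rescue a report with a handful of joins; the index names below are illustrative, and INCLUDE requires PostgreSQL 11+:

```sql
-- Composite + covering indexes on the join/filter columns the report touches
CREATE INDEX idx_orders_date_customer
    ON orders (order_date, customer_id)
    INCLUDE (order_id);

CREATE INDEX idx_order_items_order
    ON order_items (order_id)
    INCLUDE (product_id, quantity, unit_price);
```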
Decided to denormalize? Then you need a pipeline that feeds the fact table:

```sql
-- Stored procedure for nightly fact table refresh
CREATE OR REPLACE PROCEDURE refresh_fact_sales()
LANGUAGE plpgsql
AS $$
DECLARE
v_rows BIGINT;
BEGIN
-- Clear yesterday's load so the job can be re-run safely
DELETE FROM fact_sales WHERE order_date = CURRENT_DATE - 1;
-- Insert denormalized data
INSERT INTO fact_sales
SELECT
o.order_id * 10000 + oi.line_item_id as sale_id,
TO_CHAR(o.order_date, 'YYYYMMDD')::INT as date_key,
c.customer_id as customer_key,
p.product_id as product_key,
o.order_date,
c.customer_name,
cn.country_name as customer_country,
p.product_name,
cat.category_name,
oi.quantity,
oi.unit_price,
oi.quantity * oi.unit_price as total_amount
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON oi.product_id = p.product_id
JOIN categories cat ON p.category_id = cat.category_id
JOIN addresses a ON c.address_id = a.address_id
JOIN cities ct ON a.city_id = ct.city_id
JOIN countries cn ON ct.country_id = cn.country_id
WHERE o.order_date = CURRENT_DATE - 1;
GET DIAGNOSTICS v_rows = ROW_COUNT;
-- Update aggregates
PERFORM refresh_daily_summaries();
RAISE NOTICE 'Fact table refreshed: % rows', v_rows;
END;
$$;
-- Schedule it!
-- In production, use cron or your scheduler:
-- 0 2 * * * psql -c "CALL refresh_fact_sales();"
```
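If you prefer to keep scheduling inside the database, the pg_cron extension can run the same call on the same schedule. This assumes pg_cron is installed and enabled; it is not part of core PostgreSQL:

```sql
-- Nightly refresh at 2 AM, managed from inside the database
SELECT cron.schedule(
    'nightly-fact-refresh',        -- job name
    '0 2 * * *',                   -- crontab syntax
    'CALL refresh_fact_sales()'
);
```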
```sql
-- Incremental aggregate refresh
CREATE OR REPLACE FUNCTION refresh_daily_summaries()
RETURNS void AS $$
BEGIN
-- Use UPSERT for idempotency
INSERT INTO daily_sales_summary
SELECT
order_date as summary_date,
customer_key,
category_name as product_category,
SUM(total_amount) as total_sales,
COUNT(DISTINCT sale_id) as order_count,
COUNT(DISTINCT product_key) as unique_products
FROM fact_sales
WHERE order_date >= CURRENT_DATE - 7 -- Last week only
GROUP BY order_date, customer_key, category_name
ON CONFLICT (summary_date, customer_key, product_category)
DO UPDATE SET
total_sales = EXCLUDED.total_sales,
order_count = EXCLUDED.order_count,
unique_products = EXCLUDED.unique_products;
END;
$$ LANGUAGE plpgsql;
```
```sql
-- Create a view that automatically uses the right table
CREATE OR REPLACE VIEW sales_report AS
SELECT
COALESCE(ds.summary_date, fs.order_date) as report_date,
COALESCE(ds.customer_key, fs.customer_key) as customer_key,
COALESCE(ds.product_category, fs.category_name) as category,
COALESCE(ds.total_sales, SUM(fs.total_amount)) as total_sales,
COALESCE(ds.order_count, COUNT(DISTINCT fs.sale_id)) as orders
FROM daily_sales_summary ds
FULL OUTER JOIN fact_sales fs
ON ds.summary_date = fs.order_date
AND ds.customer_key = fs.customer_key
AND ds.product_category = fs.category_name
WHERE ds.summary_date IS NOT NULL
OR fs.order_date >= CURRENT_DATE - 30
GROUP BY 1, 2, 3, ds.total_sales, ds.order_count;
```
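Report writers then query one object without caring which underlying table answered. An illustrative query:

```sql
-- Last 7 days of revenue by category, served entirely by the view
SELECT category, SUM(total_sales) AS revenue, SUM(orders) AS order_count
FROM sales_report
WHERE report_date >= CURRENT_DATE - 7
GROUP BY category
ORDER BY revenue DESC;
```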
Denormalization is a trade, not a free upgrade. Here is what DataFlow gained and what it paid:

**GAINS:**

- ✓ 100x faster queries
- ✓ Predictable performance
- ✓ Happy executives with real-time dashboards
- ✓ Reduced database load

**COSTS:**

- ✗ Storage space (2-3x)
- ✗ Data freshness lag
- ✗ ETL complexity
- ✗ Potential inconsistencies
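The storage cost, at least, is easy to measure. pg_size_pretty and pg_total_relation_size are built-in PostgreSQL functions; the table names are the ones used in this lesson:

```sql
-- Compare the footprint of the denormalized tables against their sources
SELECT
    relname                                     AS table_name,
    pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relname IN ('fact_sales', 'daily_sales_summary', 'orders', 'order_items')
ORDER BY pg_total_relation_size(oid) DESC;
```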
```sql
-- Problem: Denormalized data gets out of sync
-- Solution: Validation queries
CREATE OR REPLACE FUNCTION validate_denormalization()
RETURNS TABLE(check_name TEXT, is_valid BOOLEAN, details TEXT) AS $$
BEGIN
-- Check 1: Fact table completeness
RETURN QUERY
SELECT
'Fact table completeness'::TEXT,
NOT EXISTS (
SELECT 1 FROM orders o
WHERE o.order_date >= CURRENT_DATE - 7
AND NOT EXISTS (
SELECT 1 FROM fact_sales fs
WHERE fs.order_date = o.order_date
)
),
'Missing orders: ' || COUNT(*)::TEXT
FROM orders o
WHERE o.order_date >= CURRENT_DATE - 7
AND NOT EXISTS (
SELECT 1 FROM fact_sales fs
WHERE fs.order_date = o.order_date
);
-- Check 2: Aggregate accuracy
RETURN QUERY
WITH fact_totals AS (
SELECT
order_date,
SUM(total_amount) as fact_total
FROM fact_sales
WHERE order_date >= CURRENT_DATE - 7
GROUP BY order_date
),
summary_totals AS (
SELECT
summary_date,
SUM(total_sales) as summary_total
FROM daily_sales_summary
WHERE summary_date >= CURRENT_DATE - 7
GROUP BY summary_date
)
SELECT
'Aggregate accuracy'::TEXT,
COALESCE(MAX(ABS(f.fact_total - s.summary_total)), 0) < 0.01,
'Max difference: ' || COALESCE(MAX(ABS(f.fact_total - s.summary_total)), 0)::TEXT
FROM fact_totals f
JOIN summary_totals s ON f.order_date = s.summary_date;
END;
$$ LANGUAGE plpgsql;

-- Run validation after ETL
SELECT * FROM validate_denormalization();
```
## Real-World War Stories
### Case Study: Netflix's Viewing Analytics
Netflix processes billions of viewing events daily. Their solution:
```sql
-- Simplified version of their approach
-- Raw events → Hourly aggregates → Daily rollups → Monthly summaries
-- Level 1: Raw events (kept for 7 days)
CREATE TABLE viewing_events (
event_id BIGINT,
user_id INT,
content_id INT,
timestamp TIMESTAMP,
duration_seconds INT
) PARTITION BY RANGE (timestamp);
-- Level 2: Hourly aggregates (kept for 90 days)
CREATE TABLE hourly_viewing (
hour_bucket TIMESTAMP,
user_id INT,
content_category VARCHAR(50),
total_minutes DECIMAL(10,2),
session_count INT,
PRIMARY KEY (hour_bucket, user_id, content_category)
);
-- Level 3: Daily profiles (kept forever)
CREATE TABLE daily_user_profile (
profile_date DATE,
user_id INT,
total_watch_time INT,
favorite_category VARCHAR(50),
recommendation_score DECIMAL(5,2)
);
```
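The Level-1 table is declared PARTITION BY RANGE, so it needs actual partitions before it will accept rows, and the 7-day retention implies dropping old ones. A minimal sketch with made-up partition names and dates:

```sql
-- Create the next daily partition ahead of time...
CREATE TABLE viewing_events_2024_06_02
    PARTITION OF viewing_events
    FOR VALUES FROM ('2024-06-02') TO ('2024-06-03');

-- ...and enforce the 7-day retention by dropping the oldest one
DROP TABLE IF EXISTS viewing_events_2024_05_26;
```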
### The Amazon Approach: Multiple Truth Sources

```sql
-- OLTP: Real-time order processing (normalized)
-- Data Lake: Raw event storage (semi-structured)
-- Data Warehouse: Star schema (denormalized)
-- Caching Layer: Redis/DynamoDB (ultra-denormalized)
-- Each serves its purpose!
```
## Your Denormalization Toolkit

```sql
-- Performance testing harness
CREATE OR REPLACE FUNCTION compare_query_performance(
normalized_query TEXT,
denormalized_query TEXT
) RETURNS TABLE(
query_type TEXT,
execution_time INTERVAL,
rows_returned BIGINT
) AS $$
DECLARE
start_time TIMESTAMP;
end_time TIMESTAMP;
row_count BIGINT;
BEGIN
-- Test normalized
start_time := clock_timestamp();
EXECUTE normalized_query;
GET DIAGNOSTICS row_count = ROW_COUNT;
end_time := clock_timestamp();
RETURN QUERY SELECT
'Normalized'::TEXT,
end_time - start_time,
row_count;
-- Test denormalized
start_time := clock_timestamp();
EXECUTE denormalized_query;
GET DIAGNOSTICS row_count = ROW_COUNT;
end_time := clock_timestamp();
RETURN QUERY SELECT
'Denormalized'::TEXT,
end_time - start_time,
row_count;
END;
$$ LANGUAGE plpgsql;
```
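Usage is just two query strings. The call below is a hedged illustration using trimmed-down versions of this lesson's queries, with dollar-quoting so the strings need no escaping:

```sql
SELECT * FROM compare_query_performance(
    -- a trimmed version of the normalized report
    $q1$ SELECT c.customer_name, SUM(oi.quantity * oi.unit_price)
         FROM orders o
         JOIN order_items oi ON o.order_id = oi.order_id
         JOIN customers c ON o.customer_id = c.customer_id
         WHERE o.order_date >= '2024-01-01'
         GROUP BY c.customer_name $q1$,
    -- the denormalized equivalent
    $q2$ SELECT customer_name, SUM(total_amount)
         FROM fact_sales
         WHERE order_date >= '2024-01-01'
         GROUP BY customer_name $q2$
);
```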
## Quick Reference Card

**DENORMALIZATION PATTERNS:**

1. **Star Schema**: Facts + Dimensions
2. **Snowflake**: Normalized dimensions
3. **Galaxy**: Multiple fact tables
4. **Data Vault**: Hubs, Links, Satellites

**WHEN TO USE EACH:**

- Star: Simple reporting needs
- Snowflake: Storage is expensive
- Galaxy: Multiple business processes
- Data Vault: Regulatory compliance

**REFRESH STRATEGIES:**

- Full refresh: Simple but slow
- Incremental: Complex but fast
- CDC (Change Data Capture): Real-time
- Streaming: Lowest latency