Template 5: Data Preparation

Student starter code (30% baseline)

📦 Project Files Included:

📄 index.html - Main HTML page
📜 script.js - JavaScript logic
🎨 styles.css - Styling and layout
📦 package.json - Dependencies
⚙️ setup.sh - Setup script
📖 README.md - Instructions (below)

💡 Download the ZIP, extract it, and follow the instructions below to get started!

Activity 05: Data Preparation - Discovery Challenge

🎯 Learning Objectives

By completing this activity, you will:

Data Cleaning: Remove unwanted columns, handle duplicates, and manage missing values
Data Preprocessing: Apply label encoding to categorical data and MinMax scaling to numerical data
Train-Test Splitting: Properly split datasets for machine learning model training and evaluation
Real-World Application: Work with actual datasets including video game sales, Google Play Store apps, and medical data

🚀 Getting Started (See Results in 30 Seconds!)

Explore the Complete Solution First

Before starting the challenge, explore the complete working solution to understand what you're building toward:

bash

# Open the complete baseline notebook
jupyter notebook v1-baseline-100percent.ipynb

What to observe:

Part 1: How data cleaning removes unwanted information
Part 2: How encoding transforms categories to numbers, and scaling normalizes values
Part 3: How train-test splitting prepares data for machine learning
Run all cells (Cell -> Run All) to see the complete workflow

Start Your Challenge

Once you've explored the baseline, open your challenge notebook:

bash

# Open the student template
jupyter notebook activity-05-data-preparation.ipynb

📋 What's Already Working

Your template comes with 66.7% of the code already working, including:

✅ Complete Examples Provided

Data loading: All dataset imports and initial setup
Data exploration: Commands to view and understand datasets
Example implementations: First instances of each technique (drop, dropna, etc.)
Validation code: Cells to check your work

✅ Supporting Code

All necessary library imports (pandas, sklearn, numpy)
Sample datasets for practice
Output verification cells
Expected results documentation

📝 Your Tasks (19 TODOs to Complete)

You need to implement the remaining 33.3% of the code across three main sections:

Part 1: Data Cleaning (6 TODOs - Easy)

Work with a sample student dataset and video game sales data:

TODO	Task	Difficulty	Skills
1	Remove Gender Column	Easy	DataFrame manipulation
2	Remove Duplicate Rows	Easy	Duplicate handling
3	Remove Missing Values	Easy	Missing data management
4	Remove Multiple Columns	Easy	Bulk column operations
5	Check Missing Values	Easy	Data inspection
6	Clean Real Dataset	Easy	Applying cleaning pipeline

Key Concepts: drop(), drop_duplicates(), dropna(), isnull().sum()
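
The four cleaning calls above can be tried on a tiny table before touching the real data. This is a minimal sketch, not the notebook's solution: the `students` DataFrame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical student data: one duplicated row and one missing score
students = pd.DataFrame({
    "Name": ["Ana", "Ben", "Ben", "Cho", "Dee"],
    "Gender": ["F", "M", "M", "F", "F"],
    "Score": [88.0, 72.0, 72.0, np.nan, 95.0],
})

# Count missing values per column before cleaning
missing_before = students.isnull().sum()

# Each method returns a new DataFrame, so the result is reassigned
cleaned = students.drop(columns=["Gender"])  # remove an unwanted column
cleaned = cleaned.drop_duplicates()          # remove repeated rows
cleaned = cleaned.dropna()                   # remove rows with missing values

print(missing_before)
print(cleaned.shape)  # (3, 2)
```

Five rows become three: one duplicate and one row with a missing score are dropped.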

Part 2: Data Preprocessing (4 TODOs - Medium)

Learn to transform data for machine learning:

TODO	Task	Difficulty	Skills
7	Setup Label Encoder	Medium	Categorical encoding
8	Transform Country Data	Medium	Applying encoders
9	Setup MinMaxScaler	Medium	Numerical scaling
10	Scale Score Data	Medium	Applying scalers

Key Concepts: LabelEncoder, MinMaxScaler, fit(), transform()
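
As a rough illustration of the fit-then-transform pattern, the sketch below encodes a categorical column and scales a numeric one. The column names and values are assumptions chosen for the example; the notebook's own variables (such as `country_data`) may be shaped differently.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Hypothetical data for illustration only
df = pd.DataFrame({
    "Country": ["Japan", "Brazil", "Kenya", "Brazil", "France"],
    "Score": [50, 100, 0, 75, 25],
})

# LabelEncoder works on 1D input and assigns codes in sorted class order
encoder = LabelEncoder()
encoder.fit(df["Country"])
df["Country_encoded"] = encoder.transform(df["Country"])

# MinMaxScaler expects 2D input, hence the double brackets
scaler = MinMaxScaler()
scaler.fit(df[["Score"]])
df["Score_scaled"] = scaler.transform(df[["Score"]]).ravel()

print(list(encoder.classes_))  # ['Brazil', 'France', 'Japan', 'Kenya']
print(df)
```

`fit()` learns the classes (or the min and max), and `transform()` applies what was learned, which is why the order matters.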

Part 3: Train-Test Splitting (9 TODOs - Easy to Hard)

Apply all concepts in realistic machine learning scenarios:

TODO	Task	Difficulty	Skills
11	Split Sample Data	Medium	Basic train-test split
12	Setup Gender Encoder	Easy	Encoder repetition
13	Encode Gender Data	Easy	Transform practice
14	Setup Category Encoder	Medium	Real dataset encoding
15	Encode Categories	Medium	Column transformation
16	Setup Rating Scaler	Medium	Real dataset scaling
17	Scale Ratings	Medium	Column scaling
18	Split GPS Data	Hard	Integration challenge
19	Split Medical Data	Hard	Full pipeline

Key Concepts: train_test_split(), data preparation pipelines, 80/20 splits
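
The integration TODOs chain encoding, scaling, and splitting. The sketch below shows one way that chain could look on a miniature stand-in for an app dataset. The `apps` DataFrame, its `Popular` target column, and all values are hypothetical, and it follows the TODO order (scale, then split); in practice many workflows fit the scaler on the training split only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Hypothetical 10-row stand-in for an app dataset
apps = pd.DataFrame({
    "Category": ["GAME", "TOOLS", "GAME", "FAMILY", "TOOLS",
                 "GAME", "FAMILY", "TOOLS", "GAME", "FAMILY"],
    "Rating": [4.5, 3.0, 4.0, 5.0, 1.0, 3.5, 4.2, 2.5, 4.8, 3.9],
    "Popular": [1, 0, 1, 1, 0, 0, 1, 0, 1, 0],  # made-up target column
})

# Encode the categorical column and scale the numeric one
apps["Category"] = LabelEncoder().fit_transform(apps["Category"])
apps["Rating"] = MinMaxScaler().fit_transform(apps[["Rating"]]).ravel()

# Separate features from the target, then make an 80/20 split
X = apps[["Category", "Rating"]]
y = apps["Popular"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```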

🎓 Learning Progression

The TODOs are designed to build your skills progressively:

Foundation (TODOs 1-6): Basic data cleaning operations
Core Skills (TODOs 7-13): Encoding and scaling individual datasets
Integration (TODOs 14-19): Combining multiple techniques in realistic scenarios

✅ Success Criteria

For each TODO, you'll find:

TASK: Clear description of what to implement
SUCCESS CRITERIA: Specific outcomes to achieve
HINTS: Technical guidance and patterns to follow

Example TODO Structure:

python

# ⚠️ TODO 7: Setup Label Encoder (Medium)
#
# TASK: Import LabelEncoder, create instance, fit with country_data, print classes
#
# SUCCESS CRITERIA:
# - Correctly import required classes from sklearn
# - Create instance and fit with appropriate data
# - Print classes or parameters to verify
#
# HINTS:
# - Pattern: from sklearn.preprocessing import LabelEncoder
#
# Your code here:

📊 Expected Outcomes

After completing all TODOs, you will:

Clean Data: Remove the unwanted data (about 40%) from the video game dataset
Encode Categories: Transform country names -> numeric codes (0-4)
Scale Values: Convert scores (0-100) -> normalized values (0.0-1.0)
Split Datasets: Create train (80%) and test (20%) sets for 3 different datasets
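
To see where the 0.0-1.0 values come from, the min-max calculation can be done by hand. With its default range, `MinMaxScaler` applies `(x - min) / (max - min)` to each column; the sketch below reproduces that arithmetic with NumPy on made-up scores.

```python
import numpy as np

# Hypothetical scores on a 0-100 scale
scores = np.array([0.0, 40.0, 55.0, 100.0])

# Min-max formula: shift by the minimum, divide by the range
scaled = (scores - scores.min()) / (scores.max() - scores.min())

print(scaled)
```

The smallest score maps to 0.0, the largest to 1.0, and everything else keeps its relative position in between.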

🚀 Extension Challenges

Completed all TODOs? Try these advanced challenges:

Custom Cleaning Pipeline: Create a function that automates the complete cleaning process
Different Scaling Methods: Try StandardScaler or RobustScaler instead of MinMaxScaler
Stratified Splitting: Use stratification to ensure balanced classes in train/test sets
Feature Engineering: Create new features before encoding (e.g., combining columns)
Pipeline Object: Use sklearn's Pipeline to chain preprocessing steps
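
Two of these extensions, stratified splitting and a `Pipeline` object, could be combined roughly as follows. This is a sketch under stated assumptions: the data is random, the class imbalance is invented, and `SimpleImputer` and `StandardScaler` are just one possible choice of steps.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 100 samples, 3 features, imbalanced 80/20 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[::10, 0] = np.nan                # inject some missing values
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Chain the preprocessing steps; fit on training data only
prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
X_train_prepared = prep.fit_transform(X_train)
X_test_prepared = prep.transform(X_test)

# Class 1 makes up 20% of both the training and the test labels
print(y_train.mean(), y_test.mean())
```

Fitting the pipeline on the training split and only transforming the test split keeps test-set statistics out of the preprocessing.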

🔧 Technical Requirements

Dependencies

python

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split

Datasets Used

Sample Student Data: Small DataFrame for learning basics
Video Game Sales (vgsales.csv): 16,598 games with sales data
Google Play Store (googleplaystore.csv): 10,841 apps with ratings
Breast Cancer (breastcancer.csv): Medical dataset for classification

All datasets download automatically when running the notebook.

💡 Tips for Success

General Approach

Read the v1 baseline first: See working examples of every technique
Start with Easy TODOs: Build confidence with foundational tasks
Test incrementally: Run cells after each TODO to verify correctness
Compare outputs: Check your results against expected outputs in comments
Use print statements: Verify data shape and content after transformations

Common Pitfalls

❌ Mistake: Forgetting to reassign transformed data

python

# Wrong
encoder.fit(data)
encoder.transform(data)  # Result is discarded, so data is still not encoded

# Correct
encoder.fit(data)
data_encoded = encoder.transform(data)  # Keep the encoded result
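
One way to confirm that the encoded result was actually kept is to decode it again. The sketch below uses `fit_transform()` as a one-step shorthand and `inverse_transform()` as a round-trip check; the country list is made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical input values
countries = ["Peru", "India", "Peru", "Chile"]

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)      # fit and transform in one step
restored = encoder.inverse_transform(codes)   # map codes back to labels

print(codes)     # [2 1 2 0]
print(restored)
```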

❌ Mistake: Using wrong shape for scaling

python

# Wrong
scaler.fit(df['column'])  # 1D Series, raises ValueError

# Correct
scaler.fit(df[['column']])  # 2D DataFrame
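
The difference between the two selections is easy to check with `.shape`. The sketch below, using a made-up three-row DataFrame, also shows `reshape(-1, 1)` as an alternative way to get a 2D input from a 1D column.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-column data
df = pd.DataFrame({"column": [10, 20, 30]})

print(df["column"].shape)    # (3,)   single brackets give a 1D Series
print(df[["column"]].shape)  # (3, 1) double brackets give a 2D DataFrame

scaler = MinMaxScaler()
try:
    scaler.fit(df["column"])  # 1D input is rejected
    raised = False
except ValueError:
    raised = True

# Alternative fix: reshape the 1D values into a single-column 2D array
values_2d = df["column"].to_numpy().reshape(-1, 1)
scaled = scaler.fit_transform(values_2d)
print(raised, scaled.shape)  # True (3, 1)
```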

❌ Mistake: Not setting random_state

python

# Non-reproducible
train_test_split(X, y, test_size=0.2)

# Reproducible
train_test_split(X, y, test_size=0.2, random_state=42)
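
A quick way to see what `random_state` buys you is to split the same data twice and compare. This sketch uses placeholder arrays; the value 42 is arbitrary, and any fixed integer gives the same effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 20 samples, 2 features
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# Two independent calls with the same seed
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.2, random_state=42)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.2, random_state=42)

# Identical seeds select identical rows for the test set
same = np.array_equal(y_test_a, y_test_b)
print(same)  # True
```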

Debugging Checklist

When code doesn't work:

Did you import the required class?
Did you create an instance (e.g., encoder = LabelEncoder())?
Did you fit before transform?
Did you use correct variable names?
Did you reassign the result?
Is your data the correct shape (1D vs 2D)?

📚 Resources

Documentation

Key Methods

Method	Purpose	Example
`df.drop()`	Remove columns/rows	`df.drop(columns=['col1'])`
`df.dropna()`	Remove missing values	`df.dropna()`
`LabelEncoder`	Encode categories	`encoder.fit_transform(data)`
`MinMaxScaler`	Scale to 0-1 range	`scaler.fit_transform(data)`
`train_test_split`	Split data	`X_train, X_test, y_train, y_test = train_test_split(X, y)`

🎯 Assessment

You'll know you've succeeded when:

All 19 TODOs completed with working code
Notebook runs from top to bottom without errors
Outputs match expected results in comments
You can explain what each preprocessing step accomplishes
You understand when to use encoding vs scaling

🆘 Getting Help

Review the v1 baseline: The complete solution shows working examples
Check error messages: They often tell you exactly what's wrong
Print intermediate results: Use print() to see what your data looks like
Read the hints: Each TODO includes specific guidance
Consult documentation: Links provided in Resources section

Estimated Time: 90-120 minutes

Difficulty Breakdown:

Easy: 8 TODOs (40-50 minutes)
Medium: 9 TODOs (30-45 minutes)
Hard: 2 TODOs (20-25 minutes)

Good luck with your data preparation challenge! 🚀