Student starter code (30% baseline)
- index.html - Main HTML page
- script.js - JavaScript logic
- styles.css - Styling and layout
- package.json - Dependencies
- setup.sh - Setup script
- README.md - Instructions (below)

💡 Download the ZIP, extract it, and follow the instructions below to get started!
By completing this activity, you will:
Before starting the challenge, explore the complete working solution to understand what you're building toward:
# Open the complete baseline notebook
jupyter notebook v1-baseline-100percent.ipynb
What to observe:
Once you've explored the baseline, open your challenge notebook:
# Open the student template
jupyter notebook activity-05-data-preparation.ipynb
Your template comes with 66.7% working code including:
You need to implement 33.3% of the code across three main sections:
Work with a sample student dataset and video game sales data:
| TODO | Task | Difficulty | Skills |
|---|---|---|---|
| 1 | Remove Gender Column | Easy | DataFrame manipulation |
| 2 | Remove Duplicate Rows | Easy | Duplicate handling |
| 3 | Remove Missing Values | Easy | Missing data management |
| 4 | Remove Multiple Columns | Easy | Bulk column operations |
| 5 | Check Missing Values | Easy | Data inspection |
| 6 | Clean Real Dataset | Easy | Applying cleaning pipeline |
Key Concepts: drop(), drop_duplicates(), dropna(), isnull().sum()
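A minimal sketch of how these four calls fit together. The table below is made up for illustration (its column names and values are assumptions, not the activity's dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one duplicate row and one missing score
students = pd.DataFrame({
    "name": ["Ana", "Ben", "Ben", "Cho", "Dev"],
    "gender": ["F", "M", "M", "F", "M"],
    "score": [88.0, 92.0, 92.0, np.nan, 75.0],
})

# Inspect first: count missing values per column
missing_before = students.isnull().sum()

# Each method returns a new DataFrame, so reassign the result
cleaned = students.drop(columns=["gender"])  # remove one column
cleaned = cleaned.drop_duplicates()          # remove the repeated "Ben" row
cleaned = cleaned.dropna()                   # remove the row with a missing score

print(missing_before)
print(cleaned)
```

None of these methods modify the DataFrame in place by default, which is why each result is assigned back to `cleaned`.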
Learn to transform data for machine learning:
| TODO | Task | Difficulty | Skills |
|---|---|---|---|
| 7 | Setup Label Encoder | Medium | Categorical encoding |
| 8 | Transform Country Data | Medium | Applying encoders |
| 9 | Setup MinMaxScaler | Medium | Numerical scaling |
| 10 | Scale Score Data | Medium | Applying scalers |
Key Concepts: LabelEncoder, MinMaxScaler, fit(), transform()
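The fit-then-transform pattern might look like the sketch below. The DataFrame and its column names are hypothetical stand-ins, not the notebook's actual variables:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Hypothetical data for illustration only
df = pd.DataFrame({
    "country": ["Spain", "Brazil", "Japan", "Brazil"],
    "score": [50.0, 75.0, 100.0, 60.0],
})

# LabelEncoder takes a 1D column; fit() learns the sorted set of classes
encoder = LabelEncoder()
encoder.fit(df["country"])
df["country_encoded"] = encoder.transform(df["country"])

# MinMaxScaler expects 2D input, hence the double brackets
scaler = MinMaxScaler()
scaler.fit(df[["score"]])
df["score_scaled"] = scaler.transform(df[["score"]]).ravel()

print(list(encoder.classes_))
print(df)
```

Here `fit()` only learns the mapping (the classes, or the column minimum and maximum), and `transform()` applies it and returns the converted values.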
Apply all concepts in realistic machine learning scenarios:
| TODO | Task | Difficulty | Skills |
|---|---|---|---|
| 11 | Split Sample Data | Medium | Basic train-test split |
| 12 | Setup Gender Encoder | Easy | Encoder repetition |
| 13 | Encode Gender Data | Easy | Transform practice |
| 14 | Setup Category Encoder | Medium | Real dataset encoding |
| 15 | Encode Categories | Medium | Column transformation |
| 16 | Setup Rating Scaler | Medium | Real dataset scaling |
| 17 | Scale Ratings | Medium | Column scaling |
| 18 | Split GPS Data | Hard | Integration challenge |
| 19 | Split Medical Data | Hard | Full pipeline |
Key Concepts: train_test_split(), data preparation pipelines, 80/20 splits
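As a rough illustration of an 80/20 split, assuming a tiny made-up feature table `X` and target `y` (both hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical feature table and target with 10 rows
X = pd.DataFrame({"feature": range(10)})
y = pd.Series([0, 1] * 5, name="label")

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

The rows of `X` and `y` are shuffled together, so each feature row keeps its matching label in both the training and test sets.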
The TODOs are designed to build your skills progressively:
For each TODO, you'll find:
# ⚠️ TODO 7: Setup Label Encoder (Medium)
#
# TASK: Import LabelEncoder, create instance, fit with country_data, print classes
#
# SUCCESS CRITERIA:
# - Correctly import required classes from sklearn
# - Create instance and fit with appropriate data
# - Print classes or parameters to verify
#
# HINTS:
# - Pattern: from sklearn.preprocessing import LabelEncoder
#
# Your code here:
After completing all TODOs, you will:
Completed all TODOs? Try these advanced challenges:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
- vgsales.csv: 16,598 games with sales data
- googleplaystore.csv: 10,841 apps with ratings
- breastcancer.csv: Medical dataset for classification

All datasets download automatically when running the notebook.
❌ Mistake: Forgetting to reassign transformed data
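# Wrong
encoder.fit(data)
# fit() only learns the classes; the data is not encoded yet!
# Correct
encoder.fit(data)
data_encoded = encoder.transform(data)  # transform() returns the encoded values, so assign them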
❌ Mistake: Using wrong shape for scaling
# Wrong
scaler.fit(df['column']) # 1D Series: raises ValueError
# Correct
scaler.fit(df[['column']]) # 2D DataFrame
❌ Mistake: Not setting random_state
# Non-reproducible
train_test_split(X, y, test_size=0.2)
# Reproducible
train_test_split(X, y, test_size=0.2, random_state=42)
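The last two mistakes can be checked directly. The sketch below uses a made-up single-column DataFrame (an assumption for illustration) to show the 1D versus 2D shapes and to confirm that a fixed random_state repeats the same split:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data for illustration only
df = pd.DataFrame({"column": [10.0, 20.0, 30.0, 40.0, 50.0]})

# Single brackets give a 1D Series, double brackets a 2D DataFrame
print(df["column"].shape)    # (5,)
print(df[["column"]].shape)  # (5, 1)

# Fitting a scaler on the 1D Series is rejected with a ValueError
try:
    MinMaxScaler().fit(df["column"])
    fit_failed = False
except ValueError:
    fit_failed = True

# The same random_state selects the same test rows on every call
first = train_test_split(df, test_size=0.2, random_state=42)
second = train_test_split(df, test_size=0.2, random_state=42)
same_split = first[1].index.equals(second[1].index)

print(fit_failed, same_split)
```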
When code doesn't work:
- Did you create the instance first (encoder = LabelEncoder())?

| Method | Purpose | Example |
|---|---|---|
| df.drop() | Remove columns/rows | df.drop(columns=['col1']) |
| df.dropna() | Remove missing values | df.dropna() |
| LabelEncoder | Encode categories | encoder.fit_transform(data) |
| MinMaxScaler | Scale to 0-1 range | scaler.fit_transform(data) |
| train_test_split | Split data | X_train, X_test, y_train, y_test = train_test_split(X, y) |
You'll know you've succeeded when:
- Use print() to see what your data looks like

Estimated Time: 90-120 minutes
Difficulty Breakdown:
Good luck with your data preparation challenge! 🚀