Practice and reinforce the concepts from Lesson 5
In this hands-on exercise, you'll gain practical experience with:
Total Time Required: 90-120 minutes
Access the exercise template here: Data Preparation Exercise Template
In this section, you will:
- Handle missing values in the dataset
- Remove duplicate entries
- Fix inconsistent data formats
- Clean text data
:bulb: Common Challenge: When handling missing values, consider the data type and context. Numerical data might use mean/median, while categorical data might use mode or a special category.
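The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on a made-up DataFrame; the column names (`age`, `city`, `signup`) and the choice of median imputation are assumptions, not part of the exercise template.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 40, 31],
    "city": [" new york", "Boston ", "boston", "boston", None],
    "signup": ["2023-01-05", "2023-02-05", "2023-03-01", "2023-03-01", "2023-04-10"],
})

# Clean text first so that near-duplicates become exact duplicates.
df["city"] = df["city"].str.strip().str.title()

# Fix inconsistent formats: parse date strings into a datetime column.
df["signup"] = pd.to_datetime(df["signup"])

# Remove duplicate rows.
df = df.drop_duplicates().reset_index(drop=True)

# Missing values: median for numeric data, a special category for categorical data.
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna("Unknown")

print(df)
```

Cleaning the text before dropping duplicates matters here: "boston" and "Boston " only count as duplicates once whitespace and casing are normalized.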
:computer: Part 2: Data Preprocessing (30-40 minutes)
You will learn to:
- Scale numerical features
  - Apply StandardScaler or MinMaxScaler
  - Understand when to use each method
  - Validate the scaling results
- Encode categorical variables
  - Use OneHotEncoder for nominal data
  - Apply OrdinalEncoder for ordinal data (LabelEncoder is meant for target labels, not input features)
  - Handle unknown categories
- Handle imbalanced datasets
  - Identify class imbalance
  - Apply SMOTE or undersampling techniques
  - Verify the balanced distribution
- Create new features
  - Engineer meaningful features from existing data
  - Combine features for better insights
  - Document your feature creation logic

:bulb: Best Practice: Always fit your scalers and encoders on the training data only, then transform both the training and test sets. This prevents data leakage!
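One way the fit-on-train-only rule might look in practice is sketched below. The two small frames and their column names (`income`, `color`) are invented for illustration; the exercise dataset will differ.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical train/test frames; column names are illustrative only.
train = pd.DataFrame({"income": [30.0, 50.0, 70.0], "color": ["red", "blue", "red"]})
test = pd.DataFrame({"income": [50.0, 90.0], "color": ["blue", "green"]})  # "green" is unseen

# Fit on the training data only, then transform both sets.
scaler = StandardScaler().fit(train[["income"]])
train_income = scaler.transform(train[["income"]])
test_income = scaler.transform(test[["income"]])

# handle_unknown="ignore" encodes categories not seen during fit as an all-zero row.
encoder = OneHotEncoder(handle_unknown="ignore").fit(train[["color"]])
train_color = encoder.transform(train[["color"]]).toarray()
test_color = encoder.transform(test[["color"]]).toarray()

# Validate the scaling: training values should have mean ~0 and std ~1.
print(train_income.mean(), train_income.std())
print(test_color)
```

The test set is scaled with the training mean and standard deviation, so its own mean and standard deviation will generally not be exactly 0 and 1. That is expected.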
This section covers:
- Split data into training and testing sets
- Configure split parameters
- Validate the split results
Implementation Task: Split the data using test_size=0.2 and random_state=42.
:warning: Warning Important: Your submission must use the exact parameters specified above for consistency in grading.
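A minimal sketch of the split with the required parameters is shown below. The arrays `X` and `y` are placeholder data, not the exercise dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix and labels, for illustration only.
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Validate the split: sizes should be 80/20 and no row should be lost.
print(X_train.shape, X_test.shape)
assert len(X_train) + len(X_test) == len(X)
```

Because random_state is fixed, rerunning the cell yields the same rows in each set, which is what makes grading consistent.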
:information_source: Submission Deadline: Submit your completed notebook within the allocated class time or as specified by your instructor.
Please submit your work through this link: Exercise Submission Form
- test_size=0.2 for data splitting
- random_state=42 for reproducibility
- DataPrep_YourName.ipynb as the notebook filename
:warning: Warning Before Submitting:
- Restart your kernel and run all cells to ensure everything works
- Double-check that all outputs are visible
- Verify you've used the correct parameters
Import Errors
# Make sure to install required packages
!pip install pandas numpy scikit-learn imbalanced-learn
Memory Issues
del variable_name
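Beyond deleting unused variables, downcasting numeric columns can also reduce memory use. The sketch below assumes a made-up frame with default 64-bit columns; whether downcasting is safe depends on your data, since float32 holds fewer significant digits than float64.

```python
import gc

import numpy as np
import pandas as pd

# Hypothetical frame with default 64-bit dtypes.
df = pd.DataFrame({
    "count": np.arange(1000, dtype="int64"),
    "score": np.linspace(0.0, 1.0, 1000),
})
before = df.memory_usage(deep=True).sum()

# Downcast numeric columns to the smallest dtype that can hold the values.
df["count"] = pd.to_numeric(df["count"], downcast="integer")
df["score"] = pd.to_numeric(df["score"], downcast="float")
after = df.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")

# Drop a large intermediate object, then ask the garbage collector to run.
big_copy = df.copy()
del big_copy
gc.collect()
```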
Scaling Errors
Encoding Issues
Use the handle_unknown='ignore' parameter to handle unknown categories.

:bulb: Tip Need Help? If you encounter issues not listed here, ask your instructor or post in the class discussion forum.