By the end of this lesson, you will be able to:

- Explore a dataset with basic Pandas commands
- Clean data by removing unwanted rows and columns, duplicates, and missing values
- Preprocess data with encoding and scaling
- Split data into training and testing sets
:information_source: Data Preparation is the process of getting your data ready for machine learning. Think of it like preparing ingredients before cooking - you need to clean, chop, and organize everything before you can make a great meal!
To prepare data for machine learning models, you need to follow these four main steps:

1. Data Exploration - understand what you're working with
2. Data Cleaning - remove errors and inconsistencies
3. Data Preprocessing - encode and scale your data
4. Data Splitting - separate training and testing sets
Data Exploration

Before you start processing data, you need to understand what you're working with. It's like examining your ingredients before cooking!
:bulb: Always explore your data first! This helps you spot problems early and understand what cleaning steps you'll need.
Let's review the 4 basic Pandas commands that help you explore any dataset:
1. columns - Shows all column names

This command shows you all the column headers in your dataset.

```python
data.columns
```

```text
Index(['students', 'math', 'science', 'english'], dtype='object')
```
2. index - Shows all row labels
This command displays the row identifiers (usually numbers starting from 0).
```python
data.index
```

```text
RangeIndex(start=0, stop=5, step=1)
```
3. head() - Shows the first 5 rows
This command gives you a quick peek at your data by showing the first 5 entries.
```python
data.head()
```

```text
  students  math  science  english
0     Adam    87       78       90
1      Bob    42       51       66
2  Crystal    68       50       42
3    David    99       86       83
4   Edmund    53       70       91
```
4. info() - Shows basic information about your dataset
This command provides a summary including data types and memory usage.
```python
data.info()
```

```text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   students  5 non-null      object
 1   math      5 non-null      int64
 2   science   5 non-null      int64
 3   english   5 non-null      int64
dtypes: int64(3), object(1)
memory usage: 288.0+ bytes
```
Now let's look at 2 more commands that help you understand your dataset's structure:
1. shape - Shows dataset dimensions (rows x columns)

```python
data.shape
```

```text
(5, 4)
```
This output tells you the dataset has 5 rows (data points) and 4 columns (features).
2. len() - Shows the total number of rows
```python
len(data)
```

```text
5
```
This confirms that your dataset contains 5 rows of data.

:memo: What these commands tell you: these exploration tools help you answer important questions - how big is the dataset, which features does it contain, and what data types are you working with - before you start cleaning your data!
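Want to try all of these at once? Here's a minimal sketch that rebuilds the sample dataset from the outputs above (the values come from the head() output shown earlier) and runs every exploration command:

```python
import pandas as pd

# Rebuild the sample dataset shown in the outputs above
data = pd.DataFrame({
    'students': ['Adam', 'Bob', 'Crystal', 'David', 'Edmund'],
    'math':     [87, 42, 68, 99, 53],
    'science':  [78, 51, 50, 86, 70],
    'english':  [90, 66, 42, 83, 91],
})

print(data.columns)  # column names
print(data.index)    # row labels
print(data.head())   # first 5 rows
data.info()          # dtypes and memory usage
print(data.shape)    # (5, 4)
print(len(data))     # 5
```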
Data Cleaning

After exploring your data, you need to clean it! Data cleaning removes errors and inconsistencies that could confuse your machine learning model.
We'll learn 3 essential data cleaning techniques using Pandas:

1. Removing unwanted rows or columns
2. Removing duplicate rows
3. Removing missing values
Removing Rows or Columns

Sometimes you need to remove unnecessary data. The df.drop() function helps you delete specific rows or columns.

Key parameters:

- index=[index_names] - Specifies which rows to remove
- columns=[column_names] - Specifies which columns to remove
- inplace=bool - If True, modifies the original data; if False, creates a copy

Example: Removing a column
```python
# This removes the 'Gender' column permanently
data.drop(columns=['Gender'], inplace=True)
```
Example: Removing a row

```python
# This removes the row at index 1 (the second row) and returns a copy,
# because inplace is not set to True
data.drop(index=[1])
```
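Here's a self-contained sketch of both operations, using a small made-up roster (the column names here are just for illustration):

```python
import pandas as pd

# Made-up roster with a column we don't need
data = pd.DataFrame({
    'Player': ['Jason', 'Jimmy', 'Lisa'],
    'Gender': ['Male', 'Male', 'Female'],
    'Score':  [100, 20, 80],
})

data.drop(columns=['Gender'], inplace=True)  # removes the column from 'data' itself
smaller = data.drop(index=[1])               # returns a copy without row 1

print(data)     # 3 rows, no 'Gender' column
print(smaller)  # 2 rows: Jason and Lisa
```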
Removing Duplicates

Duplicate data can confuse your model. Let's learn how to find and remove duplicates!
Step 1: Check if a column has all unique values

```python
# Returns True if all values are unique, False if duplicates exist
data['Player'].is_unique
```
Step 2: See all unique values in a column
```python
# Shows each unique value only once
data['City'].unique()
```
Step 3: Remove duplicate rows
Use df.drop_duplicates() with these parameters:

- subset=[column_names] - Which columns to check for duplicates
- keep="first" or "last" - Keep the first or last occurrence (default: "first")
- inplace=bool - Modify original data (True) or create a copy (False)

Example:
```python
# Keeps only the first occurrence of each name
data.drop_duplicates(subset=['name'], keep='first', inplace=True)
```
:bulb: Always check for duplicates before training your model! Duplicate data can make your model think certain patterns are more important than they really are.
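A quick end-to-end sketch, using a made-up player list with one duplicate row:

```python
import pandas as pd

# Made-up data: 'Jimmy' appears twice
data = pd.DataFrame({
    'Player': ['Jason', 'Jimmy', 'Jimmy', 'Lisa'],
    'City':   ['Tokyo', 'Paris', 'Paris', 'Tokyo'],
})

print(data['Player'].is_unique)  # False - duplicates exist
print(data['City'].unique())     # ['Tokyo' 'Paris']

data.drop_duplicates(subset=['Player'], keep='first', inplace=True)
print(len(data))                 # 3 rows remain
```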
Removing Missing Values
Missing values (like empty cells) can break your machine learning model. Here's how to handle them:
Step 1: Find missing values

```python
# Shows how many missing values are in each column
data.isna().sum()
```
Step 2: Remove rows with missing values

Use df.dropna() with these parameters:

- subset=[column_names] - Which columns to check for missing values
- inplace=bool - Modify original data (True) or create a copy (False)

Example:
```python
# Removes any row where 'Category' column is empty
data.dropna(subset=['Category'], inplace=True)
```
:memo: Missing Value Strategy: dropping rows is the simplest fix, but it throws away data. If many values are missing, consider filling them in instead (for example with a column's average) rather than deleting rows.
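Both steps together, on a made-up table with one empty cell:

```python
import pandas as pd
import numpy as np

# Made-up data where one 'Category' value is missing (NaN)
data = pd.DataFrame({
    'Product':  ['pen', 'book', 'lamp'],
    'Category': ['office', np.nan, 'home'],
})

print(data.isna().sum())  # 'Category' reports 1 missing value

data.dropna(subset=['Category'], inplace=True)
print(len(data))          # 2 rows remain
```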
Before we learn about data conversion, let's meet your new best friend for machine learning: Scikit-Learn!
:information_source: Scikit-Learn is a powerful Python library that makes machine learning easy. Think of it as a toolbox filled with ready-to-use machine learning tools!
What can Scikit-Learn do? Scikit-Learn helps you with all three stages of machine learning: preparing your data, training models, and evaluating the results.
Who uses Scikit-Learn? Many well-known companies, including Spotify and J.P. Morgan, use it in production.
:memo: Important: Scikit-Learn is perfect for traditional machine learning, but for deep learning (like image recognition), you'll need other tools like TensorFlow or PyTorch.
To learn more about scikit-learn, visit their website: https://scikit-learn.org/stable/index.html
Installing Scikit-Learn
Since Scikit-Learn is an external library, you need to install it first.
Good news for Google Colab users: :tada: Scikit-Learn comes pre-installed in Google Colab! You can start using it right away.

For local installation: If you want to install it on your computer (typically with `pip install scikit-learn`), follow the guide here: https://scikit-learn.org/stable/install.html#install-official-release

Data Preprocessing
After cleaning your data, you need to transform it into a format that machines can understand better.
:information_source: Data Preprocessing transforms your data to make it easier for machine learning models to learn patterns. It's like translating human-readable information into machine language!
We'll learn 2 essential preprocessing techniques:
- Data Encoding - Convert text categories into numbers
- Data Scaling - Make all numbers comparable in size
Data Encoding
Computers only understand numbers, not words! Data Encoding converts text categories (like "Male" or "Female") into numbers that machines can process.
Example: Encoding Gender
Before Encoding:

Name | Gender |
---|---|
Jason | Male |
Jimmy | Male |
Lisa | Female |

After Encoding:

Name | Gender |
---|---|
Jason | 0 |
Jimmy | 0 |
Lisa | 1 |

:bulb: The computer now sees Male = 0 and Female = 1. This makes it easy for the model to do math with the data!
How to encode data using Scikit-Learn:
```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])
```
The encoder learns the unique values and assigns numbers alphabetically:
```python
list(le.classes_)
```

Output:

```text
['amsterdam', 'paris', 'tokyo']
```
```python
le.transform(["tokyo", "tokyo", "paris"])
```

Output:

```text
array([2, 2, 1])
```
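LabelEncoder can also translate numbers back into the original labels with its inverse_transform method, which is handy when you want to read a model's numeric output:

```python
# Convert encoded numbers back into their original text labels
list(le.inverse_transform([2, 2, 1]))  # ['tokyo', 'tokyo', 'paris']
```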
Data Scaling

Data scaling makes all your numbers comparable by putting them on the same scale. This helps the model learn faster!
:memo: Why do we need scaling? When numbers have very different ranges, the model might think bigger numbers are more important. For example:
- Age: 5-15 years
- Income: $10,000-$1,000,000
Without scaling, the model might ignore age because income numbers are so much bigger!
Values with very different ranges, like the ages and incomes above, need scaling before training.

Solution: Scale all values to be between 0 and 1!
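Min-max scaling does this with a simple formula: scaled = (value - min) / (max - min). For example, a mark of 20 on a 0-100 range becomes (20 - 0) / (100 - 0) = 0.20, as you'll see in the table below.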
Example: Scaling Exam Marks
Before Scaling:
Name | Exam Marks |
---|---|
Jason | 100 |
Jimmy | 20 |
Lisa | 80 |
Alvin | 0 |
After Scaling:
Name | Exam Marks |
---|---|
Jason | 1.00 |
Jimmy | 0.20 |
Lisa | 0.80 |
Alvin | 0.00 |
How to scale data using Scikit-Learn's MinMaxScaler:
```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# Data must be in 2D array format (rows and columns)
data = np.array([[100], [20], [80], [0]])
```
:bulb: If your data is 1D, reshape it first:

```python
data = np.array([100, 20, 80, 0])
data = data.reshape(-1, 1)  # Convert to 2D: one column, as many rows as needed
```
```python
scaler.fit(data)
print('Min:', scaler.data_min_)
print('Max:', scaler.data_max_)
```

Output:

```text
Min: [0.]
Max: [100.]
```
```python
scaler.transform(data)
```

Output:

```text
[[1. ]
 [0.2]
 [0.8]
 [0. ]]
```
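As a convenience, Scikit-Learn also lets you fit and transform in a single step with fit_transform:

```python
# Learns the min/max and scales the data in one call
scaled = scaler.fit_transform(data)
```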
Splitting Your Data

The final step in data preparation is splitting your data into two separate sets:
:information_source: Why split data? Think of it like studying for a test:
- You study with practice questions (training data)
- You take the actual test with new questions (testing data)
This ensures your model really learned, not just memorized!
The two datasets you need:

- Training dataset - the data the model learns from (the "practice questions")
- Testing dataset - new, unseen data used to check what the model really learned (the "actual test")
Example: Splitting a Dataset
Original Dataset (5 students):
Name | Exam Marks |
---|---|
Jason | 100 |
Jimmy | 20 |
Lisa | 80 |
Alvin | 0 |
Alice | 67 |
Training Dataset (60% - 3 students):
Name | Exam Marks |
---|---|
Jason | 100 |
Jimmy | 20 |
Lisa | 80 |
Testing Dataset (40% - 2 students):
Name | Exam Marks |
---|---|
Alvin | 0 |
Alice | 67 |
How to split data using Scikit-Learn:
```python
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=33)
```
Understanding the parameters:

- x = Your features (the data columns used for prediction)
- y = Your labels (what you want to predict)
- test_size=0.33 = Use 33% for testing, 67% for training
- random_state=33 = Ensures the same split every time (for reproducible results)

What you get back:

- x_train = Training features (67% of data)
- x_test = Testing features (33% of data)
- y_train = Training labels
- y_test = Testing labels

:bulb: Common split ratios:
- 80/20 -> test_size=0.20 (Most common)
- 70/30 -> test_size=0.30
- 67/33 -> test_size=0.33
Choose based on how much data you have!
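Here's a runnable sketch using the five exam marks from the tables above (the pass/fail labels are made up for illustration; note that train_test_split shuffles rows by default, so which students land in each set depends on random_state):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Features: exam marks. Labels: made-up pass/fail values.
x = np.array([[100], [20], [80], [0], [67]])
y = np.array([1, 0, 1, 0, 1])

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.4, random_state=33
)

print(x_train.shape)  # (3, 1) - 60% of the rows for training
print(x_test.shape)   # (2, 1) - 40% for testing
```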
You've learned the complete data preparation pipeline for machine learning:

1. Explore your data (.columns, .head(), .info(), .shape) to understand your dataset
2. Clean it by removing unwanted rows and columns, duplicates, and missing values
3. Preprocess it with encoding and scaling so models can learn from it
4. Split it into training and testing sets

Remember: Good data preparation is the foundation of successful machine learning!
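If you'd like to see the whole pipeline in one place, here's a compact sketch on a made-up dataset (every name and value is invented for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split

# 1. Explore - made-up data with a duplicate, a missing value, and text labels
data = pd.DataFrame({
    'name':   ['Jason', 'Jimmy', 'Jimmy', 'Lisa', 'Alvin', 'Alice'],
    'gender': ['Male', 'Male', 'Male', 'Female', None, 'Female'],
    'marks':  [100, 20, 20, 80, 0, 67],
})
print(data.shape)  # (6, 3)

# 2. Clean - remove the duplicate row and the row with a missing value
data.drop_duplicates(subset=['name'], inplace=True)
data.dropna(subset=['gender'], inplace=True)

# 3. Preprocess - encode the text labels, scale the marks to 0-1
data['gender'] = LabelEncoder().fit_transform(data['gender'])
data[['marks']] = MinMaxScaler().fit_transform(data[['marks']])

# 4. Split - 80% for training, 20% for testing
x, y = data[['marks']], data['gender']
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=33
)
print(len(x_train), 'training rows,', len(x_test), 'testing rows')
```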
Code with AI: Try using AI to learn data preprocessing techniques.

Practice Prompts:

- Data Explorer Challenge
- Cleaning Detective
- Encoding Practice
- Scaling Challenge
- Split Master