By the end of this lesson, you will be able to:

- Explore a dataset with basic Pandas commands
- Clean data by removing unwanted rows and columns, duplicates, and missing values
- Preprocess data with encoding and scaling
- Split data into training and testing sets
:information_source: Data Preparation is the process of getting your data ready for machine learning. Think of it like preparing ingredients before cooking - you need to clean, chop, and organize everything before you can make a great meal!
To prepare data for machine learning models, you need to follow these four main steps:

1. Data Exploration - understand what you're working with
2. Data Cleaning - remove errors and inconsistencies
3. Data Preprocessing - encode and scale your data
4. Data Splitting - separate training and testing sets
Data Exploration

Before you start processing data, you need to understand what you're working with. It's like examining your ingredients before cooking!
:bulb: Always explore your data first! This helps you spot problems early and understand what cleaning steps you'll need.
Let's review the 4 basic Pandas commands that help you explore any dataset:
1. columns - Shows all column names

This command shows you all the column headers in your dataset.

```python
data.columns
```

```text
Index(['students', 'math', 'science', 'english'], dtype='object')
```
2. index - Shows all row labels
This command displays the row identifiers (usually numbers starting from 0).
```python
data.index
```

```text
RangeIndex(start=0, stop=5, step=1)
```
3. head() - Shows the first 5 rows
This command gives you a quick peek at your data by showing the first 5 entries.
```python
data.head()
```

```text
  students  math  science  english
0     Adam    87       78       90
1      Bob    42       51       66
2  Crystal    68       50       42
3    David    99       86       83
4   Edmund    53       70       91
```
4. info() - Shows basic information about your dataset
This command provides a summary including data types and memory usage.
```python
data.info()
```

```text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   students  5 non-null      object
 1   math      5 non-null      int64
 2   science   5 non-null      int64
 3   english   5 non-null      int64
dtypes: int64(3), object(1)
memory usage: 288.0+ bytes
```
Now let's look at 2 more commands that help you understand your dataset's structure:
1. shape - Shows dataset dimensions (rows x columns)

```python
data.shape
```

```text
(5, 4)
```
This output tells you the dataset has 5 rows (data points) and 4 columns (features).
2. len() - Shows the total number of rows
```python
len(data)
```

```text
5
```
This confirms that your dataset contains 5 rows of data.

:memo: What these commands tell you: these exploration tools help you answer important questions - how big is the dataset, which features does it contain, and what data types are you working with - before you start cleaning your data!
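Want to try all of these at once? Here's a minimal sketch that rebuilds the sample dataset from the outputs above (the values come from the head() output shown earlier) and runs every exploration command:

```python
import pandas as pd

# Rebuild the sample dataset shown in the outputs above
data = pd.DataFrame({
    'students': ['Adam', 'Bob', 'Crystal', 'David', 'Edmund'],
    'math':     [87, 42, 68, 99, 53],
    'science':  [78, 51, 50, 86, 70],
    'english':  [90, 66, 42, 83, 91],
})

print(data.columns)  # column names
print(data.index)    # row labels
print(data.head())   # first 5 rows
data.info()          # dtypes and memory usage
print(data.shape)    # (5, 4)
print(len(data))     # 5
```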
Data Cleaning

After exploring your data, you need to clean it! Data cleaning removes errors and inconsistencies that could confuse your machine learning model.
We'll learn 3 essential data cleaning techniques using Pandas:

1. Removing unwanted rows or columns
2. Removing duplicate rows
3. Removing missing values
Removing Rows or Columns

Sometimes you need to remove unnecessary data. The df.drop() function helps you delete specific rows or columns.

Key parameters:

- index=[index_names] - Specifies which rows to remove
- columns=[column_names] - Specifies which columns to remove
- inplace=bool - If True, modifies the original data; if False, creates a copy

Example: Removing a column
```python
# This removes the 'Gender' column permanently
data.drop(columns=['Gender'], inplace=True)
```
Example: Removing a row

```python
# This removes the row at index 1 (the second row) and returns a copy,
# because inplace is not set to True
data.drop(index=[1])
```
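Here's a self-contained sketch of both operations, using a small made-up roster (the column names here are just for illustration):

```python
import pandas as pd

# Made-up roster with a column we don't need
data = pd.DataFrame({
    'Player': ['Jason', 'Jimmy', 'Lisa'],
    'Gender': ['Male', 'Male', 'Female'],
    'Score':  [100, 20, 80],
})

data.drop(columns=['Gender'], inplace=True)  # removes the column from 'data' itself
smaller = data.drop(index=[1])               # returns a copy without row 1

print(data)     # 3 rows, no 'Gender' column
print(smaller)  # 2 rows: Jason and Lisa
```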
Removing Duplicates

Duplicate data can confuse your model. Let's learn how to find and remove duplicates!
Step 1: Check if a column has all unique values

```python
# Returns True if all values are unique, False if duplicates exist
data['Player'].is_unique
```
Step 2: See all unique values in a column
```python
# Shows each unique value only once
data['City'].unique()
```
Step 3: Remove duplicate rows
Use df.drop_duplicates() with these parameters:

- subset=[column_names] - Which columns to check for duplicates
- keep="first" or "last" - Keep the first or last occurrence (default: "first")
- inplace=bool - Modify original data (True) or create a copy (False)

Example:
```python
# Keeps only the first occurrence of each name
data.drop_duplicates(subset=['name'], keep='first', inplace=True)
```
:bulb: Always check for duplicates before training your model! Duplicate data can make your model think certain patterns are more important than they really are.
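A quick end-to-end sketch, using a made-up player list with one duplicate row:

```python
import pandas as pd

# Made-up data: 'Jimmy' appears twice
data = pd.DataFrame({
    'Player': ['Jason', 'Jimmy', 'Jimmy', 'Lisa'],
    'City':   ['Tokyo', 'Paris', 'Paris', 'Tokyo'],
})

print(data['Player'].is_unique)  # False - duplicates exist
print(data['City'].unique())     # ['Tokyo' 'Paris']

data.drop_duplicates(subset=['Player'], keep='first', inplace=True)
print(len(data))                 # 3 rows remain
```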
Removing Missing Values
Missing values (like empty cells) can break your machine learning model. Here's how to handle them:
Step 1: Find missing values

```python
# Shows how many missing values are in each column
data.isna().sum()
```
Step 2: Remove rows with missing values

Use df.dropna() with these parameters:

- subset=[column_names] - Which columns to check for missing values
- inplace=bool - Modify original data (True) or create a copy (False)

Example:
```python
# Removes any row where 'Category' column is empty
data.dropna(subset=['Category'], inplace=True)
```
:memo: Missing Value Strategy: dropping rows is the simplest fix, but it throws away data. If many values are missing, consider filling them in instead (for example with a column's average) rather than deleting rows.
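Both steps together, on a made-up table with one empty cell:

```python
import pandas as pd
import numpy as np

# Made-up data where one 'Category' value is missing (NaN)
data = pd.DataFrame({
    'Product':  ['pen', 'book', 'lamp'],
    'Category': ['office', np.nan, 'home'],
})

print(data.isna().sum())  # 'Category' reports 1 missing value

data.dropna(subset=['Category'], inplace=True)
print(len(data))          # 2 rows remain
```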
Before we learn about data conversion, let's meet your new best friend for machine learning: Scikit-Learn!
:information_source: Scikit-Learn is a powerful Python library that makes machine learning easy. Think of it as a toolbox filled with ready-to-use machine learning tools!
What can Scikit-Learn do? Scikit-Learn helps you with all three stages of machine learning: preparing your data, training models, and evaluating the results.
Who uses Scikit-Learn? Many well-known companies, including Spotify and J.P. Morgan, use it in production.
:memo: Important: Scikit-Learn is perfect for traditional machine learning, but for deep learning (like image recognition), you'll need other tools like TensorFlow or PyTorch.
To learn more about scikit-learn, visit their website: https://scikit-learn.org/stable/index.html
Installing Scikit-Learn
Since Scikit-Learn is an external library, you need to install it first.
Good news for Google Colab users: :tada: Scikit-Learn comes pre-installed in Google Colab! You can start using it right away.

For local installation: If you want to install it on your computer (typically with `pip install scikit-learn`), follow the guide here: https://scikit-learn.org/stable/install.html#install-official-release

Data Preprocessing
After cleaning your data, you need to transform it into a format that machines can understand better.
:information_source: Data Preprocessing transforms your data to make it easier for machine learning models to learn patterns. It's like translating human-readable information into machine language!
We'll learn 2 essential preprocessing techniques:
- Data Encoding - Convert text categories into numbers
- Data Scaling - Make all numbers comparable in size
Data Encoding
Computers only understand numbers, not words! Data Encoding converts text categories (like "Male" or "Female") into numbers that machines can process.
Example: Encoding Gender
Before Encoding:

Name | Gender |
---|---|
Jason | Male |
Jimmy | Male |
Lisa | Female |

After Encoding:

Name | Gender |
---|---|
Jason | 0 |
Jimmy | 0 |
Lisa | 1 |

:bulb: The computer now sees Male = 0 and Female = 1. This makes it easy for the model to do math with the data!
How to encode data using Scikit-Learn:
```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])
```
The encoder learns the unique values and assigns numbers alphabetically:
```python
list(le.classes_)
```

Output:

```text
['amsterdam', 'paris', 'tokyo']
```
```python
le.transform(["tokyo", "tokyo", "paris"])
```

Output:

```text
array([2, 2, 1])
```
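LabelEncoder can also translate numbers back into the original labels with its inverse_transform method, which is handy when you want to read a model's numeric output:

```python
# Convert encoded numbers back into their original text labels
list(le.inverse_transform([2, 2, 1]))  # ['tokyo', 'tokyo', 'paris']
```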
Data Scaling

Data scaling makes all your numbers comparable by putting them on the same scale. This helps the model learn faster!
:memo: Why do we need scaling? When numbers have very different ranges, the model might think bigger numbers are more important. For example:
- Age: 5-15 years
- Income: $10,000-$1,000,000
Without scaling, the model might ignore age because income numbers are so much bigger!
Values with very different ranges, like the ages and incomes above, need scaling before training.

Solution: Scale all values to be between 0 and 1!
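Min-max scaling does this with a simple formula: scaled = (value - min) / (max - min). For example, a mark of 20 on a 0-100 range becomes (20 - 0) / (100 - 0) = 0.20, as you'll see in the table below.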
Example: Scaling Exam Marks
Before Scaling:
Name | Exam Marks |
---|---|
Jason | 100 |
Jimmy | 20 |
Lisa | 80 |
Alvin | 0 |
After Scaling:
Name | Exam Marks |
---|---|
Jason | 1.00 |
Jimmy | 0.20 |
Lisa | 0.80 |
Alvin | 0.00 |
How to scale data using Scikit-Learn's MinMaxScaler:
```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# Data must be in 2D array format (rows and columns)
data = np.array([[100], [20], [80], [0]])
```
:bulb: If your data is 1D, reshape it first:

```python
data = np.array([100, 20, 80, 0])
data = data.reshape(-1, 1)  # Convert to 2D: one column, as many rows as needed
```
```python
scaler.fit(data)
print('Min:', scaler.data_min_)
print('Max:', scaler.data_max_)
```

Output:

```text
Min: [0.]
Max: [100.]
```
```python
scaler.transform(data)
```

Output:

```text
[[1. ]
 [0.2]
 [0.8]
 [0. ]]
```
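As a convenience, Scikit-Learn also lets you fit and transform in a single step with fit_transform:

```python
# Learns the min/max and scales the data in one call
scaled = scaler.fit_transform(data)
```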
Splitting Your Data

The final step in data preparation is splitting your data into two separate sets:
:information_source: Why split data? Think of it like studying for a test:
- You study with practice questions (training data)
- You take the actual test with new questions (testing data)
This ensures your model really learned, not just memorized!
The two datasets you need:

- Training dataset - the data the model learns from (the "practice questions")
- Testing dataset - new, unseen data used to check what the model really learned (the "actual test")
Example: Splitting a Dataset
Original Dataset (5 students):
Name | Exam Marks |
---|---|
Jason | 100 |
Jimmy | 20 |
Lisa | 80 |
Alvin | 0 |
Alice | 67 |
Training Dataset (60% - 3 students):
Name | Exam Marks |
---|---|
Jason | 100 |
Jimmy | 20 |
Lisa | 80 |
Testing Dataset (40% - 2 students):
Name | Exam Marks |
---|---|
Alvin | 0 |
Alice | 67 |
How to split data using Scikit-Learn:
```python
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=33)
```
Understanding the parameters:

- x = Your features (the data columns used for prediction)
- y = Your labels (what you want to predict)
- test_size=0.33 = Use 33% for testing, 67% for training
- random_state=33 = Ensures the same split every time (for reproducible results)

What you get back:

- x_train = Training features (67% of data)
- x_test = Testing features (33% of data)
- y_train = Training labels
- y_test = Testing labels

:bulb: Common split ratios:
- 80/20 -> test_size=0.20 (Most common)
- 70/30 -> test_size=0.30
- 67/33 -> test_size=0.33
Choose based on how much data you have!
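Here's a runnable sketch using the five exam marks from the tables above (the pass/fail labels are made up for illustration; note that train_test_split shuffles rows by default, so which students land in each set depends on random_state):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Features: exam marks. Labels: made-up pass/fail values.
x = np.array([[100], [20], [80], [0], [67]])
y = np.array([1, 0, 1, 0, 1])

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.4, random_state=33
)

print(x_train.shape)  # (3, 1) - 60% of the rows for training
print(x_test.shape)   # (2, 1) - 40% for testing
```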
You've learned the complete data preparation pipeline for machine learning:

1. Explore your data (.columns, .head(), .info(), .shape) to understand your dataset
2. Clean it by removing unwanted rows and columns, duplicates, and missing values
3. Preprocess it with encoding and scaling so models can learn from it
4. Split it into training and testing sets

Remember: Good data preparation is the foundation of successful machine learning!
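If you'd like to see the whole pipeline in one place, here's a compact sketch on a made-up dataset (every name and value is invented for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split

# 1. Explore - made-up data with a duplicate, a missing value, and text labels
data = pd.DataFrame({
    'name':   ['Jason', 'Jimmy', 'Jimmy', 'Lisa', 'Alvin', 'Alice'],
    'gender': ['Male', 'Male', 'Male', 'Female', None, 'Female'],
    'marks':  [100, 20, 20, 80, 0, 67],
})
print(data.shape)  # (6, 3)

# 2. Clean - remove the duplicate row and the row with a missing value
data.drop_duplicates(subset=['name'], inplace=True)
data.dropna(subset=['gender'], inplace=True)

# 3. Preprocess - encode the text labels, scale the marks to 0-1
data['gender'] = LabelEncoder().fit_transform(data['gender'])
data[['marks']] = MinMaxScaler().fit_transform(data[['marks']])

# 4. Split - 80% for training, 20% for testing
x, y = data[['marks']], data['gender']
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=33
)
print(len(x_train), 'training rows,', len(x_test), 'testing rows')
```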
Code with AI: Try using AI to learn data preprocessing techniques.

Practice Prompts:

- Data Explorer Challenge
- Cleaning Detective
- Encoding Practice
- Scaling Challenge
- Split Master