By the end of this lesson, you will be able to:
:information_source: Definition: Data cleaning is the process of finding and fixing incorrect data in datasets. Think of it like proofreading and editing a document, but for data!
Datasets often contain:
Data cleaning helps us fix these problems to get accurate results!
We'll learn 4 powerful methods to clean data:
We'll use salary_data.csv to practice data cleaning. You can download it here.
This dataset contains information about 10 workers including:
:memo: Notice that our dataset has some problems! It contains missing values (NaN), invalid data, and duplicates. Let's learn how to fix these issues!
outputID NAME GENDER OCCUPATION SALARY CITY 0 111 John M Engineer RM3,700 Kuala Lumpur 2 333 Alice F Accountant RM4,500 Malacca 3 444 Mark M NaN NaN Kota Kinabalu 4 555 Alex A Surgeon RM9,350 Kuala Lumpur 5 666 Peter M Banker RM5,650 Kuala Lumpur 6 777 Calvin M Actuary RM7,000 George Town 7 888 Roxane F Engineer RM7,650 Johor Bahru 8 999 Roxane F Engineer RM7,650 Johor Bahru 9 1111 Michael M Office clerk RM2,900 Kuching
1️⃣ Removing Columns and Rows
Sometimes datasets have too much information. We need to keep only what's important and remove the rest.
How to Remove Data
Use the df.drop() command with these options:
- index=[index_names] - Remove specific rows by their index number
- columns=[column_names] - Remove specific columns by their name
- inplace=True/False - Whether to save changes to the original data
Examples:
Remove a row (row with index 1):
python
df.drop(index=[1])
Expected output:
outputID NAME GENDER OCCUPATION SALARY CITY 0 111 John M Engineer RM3,700 Kuala Lumpur 2 333 Alice F Accountant RM4,500 Malacca <------ 3 444 Mark M NaN NaN Kota Kinabalu 4 555 Alex A Surgeon RM9,350 Kuala Lumpur 5 666 Peter M Banker RM5,650 Kuala Lumpur 6 777 Calvin M Actuary RM7,000 George Town 7 888 Roxane F Engineer RM7,650 Johor Bahru 8 999 Roxane F Engineer RM7,650 Johor Bahru 9 1111 Michael M Office clerk RM2,900 Kuching
Remove a column (the "GENDER" column) and save the changes:
python
df.drop(columns=['GENDER'], inplace=True)
Expected output:
output| | ▼ ID NAME OCCUPATION SALARY CITY 0 111 John Engineer RM3,700 Kuala Lumpur 1 222 Mery Businessman RM6,900 George Town 2 333 Alice Accountant RM4,500 Malacca 3 444 Mark NaN NaN Kota Kinabalu 4 555 Alex Surgeon RM9,350 Kuala Lumpur 5 666 Peter Banker RM5,650 Kuala Lumpur 6 777 Calvin Actuary RM7,000 George Town 7 888 Roxane Engineer RM7,650 Johor Bahru 8 999 Roxane Engineer RM7,650 Johor Bahru 9 1111 Michael Office clerk RM2,900 Kuching
2️⃣ Finding Duplicate Data :mag:
When we combine different datasets, we often get duplicate (repeated) information. Let's learn how to find these duplicates!
Check if Data is Unique
Use df[column_name].is_unique to check for duplicates:
- Returns True = No duplicates :white_check_mark:
- Returns False = Has duplicates :x:
Example: Check if all cities are unique:
python
df["CITY"].is_unique
The output is False because some cities appear more than once (like Kuala Lumpur).
See All Unique Values
Use df[column_name].unique() to see all unique values in a column.
Example: See all unique cities:
python
df["CITY"].unique()
Expected output:
outputarray(['Kuala Lumpur', 'George Town', 'Malacca', 'Kota Kinabalu', 'Johor Bahru', 'Kuching'], dtype=object)
3️⃣ Removing Duplicates :emoji:️
Now let's remove those duplicate rows!
The drop_duplicates() Command
Use df.drop_duplicates() with these options:
- subset=[column_names] - Which columns to check for duplicates
- keep="first" or "last" - Keep the first or last duplicate (default is "first")
- inplace=True/False - Save changes to original data
Examples:
Remove duplicate names (keeping the first occurrence):
python
df.drop_duplicates(subset=["NAME"])
Expected output:
outputID NAME GENDER OCCUPATION SALARY CITY 0 111 John M Engineer RM3,700 Kuala Lumpur 1 222 Mery F Businessman RM6,900 George Town 2 333 Alice F Accountant RM4,500 Malacca 3 444 Mark M NaN NaN Kota Kinabalu 4 555 Alex A Surgeon RM9,350 Kuala Lumpur 5 666 Peter M Banker RM5,650 Kuala Lumpur 6 777 Calvin M Actuary RM7,000 George Town 7 888 Roxane F Engineer RM7,650 Johor Bahru <----- 9 1111 Michael M Office clerk RM2,900 Kuching
Remove duplicate names (keeping the last occurrence):
python
df.drop_duplicates(subset=["NAME"], keep="last")
Expected output:
outputID NAME GENDER OCCUPATION SALARY CITY 0 111 John M Engineer RM3,700 Kuala Lumpur 1 222 Mery F Businessman RM6,900 George Town 2 333 Alice F Accountant RM4,500 Malacca 3 444 Mark M NaN NaN Kota Kinabalu 4 555 Alex A Surgeon RM9,350 Kuala Lumpur 5 666 Peter M Banker RM5,650 Kuala Lumpur 6 777 Calvin M Actuary RM7,000 George Town 8 999 Roxane F Engineer RM7,650 Johor Bahru < ------- 9 1111 Michael M Office clerk RM2,900 Kuching
Remove duplicates and save changes:
python
df.drop_duplicates(subset=["NAME"], inplace=True)
4️⃣ Finding Missing Values :emoji:️
Datasets often have empty cells or missing information. These appear as:
- "na"
- "n/a"
- "NaN" (Not a Number)
- "NA" (Not Available) tip Pandas only recognizes "NA" and "NaN" as missing values. Other formats need to be converted first!
Use df.isna() to find missing values:
Example:
df.isna()
Expected output:
ID NAME GENDER OCCUPATION SALARY CITY
0 False False False False False False
1 False False False False False False
2 False False False False False False
3 False False False True True False
4 False False False False False False
5 False False False False False False
6 False False False False False False
7 False False False False False False
8 False False False False False False
9 False False False False False False
Use df.isna().sum() to count missing values in each column. Any number greater than 0 means that column has missing data.
Example:
df.isna().sum()
Expected output:
ID 0
NAME 0
GENDER 0
OCCUPATION 1
SALARY 1
CITY 0
dtype: int64
Now let's remove rows that have missing data!
Use df.dropna() with these options:
Remove rows with missing salary data:
df.dropna(subset=["SALARY"])
Expected output:
ID NAME GENDER OCCUPATION SALARY CITY
0 111 John M Engineer RM3,700 Kuala Lumpur
1 222 Mery F Businessman RM6,900 George Town
2 333 Alice F Accountant RM4,500 Malacca <-----
4 555 Alex A Surgeon RM9,350 Kuala Lumpur
5 666 Peter M Banker RM5,650 Kuala Lumpur
6 777 Calvin M Actuary RM7,000 George Town
7 888 Roxane F Engineer RM7,650 Johor Bahru
8 999 Roxane F Engineer RM7,650 Johor Bahru
9 1111 Michael M Office clerk RM2,900 Kuching
Notice that row 3 disappeared because it had a missing salary value!
Remove missing values and save changes:
df.dropna(subset=["SALARY"], inplace=True)
Sometimes data isn't in a format computers can understand. For example:
Use df[column_name].str.replace(old_text, new_text) to fix formatting.
Example One: Remove "RM" from salary data:
df["SALARY"] = df["SALARY"].str.replace("RM", "")
Expected output:
| |
▼
ID NAME GENDER OCCUPATION SALARY CITY
0 111 John M Engineer 3,700 Kuala Lumpur
1 222 Mery F Businessman 6,900 George Town
2 333 Alice F Accountant 4,500 Malacca
4 555 Alex A Surgeon 9,350 Kuala Lumpur
5 666 Peter M Banker 5,650 Kuala Lumpur
6 777 Calvin M Actuary 7,000 George Town
7 888 Roxane F Engineer 7,650 Johor Bahru
8 999 Roxane F Engineer 7,650 Johor Bahru
9 1111 Michael M Office clerk 2,900 Kuching
Example 2: Remove commas from salary data:
df["SALARY"] = df["SALARY"].str.replace(",", "")
Expected output:
| |
▼
ID NAME GENDER OCCUPATION SALARY CITY
0 111 John M Engineer 3700 Kuala Lumpur
1 222 Mery F Businessman 6900 George Town
2 333 Alice F Accountant 4500 Malacca
4 555 Alex A Surgeon 9350 Kuala Lumpur
5 666 Peter M Banker 5650 Kuala Lumpur
6 777 Calvin M Actuary 7000 George Town
7 888 Roxane F Engineer 7650 Johor Bahru
8 999 Roxane F Engineer 7650 Johor Bahru
9 1111 Michael M Office clerk 2900 Kuching
:memo: After removing text, the data is still stored as text (string). We need to convert it to numbers for calculations!
Use df[column_name].astype(type) to change data types.
Example: Convert salary to numbers (integers):
python
df["SALARY"] = df["SALARY"].astype(int)
:memo: Summary
In this lesson, we learned how to clean messy data using Pandas:
- Remove unwanted data with
df.drop()
- Find duplicates with
df[column].is_unique
anddf[column].unique()
- Remove duplicates with
df.drop_duplicates()
- Find missing values with
df.isna()
anddf.isna().sum()
- Remove missing values with
df.dropna()
- Fix formatting with
df[column].str.replace()
- Convert data types with
df[column].astype()
Clean data leads to accurate analysis! :dart:
Video
AI Prompt
Data Cleaning and Preprocessing with Pandas and Prompting
Code with AI: Try using AI for data cleaning tasks.
Prompts:
- "Write Python code to remove missing values from a Pandas DataFrame."
- "Generate code to handle duplicate rows in a DataFrame."
:dart: Practice Time!
Now it's your turn to practice data cleaning:
- Download the practice dataset from the link provided above
- Try each cleaning method we learned:
- Remove the "ID" column (we don't need it for analysis)
- Check for duplicate names and remove them
- Find and remove rows with missing occupation data
- Clean the salary column by removing "RM" and commas, then convert to numbers
- Challenge yourself: Can you clean the entire dataset in less than 10 lines of code? tip Remember: Always check your data before and after cleaning to make sure you didn't accidentally remove important information!