Demo Mode

Lesson 17 of 20

Concept 17: Data Cleaning with Pandas

What is Data Cleaning

:dart: Learning Objectives

By the end of this lesson, you will be able to:

Understand what data cleaning is and why it's important
Remove unwanted columns and rows from datasets
Find and remove duplicate data
Handle missing values in your data
Convert data into formats that computers can understand

:information_source: Definition: Data cleaning is the process of finding and fixing incorrect data in datasets. Think of it like proofreading and editing a document, but for data!

Why Do We Need Data Cleaning? :emoji:

Datasets often contain:

Incorrectly formatted data (like "3,000" instead of 3000)
Duplicate data (the same information appears multiple times)
Incomplete data (missing values)

Data cleaning helps us fix these problems to get accurate results!

Data Cleaning Methods with Pandas :hammer_and_wrench:

We'll learn 4 powerful methods to clean data:

Removing Columns and Rows - Get rid of information you don't need
Removing Duplicates - Delete repeated data
Removing Missing Values - Handle empty cells
Replacing and Converting Data - Fix formatting issues

Our Practice Dataset :bar_chart:

We'll use salary_data.csv to practice data cleaning. You can download it here.

This dataset contains information about 10 workers including:

ID number
Name
Occupation (job)
Gender
Salary
City

:memo: Notice that our dataset has some problems! It contains missing values (NaN), invalid data, and duplicates. Let's learn how to fix these issues!

output
    ID    NAME GENDER  OCCUPATION   SALARY          CITY
0  111    John      M    Engineer   RM3,700   Kuala Lumpur
2  333   Alice      F  Accountant   RM4,500        Malacca
3  444    Mark      M         NaN       NaN  Kota Kinabalu
4  555    Alex      A     Surgeon   RM9,350   Kuala Lumpur
5  666   Peter      M      Banker   RM5,650   Kuala Lumpur
6  777  Calvin      M     Actuary   RM7,000    George Town
7  888  Roxane      F    Engineer   RM7,650    Johor Bahru
8  999  Roxane      F    Engineer   RM7,650    Johor Bahru
9 1111 Michael      M Office clerk  RM2,900        Kuching
1️⃣ Removing Columns and Rows

Sometimes datasets have too much information. We need to keep only what's important and remove the rest.

How to Remove Data

Use the df.drop() command with these options:

index=[index_names] - Remove specific rows by their index number

columns=[column_names] - Remove specific columns by their name

inplace=True/False - Whether to save changes to the original data

Examples:

Remove a row (row with index 1):
python
df.drop(index=[1])
Expected output:
output
    ID    NAME GENDER  OCCUPATION   SALARY          CITY
0  111    John      M    Engineer   RM3,700   Kuala Lumpur
2  333   Alice      F  Accountant   RM4,500        Malacca   <------
3  444    Mark      M         NaN       NaN  Kota Kinabalu
4  555    Alex      A     Surgeon   RM9,350   Kuala Lumpur
5  666   Peter      M      Banker   RM5,650   Kuala Lumpur
6  777  Calvin      M     Actuary   RM7,000    George Town
7  888  Roxane      F    Engineer   RM7,650    Johor Bahru
8  999  Roxane      F    Engineer   RM7,650    Johor Bahru
9 1111 Michael      M Office clerk  RM2,900        Kuching
Remove a column (the "GENDER" column) and save the changes:
python
df.drop(columns=['GENDER'], inplace=True)
Expected output:
output
              | |
              ▼
    ID   NAME  OCCUPATION   SALARY          CITY
0  111   John    Engineer  RM3,700  Kuala Lumpur
1  222   Mery  Businessman  RM6,900   George Town
2  333  Alice  Accountant  RM4,500       Malacca
3  444   Mark         NaN      NaN  Kota Kinabalu
4  555   Alex     Surgeon  RM9,350  Kuala Lumpur
5  666  Peter      Banker  RM5,650  Kuala Lumpur
6  777 Calvin     Actuary  RM7,000   George Town
7  888 Roxane    Engineer  RM7,650    Johor Bahru
8  999 Roxane    Engineer  RM7,650    Johor Bahru
9 1111 Michael Office clerk  RM2,900       Kuching
2️⃣ Finding Duplicate Data :mag:

When we combine different datasets, we often get duplicate (repeated) information. Let's learn how to find these duplicates!

Check if Data is Unique

Use df[column_name].is_unique to check for duplicates:

Returns True = No duplicates :white_check_mark:

Returns False = Has duplicates :x:

Example: Check if all cities are unique:
python
df["CITY"].is_unique
The output is False because some cities appear more than once (like Kuala Lumpur).

See All Unique Values

Use df[column_name].unique() to see all unique values in a column.

Example: See all unique cities:
python
df["CITY"].unique()
Expected output:
output
array(['Kuala Lumpur', 'George Town', 'Malacca', 'Kota Kinabalu',
       'Johor Bahru', 'Kuching'], dtype=object)
3️⃣ Removing Duplicates :emoji:️

Now let's remove those duplicate rows!

The drop_duplicates() Command

Use df.drop_duplicates() with these options:

subset=[column_names] - Which columns to check for duplicates

keep="first" or "last" - Keep the first or last duplicate (default is "first")

inplace=True/False - Save changes to original data

Examples:

Remove duplicate names (keeping the first occurrence):
python
df.drop_duplicates(subset=["NAME"])
Expected output:
output
    ID   NAME  GENDER  OCCUPATION   SALARY          CITY
0  111    John      M    Engineer   RM3,700  Kuala Lumpur
1  222    Mery      F  Businessman  RM6,900   George Town
2  333   Alice      F  Accountant   RM4,500       Malacca
3  444    Mark      M         NaN      NaN  Kota Kinabalu
4  555    Alex      A     Surgeon   RM9,350  Kuala Lumpur
5  666   Peter      M      Banker   RM5,650  Kuala Lumpur
6  777  Calvin      M     Actuary   RM7,000   George Town
7  888  Roxane      F    Engineer   RM7,650   Johor Bahru <-----
9 1111 Michael      M Office clerk  RM2,900       Kuching
Remove duplicate names (keeping the last occurrence):
python
df.drop_duplicates(subset=["NAME"], keep="last")
Expected output:
output
    ID   NAME  GENDER  OCCUPATION    SALARY          CITY
0  111    John      M    Engineer   RM3,700  Kuala Lumpur
1  222    Mery      F  Businessman  RM6,900   George Town
2  333   Alice      F  Accountant   RM4,500       Malacca
3  444    Mark      M         NaN      NaN  Kota Kinabalu
4  555    Alex      A     Surgeon   RM9,350  Kuala Lumpur
5  666   Peter      M      Banker   RM5,650  Kuala Lumpur
6  777  Calvin      M     Actuary   RM7,000   George Town
8  999  Roxane      F    Engineer   RM7,650   Johor Bahru < -------
9 1111 Michael      M Office clerk  RM2,900       Kuching
Remove duplicates and save changes:
python
df.drop_duplicates(subset=["NAME"], inplace=True)
4️⃣ Finding Missing Values :emoji:️

Datasets often have empty cells or missing information. These appear as:

"na"

"n/a"

"NaN" (Not a Number)

"NA" (Not Available) tip Pandas only recognizes "NA" and "NaN" as missing values. Other formats need to be converted first!

Check for Missing Data

Use df.isna() to find missing values:

Returns True for missing data
Returns False for complete data

Example:

python

df.isna()

Expected output:

output

      ID   NAME GENDER OCCUPATION SALARY   CITY
0  False  False  False     False  False  False
1  False  False  False     False  False  False
2  False  False  False     False  False  False
3  False  False  False      True   True  False
4  False  False  False     False  False  False
5  False  False  False     False  False  False
6  False  False  False     False  False  False
7  False  False  False     False  False  False
8  False  False  False     False  False  False
9  False  False  False     False  False  False

Count Missing Values

Use df.isna().sum() to count missing values in each column. Any number greater than 0 means that column has missing data.

Example:

python

df.isna().sum()

Expected output:

output

ID            0
NAME          0
GENDER        0
OCCUPATION    1
SALARY        1
CITY          0
dtype: int64

5️⃣ Removing Missing Values :emoji:

Now let's remove rows that have missing data!

The dropna() Command

Use df.dropna() with these options:

subset=[column_names] - Which columns to check for missing values
inplace=True/False - Save changes to original data

Examples:

Remove rows with missing salary data:

python

df.dropna(subset=["SALARY"])

Expected output:

output

    ID   NAME GENDER  OCCUPATION   SALARY          CITY
0  111   John     M    Engineer   RM3,700  Kuala Lumpur
1  222   Mery     F  Businessman  RM6,900   George Town
2  333  Alice     F  Accountant   RM4,500       Malacca <-----
4  555   Alex     A     Surgeon   RM9,350  Kuala Lumpur
5  666  Peter     M      Banker   RM5,650  Kuala Lumpur
6  777 Calvin     M     Actuary   RM7,000   George Town
7  888 Roxane     F    Engineer   RM7,650    Johor Bahru
8  999 Roxane     F    Engineer   RM7,650    Johor Bahru
9 1111 Michael    M Office clerk  RM2,900       Kuching

Notice that row 3 disappeared because it had a missing salary value!

Remove missing values and save changes:

python

df.dropna(subset=["SALARY"], inplace=True)

6️⃣ Replacing and Converting Data :emoji:

Sometimes data isn't in a format computers can understand. For example:

"3,000" instead of 3000
"RM5,000" instead of 5000
"3.4M" instead of 3400000

Replace Text in Data

Use df[column_name].str.replace(old_text, new_text) to fix formatting.

Example One: Remove "RM" from salary data:

python

df["SALARY"] = df["SALARY"].str.replace("RM", "")

Expected output:

output

                                     | |
                                     ▼
    ID   NAME   GENDER  OCCUPATION SALARY      CITY
0  111    John      M     Engineer  3,700  Kuala Lumpur
1  222    Mery      F  Businessman  6,900   George Town
2  333   Alice      F   Accountant  4,500       Malacca
4  555    Alex      A      Surgeon  9,350  Kuala Lumpur
5  666   Peter      M       Banker  5,650  Kuala Lumpur
6  777  Calvin      M      Actuary  7,000   George Town
7  888  Roxane      F     Engineer  7,650   Johor Bahru
8  999  Roxane      F     Engineer  7,650   Johor Bahru
9 1111 Michael      M Office clerk  2,900       Kuching

Example 2: Remove commas from salary data:

python

df["SALARY"] = df["SALARY"].str.replace(",", "")

Expected output:

output

                                     | |
                                     ▼
    ID   NAME   GENDER  OCCUPATION SALARY      CITY
0  111    John      M     Engineer  3700  Kuala Lumpur
1  222    Mery      F  Businessman  6900   George Town
2  333   Alice      F   Accountant  4500       Malacca
4  555    Alex      A      Surgeon  9350  Kuala Lumpur
5  666   Peter      M       Banker  5650  Kuala Lumpur
6  777  Calvin      M      Actuary  7000   George Town
7  888  Roxane      F     Engineer  7650   Johor Bahru
8  999  Roxane      F     Engineer  7650   Johor Bahru
9 1111 Michael      M Office clerk  2900       Kuching

Convert Data Types

:memo: After removing text, the data is still stored as text (string). We need to convert it to numbers for calculations!

Use df[column_name].astype(type) to change data types.

Example: Convert salary to numbers (integers):
python
df["SALARY"] = df["SALARY"].astype(int)
:memo: Summary

In this lesson, we learned how to clean messy data using Pandas:

Remove unwanted data with df.drop()

Find duplicates with df[column].is_unique and df[column].unique()

Remove duplicates with df.drop_duplicates()

Find missing values with df.isna() and df.isna().sum()

Remove missing values with df.dropna()

Fix formatting with df[column].str.replace()

Convert data types with df[column].astype()

Clean data leads to accurate analysis! :dart:

Video

AI Prompt

Data Cleaning and Preprocessing with Pandas and Prompting

Code with AI: Try using AI for data cleaning tasks.

Prompts:

"Write Python code to remove missing values from a Pandas DataFrame."

"Generate code to handle duplicate rows in a DataFrame."

:dart: Practice Time!

Now it's your turn to practice data cleaning:

Download the practice dataset from the link provided above

Try each cleaning method we learned:

Remove the "ID" column (we don't need it for analysis)

Check for duplicate names and remove them

Find and remove rows with missing occupation data

Clean the salary column by removing "RM" and commas, then convert to numbers

Challenge yourself: Can you clean the entire dataset in less than 10 lines of code? tip Remember: Always check your data before and after cleaning to make sure you didn't accidentally remove important information!

Lesson 17 of 20

Concept 17: Data Cleaning with Pandas

What is Data Cleaning

:dart: Learning Objectives

By the end of this lesson, you will be able to:

Understand what data cleaning is and why it's important
Remove unwanted columns and rows from datasets
Find and remove duplicate data
Handle missing values in your data
Convert data into formats that computers can understand

:information_source: Definition: Data cleaning is the process of finding and fixing incorrect data in datasets. Think of it like proofreading and editing a document, but for data!

Why Do We Need Data Cleaning? :emoji:

Datasets often contain:

Incorrectly formatted data (like "3,000" instead of 3000)
Duplicate data (the same information appears multiple times)
Incomplete data (missing values)

Data cleaning helps us fix these problems to get accurate results!

Data Cleaning Methods with Pandas :hammer_and_wrench:

We'll learn 4 powerful methods to clean data:

Removing Columns and Rows - Get rid of information you don't need
Removing Duplicates - Delete repeated data
Removing Missing Values - Handle empty cells
Replacing and Converting Data - Fix formatting issues

Our Practice Dataset :bar_chart:

We'll use salary_data.csv to practice data cleaning. You can download it here.

This dataset contains information about 10 workers including:

ID number
Name
Occupation (job)
Gender
Salary
City

:memo: Notice that our dataset has some problems! It contains missing values (NaN), invalid data, and duplicates. Let's learn how to fix these issues!

output
    ID    NAME GENDER  OCCUPATION   SALARY          CITY
0  111    John      M    Engineer   RM3,700   Kuala Lumpur
2  333   Alice      F  Accountant   RM4,500        Malacca
3  444    Mark      M         NaN       NaN  Kota Kinabalu
4  555    Alex      A     Surgeon   RM9,350   Kuala Lumpur
5  666   Peter      M      Banker   RM5,650   Kuala Lumpur
6  777  Calvin      M     Actuary   RM7,000    George Town
7  888  Roxane      F    Engineer   RM7,650    Johor Bahru
8  999  Roxane      F    Engineer   RM7,650    Johor Bahru
9 1111 Michael      M Office clerk  RM2,900        Kuching
1️⃣ Removing Columns and Rows

Sometimes datasets have too much information. We need to keep only what's important and remove the rest.

How to Remove Data

Use the df.drop() command with these options:

index=[index_names] - Remove specific rows by their index number

columns=[column_names] - Remove specific columns by their name

inplace=True/False - Whether to save changes to the original data

Examples:

Remove a row (row with index 1):
python
df.drop(index=[1])
Expected output:
output
    ID    NAME GENDER  OCCUPATION   SALARY          CITY
0  111    John      M    Engineer   RM3,700   Kuala Lumpur
2  333   Alice      F  Accountant   RM4,500        Malacca   <------
3  444    Mark      M         NaN       NaN  Kota Kinabalu
4  555    Alex      A     Surgeon   RM9,350   Kuala Lumpur
5  666   Peter      M      Banker   RM5,650   Kuala Lumpur
6  777  Calvin      M     Actuary   RM7,000    George Town
7  888  Roxane      F    Engineer   RM7,650    Johor Bahru
8  999  Roxane      F    Engineer   RM7,650    Johor Bahru
9 1111 Michael      M Office clerk  RM2,900        Kuching
Remove a column (the "GENDER" column) and save the changes:
python
df.drop(columns=['GENDER'], inplace=True)
Expected output:
output
              | |
              ▼
    ID   NAME  OCCUPATION   SALARY          CITY
0  111   John    Engineer  RM3,700  Kuala Lumpur
1  222   Mery  Businessman  RM6,900   George Town
2  333  Alice  Accountant  RM4,500       Malacca
3  444   Mark         NaN      NaN  Kota Kinabalu
4  555   Alex     Surgeon  RM9,350  Kuala Lumpur
5  666  Peter      Banker  RM5,650  Kuala Lumpur
6  777 Calvin     Actuary  RM7,000   George Town
7  888 Roxane    Engineer  RM7,650    Johor Bahru
8  999 Roxane    Engineer  RM7,650    Johor Bahru
9 1111 Michael Office clerk  RM2,900       Kuching
2️⃣ Finding Duplicate Data :mag:

When we combine different datasets, we often get duplicate (repeated) information. Let's learn how to find these duplicates!

Check if Data is Unique

Use df[column_name].is_unique to check for duplicates:

Returns True = No duplicates :white_check_mark:

Returns False = Has duplicates :x:

Example: Check if all cities are unique:
python
df["CITY"].is_unique
The output is False because some cities appear more than once (like Kuala Lumpur).

See All Unique Values

Use df[column_name].unique() to see all unique values in a column.

Example: See all unique cities:
python
df["CITY"].unique()
Expected output:
output
array(['Kuala Lumpur', 'George Town', 'Malacca', 'Kota Kinabalu',
       'Johor Bahru', 'Kuching'], dtype=object)
3️⃣ Removing Duplicates :emoji:️

Now let's remove those duplicate rows!

The drop_duplicates() Command

Use df.drop_duplicates() with these options:

subset=[column_names] - Which columns to check for duplicates

keep="first" or "last" - Keep the first or last duplicate (default is "first")

inplace=True/False - Save changes to original data

Examples:

Remove duplicate names (keeping the first occurrence):
python
df.drop_duplicates(subset=["NAME"])
Expected output:
output
    ID   NAME  GENDER  OCCUPATION   SALARY          CITY
0  111    John      M    Engineer   RM3,700  Kuala Lumpur
1  222    Mery      F  Businessman  RM6,900   George Town
2  333   Alice      F  Accountant   RM4,500       Malacca
3  444    Mark      M         NaN      NaN  Kota Kinabalu
4  555    Alex      A     Surgeon   RM9,350  Kuala Lumpur
5  666   Peter      M      Banker   RM5,650  Kuala Lumpur
6  777  Calvin      M     Actuary   RM7,000   George Town
7  888  Roxane      F    Engineer   RM7,650   Johor Bahru <-----
9 1111 Michael      M Office clerk  RM2,900       Kuching
Remove duplicate names (keeping the last occurrence):
python
df.drop_duplicates(subset=["NAME"], keep="last")
Expected output:
output
    ID   NAME  GENDER  OCCUPATION    SALARY          CITY
0  111    John      M    Engineer   RM3,700  Kuala Lumpur
1  222    Mery      F  Businessman  RM6,900   George Town
2  333   Alice      F  Accountant   RM4,500       Malacca
3  444    Mark      M         NaN      NaN  Kota Kinabalu
4  555    Alex      A     Surgeon   RM9,350  Kuala Lumpur
5  666   Peter      M      Banker   RM5,650  Kuala Lumpur
6  777  Calvin      M     Actuary   RM7,000   George Town
8  999  Roxane      F    Engineer   RM7,650   Johor Bahru < -------
9 1111 Michael      M Office clerk  RM2,900       Kuching
Remove duplicates and save changes:
python
df.drop_duplicates(subset=["NAME"], inplace=True)
4️⃣ Finding Missing Values :emoji:️

Datasets often have empty cells or missing information. These appear as:

"na"

"n/a"

"NaN" (Not a Number)

"NA" (Not Available) tip Pandas only recognizes "NA" and "NaN" as missing values. Other formats need to be converted first!

Check for Missing Data

Use df.isna() to find missing values:

Returns True for missing data
Returns False for complete data

Example:

python

df.isna()

Expected output:

output

      ID   NAME GENDER OCCUPATION SALARY   CITY
0  False  False  False     False  False  False
1  False  False  False     False  False  False
2  False  False  False     False  False  False
3  False  False  False      True   True  False
4  False  False  False     False  False  False
5  False  False  False     False  False  False
6  False  False  False     False  False  False
7  False  False  False     False  False  False
8  False  False  False     False  False  False
9  False  False  False     False  False  False

Count Missing Values

Use df.isna().sum() to count missing values in each column. Any number greater than 0 means that column has missing data.

Example:

python

df.isna().sum()

Expected output:

output

ID            0
NAME          0
GENDER        0
OCCUPATION    1
SALARY        1
CITY          0
dtype: int64

5️⃣ Removing Missing Values :emoji:

Now let's remove rows that have missing data!

The dropna() Command

Use df.dropna() with these options:

subset=[column_names] - Which columns to check for missing values
inplace=True/False - Save changes to original data

Examples:

Remove rows with missing salary data:

python

df.dropna(subset=["SALARY"])

Expected output:

output

    ID   NAME GENDER  OCCUPATION   SALARY          CITY
0  111   John     M    Engineer   RM3,700  Kuala Lumpur
1  222   Mery     F  Businessman  RM6,900   George Town
2  333  Alice     F  Accountant   RM4,500       Malacca <-----
4  555   Alex     A     Surgeon   RM9,350  Kuala Lumpur
5  666  Peter     M      Banker   RM5,650  Kuala Lumpur
6  777 Calvin     M     Actuary   RM7,000   George Town
7  888 Roxane     F    Engineer   RM7,650    Johor Bahru
8  999 Roxane     F    Engineer   RM7,650    Johor Bahru
9 1111 Michael    M Office clerk  RM2,900       Kuching

Notice that row 3 disappeared because it had a missing salary value!

Remove missing values and save changes:

python

df.dropna(subset=["SALARY"], inplace=True)

6️⃣ Replacing and Converting Data :emoji:

Sometimes data isn't in a format computers can understand. For example:

"3,000" instead of 3000
"RM5,000" instead of 5000
"3.4M" instead of 3400000

Replace Text in Data

Use df[column_name].str.replace(old_text, new_text) to fix formatting.

Example One: Remove "RM" from salary data:

python

df["SALARY"] = df["SALARY"].str.replace("RM", "")

Expected output:

output

                                     | |
                                     ▼
    ID   NAME   GENDER  OCCUPATION SALARY      CITY
0  111    John      M     Engineer  3,700  Kuala Lumpur
1  222    Mery      F  Businessman  6,900   George Town
2  333   Alice      F   Accountant  4,500       Malacca
4  555    Alex      A      Surgeon  9,350  Kuala Lumpur
5  666   Peter      M       Banker  5,650  Kuala Lumpur
6  777  Calvin      M      Actuary  7,000   George Town
7  888  Roxane      F     Engineer  7,650   Johor Bahru
8  999  Roxane      F     Engineer  7,650   Johor Bahru
9 1111 Michael      M Office clerk  2,900       Kuching

Example 2: Remove commas from salary data:

python

df["SALARY"] = df["SALARY"].str.replace(",", "")

Expected output:

output

                                     | |
                                     ▼
    ID   NAME   GENDER  OCCUPATION SALARY      CITY
0  111    John      M     Engineer  3700  Kuala Lumpur
1  222    Mery      F  Businessman  6900   George Town
2  333   Alice      F   Accountant  4500       Malacca
4  555    Alex      A      Surgeon  9350  Kuala Lumpur
5  666   Peter      M       Banker  5650  Kuala Lumpur
6  777  Calvin      M      Actuary  7000   George Town
7  888  Roxane      F     Engineer  7650   Johor Bahru
8  999  Roxane      F     Engineer  7650   Johor Bahru
9 1111 Michael      M Office clerk  2900       Kuching

Convert Data Types

:memo: After removing text, the data is still stored as text (string). We need to convert it to numbers for calculations!

Use df[column_name].astype(type) to change data types.

Example: Convert salary to numbers (integers):
python
df["SALARY"] = df["SALARY"].astype(int)
:memo: Summary

In this lesson, we learned how to clean messy data using Pandas:

Remove unwanted data with df.drop()

Find duplicates with df[column].is_unique and df[column].unique()

Remove duplicates with df.drop_duplicates()

Find missing values with df.isna() and df.isna().sum()

Remove missing values with df.dropna()

Fix formatting with df[column].str.replace()

Convert data types with df[column].astype()

Clean data leads to accurate analysis! :dart:

Video

AI Prompt

Data Cleaning and Preprocessing with Pandas and Prompting

Code with AI: Try using AI for data cleaning tasks.

Prompts:

"Write Python code to remove missing values from a Pandas DataFrame."

"Generate code to handle duplicate rows in a DataFrame."

:dart: Practice Time!

Now it's your turn to practice data cleaning:

Download the practice dataset from the link provided above

Try each cleaning method we learned:

Remove the "ID" column (we don't need it for analysis)

Check for duplicate names and remove them

Find and remove rows with missing occupation data

Clean the salary column by removing "RM" and commas, then convert to numbers

Challenge yourself: Can you clean the entire dataset in less than 10 lines of code? tip Remember: Always check your data before and after cleaning to make sure you didn't accidentally remove important information!