Practice and reinforce the concepts from Lesson 17
Time estimate: 45-60 minutes
:warning: Important Always create a copy of the Colab notebook before starting. This ensures:
- Your work is saved to your Google Drive
- You don't accidentally modify the template
- You can return to your work later
- Import pandas as pd and NumPy as np
- Use pd.read_csv() to load the provided dataset into df
:bulb: If you get a "file not found" error, check that the CSV file path is correct in your Colab environment.
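A minimal sketch of the import and load step. The in-memory CSV and its column names (id, name, score) are made up for illustration so the cell runs anywhere; in Colab you would pass the path of the provided CSV file to pd.read_csv() instead.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the provided CSV file.
# In Colab, replace this with the real file path: pd.read_csv("<your file>.csv")
sample_csv = io.StringIO(
    "id,name,score\n"
    "1,Ana,90\n"
    "2,Ben,\n"
    "3,Cy,75\n"
)

df = pd.read_csv(sample_csv)

# Quick sanity check that the load worked
print(df.shape)

# Empty CSV fields are read as NaN, which NumPy can detect
missing_scores = int(np.isnan(df["score"]).sum())
print(missing_scores)
```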
Step 3: Explore Your Data (10 minutes)
- Use df.head() to view the first 5 rows
- Use df.columns to list all column names
- Use df.info() to see data types and missing values
- Use df.shape to check the dataset dimensions
- Take notes on what needs cleaning
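The exploration calls above could look like this on a small, made-up DataFrame (the columns are assumptions, not the assignment's dataset):

```python
import pandas as pd

# Hypothetical data standing in for the provided dataset
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Ana", "Ben", None],
    "score": [90.0, None, 75.0],
})

print(df.head())          # first 5 rows (only 3 exist here)
print(list(df.columns))   # column names
df.info()                 # prints dtypes and non-null counts per column
n_rows, n_cols = df.shape # shape is an attribute, not a method
print(n_rows, n_cols)
```

Comparing the non-null counts from df.info() with the row count from df.shape is a quick way to spot which columns have missing values.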
Step 4: Remove Unnecessary Columns (5 minutes)
- Identify columns that won't be used in analysis
- Use df.drop() with the column names
- Set axis=1 to indicate you're dropping columns
- Verify columns are removed with df.columns
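A short sketch of dropping columns, using hypothetical column names (notes, internal_code). Note that df.drop() returns a new DataFrame, so the result has to be assigned back.

```python
import pandas as pd

# Hypothetical data; "notes" and "internal_code" are columns we decide not to analyze
df = pd.DataFrame({
    "id": [1, 2],
    "name": ["Ana", "Ben"],
    "notes": ["x", "y"],
    "internal_code": ["a1", "b2"],
})

columns_to_drop = ["notes", "internal_code"]

# axis=1 means "drop columns"; without reassignment df would be unchanged
df = df.drop(columns_to_drop, axis=1)

print(list(df.columns))
```

Writing df.drop(columns=columns_to_drop) is an equivalent spelling that avoids the axis argument.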
Step 5: Handle Duplicate Values (5 minutes)
- Check for duplicates using df.duplicated().sum()
- View duplicate rows with df[df.duplicated()]
- Remove duplicates using df.drop_duplicates()
- Verify removal by checking the shape again
:bulb: Common Challenge Sometimes you may want to treat rows as duplicates based on certain columns only. Use the subset parameter in drop_duplicates() to specify which columns to check.
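To illustrate the difference between full-row duplicates and duplicates on selected columns, here is a sketch with made-up data where email is assumed to be the identifying column:

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 are identical; rows 2 and 3 share only the email
df = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "b@x.com", "b@x.com"],
    "name": ["Ana", "Ana", "Ben", "Benjamin"],
})

# Count rows that repeat an earlier row in every column
full_dupes = int(df.duplicated().sum())

# Inspect the duplicate rows before removing them
print(df[df.duplicated()])

# Remove exact duplicates (keeps the first occurrence)
deduped_all = df.drop_duplicates()

# Treat rows as duplicates when only "email" matches
deduped_email = df.drop_duplicates(subset=["email"])

print(df.shape, deduped_all.shape, deduped_email.shape)
```

By default the first occurrence is kept; the keep parameter of drop_duplicates() changes that.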
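- Check for missing values with df.isnull().sum()
- Drop rows that contain missing values with df.dropna()
- Fill missing numeric values with the column mean using df.fillna(df.mean(numeric_only=True))
- Fill missing categorical values with the most frequent value using df.fillna(df.mode().iloc[0])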
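One way to apply the missing-value calls, sketched on made-up data (age and city are assumed column names). The fills are done per column here because the mean only makes sense for numeric data, and because mode() returns a Series of candidates from which the first value must be picked.

```python
import pandas as pd

# Hypothetical data with one missing value in each column
df = pd.DataFrame({
    "age": [20.0, None, 40.0],
    "city": ["Lima", "Lima", None],
})

# Count missing values per column
missing_counts = df.isnull().sum()
print(missing_counts)

# Option 1: drop every row that has at least one missing value
dropped = df.dropna()

# Option 2: fill instead of dropping
filled = df.copy()
# Numeric column: fill with the mean of the non-missing values
filled["age"] = filled["age"].fillna(filled["age"].mean())
# Categorical column: fill with the most frequent value
filled["city"] = filled["city"].fillna(filled["city"].mode().iloc[0])

print(filled)
```

Whether to drop or fill depends on how much data is missing and how the column will be used, so it is worth noting your choice in a comment.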
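- Check the current data types with df.dtypes
- Convert numeric columns with pd.to_numeric()
- Convert date columns with pd.to_datetime()
- Convert categorical columns with .astype('category')
- Verify the changes with df.dtypes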
:bulb: Tip Use errors='coerce' in conversion functions so that invalid values become NaN (or NaT for dates) instead of raising an error.
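The conversions might be sketched as follows. The columns (price, date, grade) and their deliberately invalid entries are invented to show what errors='coerce' does.

```python
import pandas as pd

# Hypothetical data: every column starts out as text, with some bad entries
df = pd.DataFrame({
    "price": ["10.5", "abc", "7"],
    "date": ["2024-01-15", "not a date", "2024-03-01"],
    "grade": ["A", "B", "A"],
})

print(df.dtypes)  # before conversion

# Invalid numbers become NaN instead of raising an error
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Invalid dates become NaT instead of raising an error
df["date"] = pd.to_datetime(df["date"], errors="coerce")

# Repeated labels are stored compactly as a categorical type
df["grade"] = df["grade"].astype("category")

print(df.dtypes)  # after conversion
```

Because coerced values turn into missing values, it is worth re-running df.isnull().sum() after converting.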
- Export the cleaned dataset with df.to_csv('cleaned_data.csv', index=False)
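A small sketch of the export, followed by a read-back check. It writes to a temporary directory (an assumption made so the example leaves no files behind); in Colab you could write 'cleaned_data.csv' directly to the working directory.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned data
df = pd.DataFrame({"id": [1, 2], "name": ["Ana", "Ben"]})

# Temporary location used only for this sketch
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "cleaned_data.csv")

# index=False keeps the row index out of the file
df.to_csv(out_path, index=False)

# Read the file back to confirm it was written as expected
reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```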
Problem: "No module named pandas" error
Run !pip install pandas in a cell first
Problem: Data types not converting properly
Use errors='coerce' to handle problematic values
Problem: Memory errors with large datasets
Use the chunksize parameter when reading the CSV, or work with a smaller sample using df.sample()
Problem: Cleaned data not saving
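For the memory problem with large datasets listed above, reading in chunks and sampling could be sketched like this. The tiny in-memory CSV and the chunk size of 4 are placeholders; a real file would use a much larger chunk size.

```python
import io

import pandas as pd

# Hypothetical CSV with a single "value" column holding 0..9
csv_text = "value\n" + "\n".join(str(i) for i in range(10)) + "\n"

# With chunksize set, read_csv yields DataFrames of at most that many rows
total = 0
n_chunks = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += int(chunk["value"].sum())  # process each piece, keep only the result
    n_chunks += 1

print(total, n_chunks)

# Alternative: explore a random subset; random_state makes the sample repeatable
small = pd.read_csv(io.StringIO(csv_text)).sample(n=3, random_state=0)
print(small.shape)
```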
:warning: Before You Submit
:white_check_mark: Ensure all code cells have been run successfully
:white_check_mark: Your cleaned dataset is exported and downloadable
:white_check_mark: You've made a copy of the Colab notebook
:white_check_mark: All steps are completed with comments explaining your approach
:information_source: Submission Checklist
- Colab notebook link
- Cleaned CSV file
- Brief summary of cleaning steps taken
- Any challenges faced and how you solved them