By the end of this lesson, you will be able to:
:information_source: Data manipulation is the process of adjusting data to make it organized and easier to read. It involves rearranging, restructuring, and transforming data to fulfill different needs.
:memo: You can analyze data in two main ways:
- Individual analysis: Finding specific items (e.g., "Who is the best scorer in the NBA?")
- Group analysis: Finding patterns in categories (e.g., "What's the most popular type of mobile app?")
Pandas is a powerful Python library that makes data manipulation easy and intuitive.
In this lesson, we'll master three powerful methods:
Sorting by values :chart_with_upwards_trend:
Grouping data :card_index_dividers:
Filtering data :mag:
We'll use the Olympics dataset to practice these techniques. This dataset contains information about 12,000 Olympic athletes!
:bulb: :emoji: Download the dataset here to follow along with the examples.
The dataset includes these columns:
Sex Age Height Weight Team Year Season Sport Medal M 28 184.0 85.0 Finland 2014 Winter Ice Hockey Bronze M 30 176.0 85.0 Norway 2002 Winter Alpine Skiing Gold M 30 176.0 85.0 Norway 2002 Winter Alpine Skiing Gold M 34 176.0 85.0 Norway 2006 Winter Alpine Skiing Gold M 21 198.0 90.0 Italy 2016 Summer Rowing Bronze M 30 194.0 87.0 Spain 2008 Summer Fencing Bronze F 21 165.0 49.0 Azerbaijan 2016 Summer Taekwondo Bronze F 22 179.0 80.0 Russia 2008 Summer Athletics Silver M 28 180.0 83.0 Belarus 2008 Summer Canoeing Gold M 23 182.0 86.0 France 2008 Summer Handball Gold :chart_with_upwards_trend: Sorting by Values
Most datasets start out randomly arranged. Sorting helps you find patterns and outliers quickly!
Basic Syntax
pythondf.sort_values([column_names])
Key Parameters:
- column_names: Which column(s) to sort by
- ascending:
True
(default) = smallest to largestFalse
= largest to smallest- inplace:
True
= modify the original dataFalse
(default) = create a sorted copy:emoji: Example One: Sort by Height (Ascending)
Find the shortest athletes first:
python
df.sort_values("Height")
Expected output:
text| | ▼ Sex Age Height Weight Team Year Season Sport Medal 11206 F 15 137.0 28.0 China 2008 Summer Diving Gold 11205 F 15 137.0 28.0 China 2008 Summer Diving Bronze 967 F 16 139.0 32.0 Romania 2000 Summer Gymnastics Gold 548 F 18 139.0 36.0 Romania 2004 Summer Gymnastics Gold 4649 F 16 140.0 32.0 China 2008 Summer Gymnastics Gold
:emoji: Example 2: Sort by Height (Descending)
Find the tallest athletes first:
python
df.sort_values("Height", ascending = False)
OR
python
df.sort_values("Height", ascending = 0)
Expected output:
text| | ▼ Sex Age Height Weight Team Year Season Sport Medal 7217 M 23 219.0 104.0 Russia 2012 Summer Volleyball Gold 12032 M 27 218.0 115.0 Lithuania 2000 Summer Basketball Bronze 11290 M 23 218.0 110.0 France 2000 Summer Basketball Silver 4932 M 28 217.0 102.0 Russia 2004 Summer Volleyball Bronze 4931 M 24 217.0 102.0 Russia 2000 Summer Volleyball Silver
:emoji: Example 3: Sort by Multiple Columns
First sort by age (oldest first), then by height (shortest to tallest):
python
df.sort_values(["Age", "Height"], ascending = [0,1])
Expected output:
text| | | | ▼ ▼ Sex Age Height Weight Team Year Season Sport Medal 6905 M 61 185.0 76.0 Canada 2008 Summer Equestrianism Silver 9574 M 58 175.0 76.0 Great Britain 2016 Summer Equestrianism Gold 10453 M 56 190.0 78.0 New Zealand 2012 Summer Equestrianism Bronze 1756 M 55 188.0 82.0 Canada 2008 Summer Equestrianism Silver 9573 M 54 175.0 76.0 Great Britain 2012 Summer Equestrianism Gold ... .. ... ... ... ... ... ... ... ... 6783 F 15 176.0 75.0 Lithuania 2012 Summer Swimming Gold 4732 F 15 177.0 68.0 Australia 2000 Summer Swimming Silver 4733 F 15 177.0 68.0 Australia 2000 Summer Swimming Silver 5818 F 15 183.0 73.0 United States 2012 Summer Swimming Gold 3681 M 15 185.0 77.0 Hungary 2004 Summer Swimming Silver
:card_index_dividers: Grouping Data
Grouping helps you see patterns by category. Think of it like sorting your toys into different boxes!
Basic Syntax
python
df.groupby([column_names]).type()
Three Types of Group Analysis:
- count() :bar_chart:
- Counts how many items in each group
- Example: "How many athletes from each country?"
- sum() :emoji:
- Adds up all values in each group
- Example: "Total medals per country"
- mean() :chart_with_upwards_trend:
- Finds the average for each group
- Example: "Average height per sport"
Example: Group Athletes by Country
:bar_chart: Count Athletes per Country
python
df.groupby("Team").count()
Expected output:
textSex Age Height Weight Year Season Sport Medal Team Afghanistan 2 2 2 2 2 2 2 2 Algeria 10 10 10 10 10 10 10 10 Argentina 161 161 161 161 161 161 161 161 Armenia 13 13 13 13 13 13 13 13 Australia 693 693 693 693 693 693 693 693 ... ... ... ... ... ... ... ... ... Uruguay 1 1 1 1 1 1 1 1 Uzbekistan 31 31 31 31 31 31 31 31 Venezuela 7 7 7 7 7 7 7 7 Vietnam 4 4 4 4 4 4 4 4 Zimbabwe 7 7 7 7 7 7 7 7
:emoji: Sum Values per Country
python
df.groupby("Team").sum(True)
Expected output:
textAge Height Weight Year Team Afghanistan 46 366.0 126.0 4020 Algeria 252 1748.0 670.0 20060 Argentina 4328 28244.0 11604.0 323184 Armenia 310 2284.0 1137.0 26140 Australia 17794 124592.0 51918.0 1390588 ... ... ... ... ... Uruguay 28 169.0 69.0 2000 Uzbekistan 763 5461.0 2451.0 62312 Venezuela 167 1197.0 456.0 14076 Vietnam 131 659.0 253.0 8040 Zimbabwe 156 1232.0 448.0 14044
:chart_with_upwards_trend: Average Values per Country
python
df.groupby("Team").mean(True)
Expected output:
textAge Height Weight Year Team Afghanistan 23.000000 183.000000 63.000000 2010.000000 Algeria 25.200000 174.800000 67.000000 2006.000000 Argentina 26.881988 175.428571 72.074534 2007.354037 Armenia 23.846154 175.692308 87.461538 2010.769231 Australia 25.676768 179.786436 74.917749 2006.620491 ... ... ... ... ... Uruguay 28.000000 169.000000 69.000000 2000.000000 Uzbekistan 24.612903 176.161290 79.064516 2010.064516 Venezuela 23.857143 171.000000 65.142857 2010.857143 Vietnam 32.750000 164.750000 63.250000 2010.000000 Zimbabwe 22.285714 176.000000 64.000000 2006.285714
:emoji: Grouping by Multiple Columns
You can group by team AND medal type to get more detailed insights:
python
df.groupby(["Team", "Medal"]).mean(True)
Expected output:
textAge Height Weight Year Team Medal Afghanistan Bronze 23.000000 183.000000 63.000000 2010.000000 Algeria Bronze 23.750000 176.250000 63.750000 2002.000000 Gold 26.500000 166.000000 60.500000 2006.000000 Silver 26.000000 177.750000 73.500000 2010.000000 Argentina Bronze 28.482759 174.551724 70.689655 2006.344828 ... ... ... ... ... Vietnam Gold 41.000000 175.000000 75.000000 2016.000000 Silver 30.000000 161.333333 59.333333 2008.000000 Zimbabwe Bronze 20.000000 176.000000 64.000000 2004.000000 Gold 22.000000 176.000000 64.000000 2006.000000 Silver 23.000000 176.000000 64.000000 2007.000000
:bar_chart: Combining Grouping with Sorting
You can sort your grouped results to find extremes. For example, which countries have the shortest or tallest athletes on average?
python
df.groupby("Team").mean(True).sort_values("Height")
Expected output:
text| | ▼ Age Height Weight Year Team Philippines 25.000000 149.000000 53.000000 2016.000000 North Korea 23.571429 159.357143 61.000000 2009.000000 Indonesia 22.750000 159.937500 61.750000 2005.750000 Eritrea 22.000000 160.000000 54.000000 2004.000000 Thailand 23.583333 161.166667 56.208333 2008.166667 ... ... ... ... ... Serbia 26.552941 191.317647 88.976471 2013.835294 Serbia and Montenegro 27.078947 192.815789 94.342105 2001.368421 Sudan 23.000000 193.000000 72.000000 2008.000000 Netherlands-1 27.000000 202.500000 90.000000 2016.000000 Niger 21.000000 207.000000 90.000000 2016.000000
tip All sorting parameters (like
ascending=False
) work with grouped data too!
Filtering lets you focus on exactly the data you need. It's like using a search filter when shopping online!
df[conditions]
==
: equals!=
: not equals>
: greater than<
: less than>=
: greater than or equal<=
: less than or equal&
: AND (both conditions must be true)|
: OR (at least one condition must be true)~
: NOT (opposite of the condition)Find all Malaysian athletes:
df[(df["Team"] == "Malaysia")]
Expected output:
| |
▼
Sex Age Height Weight Team Year Season Sport Medal
447 M 28 166.0 69.0 Malaysia 2016 Summer Cycling Bronze
1517 M 28 175.0 68.0 Malaysia 2016 Summer Badminton Silver
1597 F 26 150.0 48.0 Malaysia 2016 Summer Diving Silver
3358 F 27 166.0 56.0 Malaysia 2016 Summer Badminton Silver
3359 M 27 180.0 70.0 Malaysia 2016 Summer Badminton Silver
5830 M 25 170.0 68.0 Malaysia 2008 Summer Badminton Silver
5831 M 29 170.0 68.0 Malaysia 2012 Summer Badminton Silver
5832 M 33 170.0 68.0 Malaysia 2016 Summer Badminton Silver
8677 F 19 159.0 55.0 Malaysia 2012 Summer Diving Bronze
8678 F 23 159.0 55.0 Malaysia 2016 Summer Diving Silver
10178 M 27 177.0 77.0 Malaysia 2016 Summer Badminton Silver
Find Malaysian athletes who are 28 or older:
df[(df["Team"] == "Malaysia") & (df["Age"] >= 28)]
Expected output:
| |
| |
▼ ▼
Sex Age Height Weight Team Year Season Sport Medal
447 M 28 166.0 69.0 Malaysia 2016 Summer Cycling Bronze
1517 M 28 175.0 68.0 Malaysia 2016 Summer Badminton Silver
5831 M 29 170.0 68.0 Malaysia 2012 Summer Badminton Silver
5832 M 33 170.0 68.0 Malaysia 2016 Summer Badminton Silver
:bulb: Always save filtered data to a new variable to keep your original data safe!
python
df_malaysia = df[(df["Team"] == "Malaysia")]
This creates a new DataFrame with only Malaysian athletes, while keeping the original
df
unchanged.:rocket: Keep Learning!
Want to discover more powerful data manipulation techniques? Check out the Pandas Documentation.
:memo: Summary
You've learned three essential data manipulation techniques:
- Sorting - Organize data in meaningful order
- Grouping - Analyze data by categories
- Filtering - Focus on specific subsets of data
These skills form the foundation of data analysis. With practice, you'll be able to uncover insights hidden in any dataset!
:movie_camera: Video Tutorial
:emoji: AI Practice Prompts
Data Manipulation with Pandas: Sorting, Filtering, and Transforming
Practice with AI to strengthen your skills! Try these prompts:
:bar_chart: Sorting Practice
- "How do I sort a Pandas DataFrame by age in descending order?"
- "Show me how to sort by multiple columns in Pandas"
:mag: Filtering Practice
- "Generate code to filter a DataFrame for athletes taller than 180cm"
- "How can I filter for gold medal winners from the USA?"
:card_index_dividers: Grouping Practice
- "Create code to find the average age of athletes by sport"
- "How do I count medals won by each country?" tip :bulb: Challenge: Try combining all three techniques! Filter for a specific sport, group by country, and sort by average height.