Lesson 15 of 20

Concept 15: Pandas

What is Pandas?

:dart: Learning Objectives

By the end of this lesson, you will be able to:

- Understand what Pandas is and why it's important for data analysis
- Create and work with Pandas DataFrames
- Access specific rows and columns in your data
- Calculate basic statistics from your data
- Find relationships between different data columns

:information_source: Definition: Pandas is a Python package that provides high-performance, easy-to-use data structures and data analysis tools. Think of it as a super-powered spreadsheet that you can control with code!

Why Use Pandas?

Pandas helps you work with data - you can clean it, analyze it, and prepare it for machine learning projects.
Pandas is built on top of NumPy, which makes it super fast at handling large amounts of data.
There are two main data structures in Pandas:
- Pandas Series: A one-dimensional labeled array (like a single column of data)
- Pandas DataFrame: A 2-dimensional labeled data structure (like a whole spreadsheet)
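
To make the difference concrete, here is a minimal sketch that builds one of each. The variable names (math_scores, scores) and the three-student sample are made up for illustration.

```python
import pandas as pd

# A Series: one-dimensional, with a label for each value
math_scores = pd.Series([87, 42, 68], index=["Adam", "Bob", "Crystal"], name="math")

# A DataFrame: two-dimensional, with row labels and column names
scores = pd.DataFrame(
    {"math": [87, 42, 68], "science": [78, 51, 50]},
    index=["Adam", "Bob", "Crystal"],
)

print(math_scores.ndim)              # 1
print(scores.ndim)                   # 2
print(type(scores["math"]).__name__)  # Series: each column of a DataFrame is a Series
```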

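:bulb: Most DataFrame methods, such as sorting or selecting, return a new result and leave the original data untouched. To keep a change, save the result back to a variable using =, for example df = df.sort_values("math"). Direct assignments such as df.index = [...] do change the DataFrame itself.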
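
The tip above can be checked with a small experiment. This is only a sketch: sort_values is used as one example of a method that returns a new DataFrame, and the sample data and variable names are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({"students": ["Adam", "Bob", "Crystal"], "math": [87, 42, 68]})

# sort_values returns a new, sorted DataFrame; df itself is left untouched
sorted_copy = df.sort_values("math")
original_order = df["math"].tolist()
print(original_order)                # [87, 42, 68]
print(sorted_copy["math"].tolist())  # [42, 68, 87]

# To keep the change, assign the result back to the same variable
df = df.sort_values("math")
print(df["math"].tolist())           # [42, 68, 87]
```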

Learn more about Pandas at the official website: https://pandas.pydata.org/

Pandas DataFrame :bar_chart:

Most data in the world is stored in tables (like spreadsheets).

A Pandas DataFrame creates a table with labeled rows and columns.

Every DataFrame has three main parts:

- Data: The actual information or values in your table
- Index: Row labels (starting from 0 by default, like counting)
- Columns: Column names that describe what each column contains
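
As a rough illustration, each of the three parts can be looked at on its own. The two-row table below is a made-up sample; to_numpy() is one way to see the raw values.

```python
import pandas as pd

df = pd.DataFrame({"students": ["Adam", "Bob"], "math": [87, 42]})

print(df.to_numpy())     # Data: the values, as a NumPy array
print(list(df.index))    # Index: [0, 1]
print(list(df.columns))  # Columns: ['students', 'math']
```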

How to Use Pandas :rocket:

Let's learn Pandas step by step:

- Install and Import Pandas: Getting Pandas ready to use
- Create Pandas DataFrame: Making your own data tables
- Basic syntax of Pandas DataFrame: Essential commands you need to know
- Access columns of Pandas DataFrame: Getting data from specific columns
- Access rows of Pandas DataFrame: Getting data from specific rows
- Correlation of data columns: Finding relationships between data

Install and Import Pandas :package:
To use Pandas, VS Code users need to install it first. Type this command in your terminal (py is the Python launcher on Windows; on macOS or Linux, use python3 instead):
py -m pip install pandas
Note for Google Colab users: Good news! Pandas is already installed for you.

Pandas is built on NumPy and the two are often used together, so we import both packages:

python

import numpy as np
import pandas as pd

This code imports the packages we need
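
One optional way to confirm that the installation worked is to print the version numbers after importing. This check is an addition to the lesson steps, and the versions you see will depend on your own setup.

```python
import numpy as np
import pandas as pd

# If both lines print a version number, the packages are installed and importable
print("pandas", pd.__version__)
print("numpy", np.__version__)
```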

Create Pandas DataFrame

To create a DataFrame, we use pd.DataFrame(data, index, columns). Here's how to make a simple score table:

python

df = pd.DataFrame(
    [
        ["Adam", 87, 78, 90],
        ["Bob", 42, 51, 66],
        ["Crystal", 68, 50, 42],
        ["David", 99, 86, 83],
        ["Edmund", 53, 70, 91]
    ],
    columns=["students", "math", "science", "english"]
)

# Display the DataFrame
print(df)

This creates a table of student scores

When you print the DataFrame, it looks like this:

python

print(df)

Expected output:

text

  students  math  science  english
0     Adam    87       78       90
1      Bob    42       51       66
2  Crystal    68       50       42
3    David    99       86       83
4   Edmund    53       70       91

The DataFrame displays as a neat table

You can also create DataFrames using dictionaries. This method is often easier to read:

python

score_data = {
    "students" : ["Adam", "Bob", "Crystal", "David", "Edmund"],
    "math" : [87, 42, 68, 99, 53],
    "science" : [78, 51, 50, 86, 70],
    "english" : [90, 66, 42, 83, 91]
}
df = pd.DataFrame(score_data)

Using a dictionary to create the same DataFrame

Both methods create the exact same table!
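
If you want to check this yourself, the sketch below compares the two approaches with equals() on a shortened two-student version of the table. It also shows the optional index argument from pd.DataFrame(data, index, columns); the names from_rows, from_dict and labeled are made up for this example.

```python
import pandas as pd

columns = ["students", "math", "science", "english"]
rows = [
    ["Adam", 87, 78, 90],
    ["Bob", 42, 51, 66],
]
from_rows = pd.DataFrame(rows, columns=columns)

from_dict = pd.DataFrame({
    "students": ["Adam", "Bob"],
    "math": [87, 42],
    "science": [78, 51],
    "english": [90, 66],
})

# equals() is True when both tables hold the same values, labels and types
print(from_rows.equals(from_dict))  # True

# The optional index argument sets the row labels when the table is created
labeled = pd.DataFrame(rows, index=["a", "b"], columns=columns)
print(list(labeled.index))  # ['a', 'b']
```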

Basic DataFrame Commands :wrench:

Here are 4 essential commands to explore your DataFrame:
- columns: Shows all column names
- index: Shows all row labels
- head(): Shows the first 5 rows (great for peeking at your data!)
- info(): Shows basic information about your DataFrame
Let's try each command:

python

df.columns

Expected output:

text

Index(['students', 'math', 'science', 'english'], dtype='object')

Shows all column names in your DataFrame

python

df.index

Expected output:

text

RangeIndex(start=0, stop=5, step=1)

Shows the row labels (index) of your DataFrame

Viewing DataFrame Content

python

df.head()

Expected output:

text

  students  math  science  english
0     Adam    87       78       90
1      Bob    42       51       66
2  Crystal    68       50       42
3    David    99       86       83
4   Edmund    53       70       91

Shows the first 5 rows - perfect for checking your data!
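
head() also accepts the number of rows you want, which helps with bigger tables. The sketch below uses a shortened two-column version of the score table; tail() is a related method not covered above, shown here only for comparison.

```python
import pandas as pd

df = pd.DataFrame({
    "students": ["Adam", "Bob", "Crystal", "David", "Edmund"],
    "math": [87, 42, 68, 99, 53],
})

first_two = df.head(2)  # rows at positions 0 and 1
last_two = df.tail(2)   # rows at positions 3 and 4

print(first_two["students"].tolist())  # ['Adam', 'Bob']
print(last_two["students"].tolist())   # ['David', 'Edmund']
```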

python

df.info()

Expected output:

text

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   students  5 non-null      object
 1   math      5 non-null      int64
 2   science   5 non-null      int64
 3   english   5 non-null      int64
dtypes: int64(3), object(1)
memory usage: 288.0+ bytes

Shows detailed information about your DataFrame (the exact memory usage figure can vary between versions)

You can change the index labels to make them more meaningful:

python

df.index = ["a","b","c","d","e"]
print(df)

Expected output:

text

  students  math  science  english
a     Adam    87       78       90
b      Bob    42       51       66
c  Crystal    68       50       42
d    David    99       86       83
e   Edmund    53       70       91

Now each row has a letter label instead of a number
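
Another common way to get meaningful row labels is to reuse one of the columns. The sketch below uses set_index on a shortened version of the table; by_name is a made-up variable name, and this approach is offered as an alternative rather than as part of the lesson's own steps.

```python
import pandas as pd

df = pd.DataFrame({
    "students": ["Adam", "Bob", "Crystal"],
    "math": [87, 42, 68],
})

# set_index returns a new DataFrame whose row labels come from a column
by_name = df.set_index("students")

print(list(by_name.index))    # ['Adam', 'Bob', 'Crystal']
print(list(by_name.columns))  # ['math']
```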

Access Columns of Pandas DataFrame

To get data from one column, use df[column_name]:

python

df["math"]

Expected output:

text

a    87
b    42
c    68
d    99
e    53
Name: math, dtype: int64

This shows all math scores with their row labels

To get multiple columns at once, use double square brackets df[[column1, column2]]:

python

df[["students", "math"]]

Expected output:

text

  students  math
a     Adam    87
b      Bob    42
c  Crystal    68
d    David    99
e   Edmund    53

Shows only the students and math columns
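
Single and double brackets give back different kinds of objects, which matters when you pass the result to other code. A small sketch, using a shortened table and made-up variable names:

```python
import pandas as pd

df = pd.DataFrame({
    "students": ["Adam", "Bob", "Crystal"],
    "math": [87, 42, 68],
})

one_column = df["math"]   # single brackets: a Series
as_table = df[["math"]]   # double brackets: a DataFrame with one column

print(type(one_column).__name__)  # Series
print(type(as_table).__name__)    # DataFrame
print(as_table.shape)             # (3, 1)
```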

Access Rows of Pandas DataFrame :clipboard:

To get a specific row by its position number, use df.iloc[index_number]:

python

df.iloc[1]

Expected output:

text

students    Bob
math         42
science      51
english      66
Name: b, dtype: object

Gets the second row (remember: counting starts at 0!)

Just like with lists, you can get multiple rows using df.iloc[start:end]:

python

df.iloc[1:3]

Expected output:

text

  students  math  science  english
b      Bob    42       51       66
c  Crystal    68       50       42

Gets rows from position 1 to 2 (not including 3)

You can also use df.loc[index_name] to get a row by its label name:

python

df.loc['e']

Expected output:

text

students    Edmund
math            53
science         70
english         91
Name: e, dtype: object

Gets the row with label 'e'
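
Both iloc and loc also accept a row and a column together, which picks out a single cell. The sketch below uses a shortened table to show this, along with one difference worth knowing: position slices stop before the end, while label slices include it.

```python
import pandas as pd

df = pd.DataFrame(
    {"students": ["Adam", "Bob", "Crystal"], "math": [87, 42, 68]},
    index=["a", "b", "c"],
)

# Row and column together: [row, column]
print(df.iloc[1, 1])        # 42 (second row, second column, by position)
print(df.loc["b", "math"])  # 42 (same cell, by label)

# Position slices stop before the end; label slices include it
print(len(df.iloc[0:2]))     # 2 rows: positions 0 and 1
print(len(df.loc["a":"c"]))  # 3 rows: 'a', 'b' and 'c'
```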

Descriptive Statistics with Pandas :chart_with_upwards_trend:

Descriptive Statistics help you understand your data quickly.
Here are 5 useful statistical functions:
- count(): Counts how many values are in each column
- sum(): Adds up all values in each column
- mean(): Calculates the average of each column
- min(): Finds the smallest value in each column
- max(): Finds the largest value in each column

python

df.count()

Expected output:

text

students    5
math        5
science     5
english     5
dtype: int64

Counts the values in each column (all have 5 values)

python

df.sum()

Expected output:

text

students    AdamBobCrystalDavidEdmund
math                               349
science                            335
english                            372
dtype: object

Adds up values (notice how text gets concatenated!)

python

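df.mean(numeric_only=True)

Expected output:

text

math       69.8
science    67.0
english    74.4
dtype: float64

Calculates averages for numeric columns only (numeric_only=True skips the text column, which pandas 2.0 and later would otherwise reject with an error)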
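
The same functions also work on a single column, which is often all you need. A minimal sketch using the math scores from the table (math_scores is a made-up variable name):

```python
import pandas as pd

df = pd.DataFrame({
    "students": ["Adam", "Bob", "Crystal", "David", "Edmund"],
    "math": [87, 42, 68, 99, 53],
})

math_scores = df["math"]
print(math_scores.sum())   # 349
print(math_scores.mean())  # 69.8
print(math_scores.min())   # 42
print(math_scores.max())   # 99
```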

python

df.min()

Expected output:

text

students    Adam
math          42
science       50
english       42
dtype: object

Finds minimum values (for text, "Adam" comes first alphabetically)

python

df.max()

Expected output:

text

students    Edmund
math            99
science         86
english         91
dtype: object

Finds maximum values (for text, "Edmund" comes last alphabetically)

The describe() function gives you all the statistics at once:

python

df.describe()

Expected output:

text

           math    science    english
count  5.000000   5.000000   5.00000
mean  69.800000  67.000000  74.40000
std   23.488295  16.093477  20.69541
min   42.000000  50.000000  42.00000
25%   53.000000  51.000000  66.00000
50%   68.000000  70.000000  83.00000
75%   87.000000  78.000000  90.00000
max   99.000000  86.000000  91.00000

A complete statistical summary of your numeric data

:bulb: The describe() function shows count, mean, standard deviation (std), minimum, quartiles (25%, 50%, 75%), and maximum values.
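
To see where the std row comes from, the sketch below recomputes the standard deviation of the math scores by hand and compares it with the built-in std(). The variable names are made up for illustration.

```python
import pandas as pd

math_scores = pd.Series([87, 42, 68, 99, 53])

mean = math_scores.sum() / math_scores.count()
squared_gaps = (math_scores - mean) ** 2
# pandas reports the sample standard deviation, dividing by n - 1
manual_std = (squared_gaps.sum() / (math_scores.count() - 1)) ** 0.5

print(round(manual_std, 6))         # 23.488295
print(round(math_scores.std(), 6))  # 23.488295
```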

Learn more about describe() here: https://www.tutorialspoint.com/python_pandas/python_pandas_descriptive_statistics.htm

Data Correlations with Pandas :link:

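To find relationships between numeric columns, use df.corr(numeric_only=True). The numeric_only=True argument skips the text column, which pandas 2.0 and later would otherwise reject with an error:

python

df.corr(numeric_only=True)

Expected output:

text

             math   science   english
math     1.000000  0.773131  0.273812
science  0.773131  1.000000  0.803156
english  0.273812  0.803156  1.000000

Shows how strongly each subject relates to others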

:information_source: Understanding Correlation Values:

- Values range from -1 to 1
- Close to 1: Strong positive relationship (when one goes up, the other goes up)
- Close to -1: Strong negative relationship (when one goes up, the other goes down)
- Close to 0: Little or no relationship

In our example, science and English have a correlation of 0.80, which means students who do well in science often do well in English too!
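
If you only care about one pair of columns, you can compute that single value directly. The sketch below uses Series.corr and then rebuilds the same number from its definition (covariance divided by the product of the standard deviations). The variable names r and manual are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "science": [78, 51, 50, 86, 70],
    "english": [90, 66, 42, 83, 91],
})

# Series.corr computes the Pearson correlation between two columns
r = df["science"].corr(df["english"])
print(round(r, 2))  # 0.8

# The same number from the definition: covariance / (std * std)
manual = df["science"].cov(df["english"]) / (df["science"].std() * df["english"].std())
print(round(manual, 2))  # 0.8
```

With only five students, treat a value like this as a hint rather than proof of a relationship.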

Explore more about correlation:

Pandas Data Correlations

Understanding Correlation

Summary

In this lesson, you learned:

- Pandas is a powerful Python library for working with data
- DataFrames are like spreadsheets you can control with code
- You can access specific columns and rows of your data
- Statistical functions help you understand your data quickly
- Correlation shows relationships between different data columns

Practice with AI

Introduction to Pandas and DataFrames

Code with AI: Try using AI to work with Pandas DataFrames.

Prompts to try:

- "How do I create a Pandas DataFrame from a CSV file?"
- "Show me how to access specific columns and rows in a DataFrame."
- "Help me calculate the average of a column in my DataFrame."
- "How can I find the correlation between two columns in Pandas?"

Practice Exercise: Create a DataFrame with data about your favorite movies (title, year, rating) and use the functions you learned to analyze it!
