By the end of this lesson, you will be able to:
:information_source: Definition: Pandas is a Python package that provides high-performance, easy-to-use data structures and data analysis tools. Think of it as a super-powered spreadsheet that you can control with code!
Pandas helps you work with data - you can clean it, analyze it, and prepare it for machine learning projects.
Pandas is built on top of NumPy, which makes it super fast at handling large amounts of data.
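There are two main data structures in Pandas: the Series (a single labeled column of values) and the DataFrame (a table made of labeled rows and columns).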
:bulb: When you change a DataFrame to view it differently, the original data stays safe! It only changes if you save it back to the same variable using `=`.
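A minimal sketch of the tip above, assuming a tiny made-up table and using `sort_values` as the example of "viewing the data differently" (both are illustrative choices, not part of the lesson's dataset):

```python
import pandas as pd

# Small illustrative table (made-up values)
df = pd.DataFrame({"name": ["Bob", "Adam"], "score": [42, 87]})

# sort_values returns a new, sorted DataFrame; df itself is untouched
sorted_view = df.sort_values("score", ascending=False)
original_order = df["name"].tolist()
print(original_order)                # ['Bob', 'Adam']
print(sorted_view["name"].tolist())  # ['Adam', 'Bob']

# Assigning the result back with = is what replaces the original
df = df.sort_values("score", ascending=False)
print(df["name"].tolist())           # ['Adam', 'Bob']
```

Not every operation behaves this way: assigning directly to an attribute such as `df.index` changes the DataFrame in place.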
- Learn more about Pandas at the official website: https://pandas.pydata.org/
Pandas DataFrame :bar_chart:
Most data in the world is stored in tables (like spreadsheets).
A Pandas DataFrame creates a table with labeled rows and columns.
Every DataFrame has three main parts:
- Data: The actual information or values in your table
- Index: Row labels (starting from 0 by default, like counting)
- Columns: Column names that describe what each column contains
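The three parts listed above can each be inspected separately. The sketch below assumes a two-row table with student names as row labels, chosen only for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {"math": [87, 42], "science": [78, 51]},  # Data
    index=["Adam", "Bob"],                    # Index (row labels)
)

print(df.to_numpy())        # Data as a NumPy array
print(df.index.tolist())    # ['Adam', 'Bob']
print(df.columns.tolist())  # ['math', 'science']
```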
How to Use Pandas :rocket:
- Let's learn Pandas step by step:
- Install and Import Pandas: Getting Pandas ready to use
- Create Pandas DataFrame: Making your own data tables
- Basic syntax of Pandas DataFrame: Essential commands you need to know
- Access columns of Pandas DataFrame: Getting data from specific columns
- Access rows of Pandas DataFrame: Getting data from specific rows
- Correlation of data columns: Finding relationships between data
Install and Import Pandas :package:
- To use Pandas, VSCode users need to install it first. Type this command in your terminal (`py` is the Windows launcher; on macOS or Linux, use `python3 -m pip install pandas` instead):
py -m pip install pandas
Note for Google Colab users: Good news! Pandas is already installed for you.
import numpy as np
import pandas as pd
This code imports the packages we need
To create a DataFrame, use `pd.DataFrame(data, index, columns)`. Here's how to make a simple score table:
df = pd.DataFrame(
[
["Adam", 87, 78, 90],
["Bob", 42, 51, 66],
["Crystal", 68, 50, 42],
["David", 99, 86, 83],
["Edmund", 53, 70, 91]
],
columns=["students", "math", "science", "english"]
)
# Display the DataFrame
print(df)
This creates a table of student scores
print(df)
Expected output:
students math science english
0 Adam 87 78 90
1 Bob 42 51 66
2 Crystal 68 50 42
3 David 99 86 83
4 Edmund 53 70 91
The DataFrame displays as a neat table
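The example above passes only `data` and `columns`. As a rough sketch of the third argument, the block below also passes `index` to label the rows at creation time. The name `df_labeled` and the choice of student names as labels are assumptions for illustration:

```python
import pandas as pd

df_labeled = pd.DataFrame(
    [[87, 78, 90], [42, 51, 66]],            # data: one inner list per row
    index=["Adam", "Bob"],                   # row labels
    columns=["math", "science", "english"],  # column names
)

print(df_labeled)
print(df_labeled.shape)  # (2, 3): two rows, three columns
```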
score_data = {
"students" : ["Adam", "Bob", "Crystal", "David", "Edmund"],
"math" : [87, 42, 68, 99, 53],
"science" : [78, 51, 50, 86, 70],
"english" : [90, 66, 42, 83, 91]
}
df = pd.DataFrame(score_data)
Using a dictionary to create the same DataFrame
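To check that the two styles really build the same table, one option is `DataFrame.equals`. This is a small sketch on a shortened two-student table, with `from_rows` and `from_dict` as illustrative names:

```python
import pandas as pd

# Row-by-row style: a list of rows plus column names
from_rows = pd.DataFrame(
    [["Adam", 87], ["Bob", 42]],
    columns=["students", "math"],
)

# Dictionary style: one key per column
from_dict = pd.DataFrame({"students": ["Adam", "Bob"], "math": [87, 42]})

# equals compares values, labels and column types
print(from_rows.equals(from_dict))  # True
```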
Here are 4 essential commands to explore your DataFrame:
Let's try each command:
df.columns
Expected output:
Index(['students', 'math', 'science', 'english'], dtype='object')
Shows all column names in your DataFrame
df.index
Expected output:
RangeIndex(start=0, stop=5, step=1)
Shows the row labels (index) of your DataFrame
df.head()
Expected output:
students math science english
0 Adam 87 78 90
1 Bob 42 51 66
2 Crystal 68 50 42
3 David 99 86 83
4 Edmund 53 70 91
Shows the first 5 rows - perfect for checking your data!
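`head()` also accepts the number of rows you want, and `tail()` (not covered in this lesson) does the same from the bottom. A short sketch on a two-column version of the score table:

```python
import pandas as pd

df = pd.DataFrame({
    "students": ["Adam", "Bob", "Crystal", "David", "Edmund"],
    "math": [87, 42, 68, 99, 53],
})

first_two = df.head(2)  # first 2 rows instead of the default 5
last_two = df.tail(2)   # last 2 rows

print(first_two)
print(last_two)
```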
df.info()
Expected output:
<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, a to e
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 students 5 non-null object
1 math 5 non-null int64
2 science 5 non-null int64
3 english 5 non-null int64
dtypes: int64(3), object(1)
memory usage: 200.0+ bytes
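Shows detailed information about your DataFrame. Note that this output was captured after the rows were relabeled "a" to "e" (the next step); with the default index, the second line reads `RangeIndex: 5 entries, 0 to 4` and the memory figure differs.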
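If you only need part of what `info()` reports, the `shape` and `dtypes` attributes (not covered above) give the table size and the column types directly. A minimal sketch using the same score table:

```python
import pandas as pd

df = pd.DataFrame({
    "students": ["Adam", "Bob", "Crystal", "David", "Edmund"],
    "math": [87, 42, 68, 99, 53],
    "science": [78, 51, 50, 86, 70],
    "english": [90, 66, 42, 83, 91],
})

print(df.shape)   # (5, 4): 5 rows, 4 columns
print(df.dtypes)  # one type per column
```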
df.index = ["a","b","c","d","e"]
print(df)
Expected output:
students math science english
a Adam 87 78 90
b Bob 42 51 66
c Crystal 68 50 42
d David 99 86 83
e Edmund 53 70 91
Now each row has a letter label instead of a number
To access a single column, use `df[column_name]`:
df["math"]
Expected output:
a 87
b 42
c 68
d 99
e 53
Name: math, dtype: int64
This shows all math scores with their row labels
To access several columns at once, use `df[[column1, column2]]`:
df[["students", "math"]]
Expected output:
students math
a Adam 87
b Bob 42
c Crystal 68
d David 99
e Edmund 53
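Single and double brackets return different kinds of objects, which matters when you chain further operations. A small sketch, with `math_series` and `subset` as illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    "students": ["Adam", "Bob", "Crystal", "David", "Edmund"],
    "math": [87, 42, 68, 99, 53],
    "science": [78, 51, 50, 86, 70],
})

math_series = df["math"]          # single brackets: one column as a Series
subset = df[["students", "math"]]  # double brackets: a smaller DataFrame

print(type(math_series).__name__)  # Series
print(type(subset).__name__)       # DataFrame
print(subset.columns.tolist())     # ['students', 'math']
```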
To access a row by its position, use `df.iloc[index_number]`:
df.iloc[1]
Expected output:
students Bob
math 42
science 51
english 66
Name: b, dtype: object
Gets the second row (remember: counting starts at 0!)
To access a range of rows by position, use `df.iloc[start:end]`:
df.iloc[1:3]
Expected output:
students math science english
b Bob 42 51 66
c Crystal 68 50 42
Gets rows from position 1 to 2 (not including 3)
Use `df.loc[index_name]` to get a row by its label name:
df.loc['e']
Expected output:
students Edmund
math 53
science 70
english 91
Name: e, dtype: object
Gets the row with label 'e'
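One difference between the two accessors is easy to miss: a position slice with `iloc` leaves out the end, while a label slice with `loc` includes it. The sketch below illustrates this on a shortened version of the table, and also shows `loc` with a row label and a column name to read a single cell:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "students": ["Adam", "Bob", "Crystal", "David", "Edmund"],
        "math": [87, 42, 68, 99, 53],
    },
    index=["a", "b", "c", "d", "e"],
)

by_position = df.iloc[1:3]  # positions 1 and 2 (3 is excluded)
by_label = df.loc["b":"d"]  # labels b, c and d (d is included)
cell = df.loc["e", "math"]  # one value: row 'e', column 'math'

print(by_position.index.tolist())  # ['b', 'c']
print(by_label.index.tolist())     # ['b', 'c', 'd']
print(cell)                        # 53
```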
Descriptive Statistics help you understand your data quickly.
Here are 5 useful statistical functions:
df.count()
Expected output:
students 5
math 5
science 5
english 5
dtype: int64
Counts the values in each column (all have 5 values)
df.sum()
Expected output:
students AdamBobCrystalDavidEdmund
math 349
science 335
english 372
dtype: object
Adds up values (notice how text gets concatenated!)
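df.mean(numeric_only=True)
Expected output:
math 69.8
science 67.0
english 74.4
dtype: float64
Calculates averages for the numeric columns only. Recent Pandas versions raise an error on the text column if you call `df.mean()` without `numeric_only=True`.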
df.min()
Expected output:
students Adam
math 42
science 50
english 42
dtype: object
Finds minimum values (for text, "Adam" comes first alphabetically)
df.max()
Expected output:
students Edmund
math 99
science 86
english 91
dtype: object
Finds maximum values (for text, "Edmund" comes last alphabetically)
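These functions also work on a single column or on a hand-picked set of columns, which avoids mixing text and numbers. A sketch under that approach; `idxmax` (not covered above) is used here to find the row label of the highest math score:

```python
import pandas as pd

df = pd.DataFrame({
    "students": ["Adam", "Bob", "Crystal", "David", "Edmund"],
    "math": [87, 42, 68, 99, 53],
    "science": [78, 51, 50, 86, 70],
    "english": [90, 66, 42, 83, 91],
})

# Select only the score columns, then compute the averages
scores = df[["math", "science", "english"]]
print(scores.mean())

# Statistics on one column return a single value
print(df["math"].max())  # 99

# Look up which student holds the highest math score
top_student = df.loc[df["math"].idxmax(), "students"]
print(top_student)       # David
```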
The `describe()` function gives you all the statistics at once:
df.describe()
Expected output:
math science english
count 5.000000 5.000000 5.00000
mean 69.800000 67.000000 74.40000
std 23.488295 16.093477 20.69541
min 42.000000 50.000000 42.00000
25% 53.000000 51.000000 66.00000
50% 68.000000 70.000000 83.00000
75% 87.000000 78.000000 90.00000
max 99.000000 86.000000 91.00000
A complete statistical summary of your numeric data
:bulb: The `describe()` function shows count, mean, standard deviation (std), minimum, quartiles (25%, 50%, 75%), and maximum values.
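The result of `describe()` is itself a DataFrame, so individual statistics can be read out of it with `loc`. A minimal sketch, with `summary` as an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({
    "students": ["Adam", "Bob", "Crystal", "David", "Edmund"],
    "math": [87, 42, 68, 99, 53],
    "science": [78, 51, 50, 86, 70],
    "english": [90, 66, 42, 83, 91],
})

summary = df.describe()  # numeric columns only by default

print(summary.index.tolist())         # the eight statistic names
print(summary.loc["mean", "math"])    # average math score, about 69.8
print(summary.loc["50%", "science"])  # median science score: 70.0
```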
- Learn more about `describe()` here: https://www.tutorialspoint.com/python_pandas/python_pandas_descriptive_statistics.htm
Data Correlations with Pandas :link:
- To find relationships between columns, use `df.corr()`. Recent Pandas versions need `numeric_only=True` so that the text column is skipped:
df.corr(numeric_only=True)
Expected output:
math science english
math 1.000000 0.773131 0.273812
science 0.773131 1.000000 0.803156
english 0.273812 0.803156 1.000000
Shows how strongly each subject relates to others
:information_source: Understanding Correlation Values:
- Values range from -1 to 1
- Close to 1: Strong positive relationship (when one goes up, the other goes up)
- Close to -1: Strong negative relationship (when one goes up, the other goes down)
- Close to 0: Little or no relationship
In our example, science and English have a correlation of 0.80, which means students who do well in science often do well in English too!
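If you only care about one pair of columns, `Series.corr` compares them directly without building the whole matrix. A sketch using the lesson's scores; the `numeric_only=True` argument assumes Pandas 1.5 or newer:

```python
import pandas as pd

df = pd.DataFrame({
    "students": ["Adam", "Bob", "Crystal", "David", "Edmund"],
    "math": [87, 42, 68, 99, 53],
    "science": [78, 51, 50, 86, 70],
    "english": [90, 66, 42, 83, 91],
})

# Correlation between two specific columns
r = df["science"].corr(df["english"])
print(round(r, 2))  # 0.8

# Full matrix, skipping the text column (assumes Pandas 1.5+)
matrix = df.corr(numeric_only=True)
print(matrix.loc["math", "science"])
```

With only five students, treat these values as an illustration of the method rather than strong evidence of a real relationship.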
- Explore more about correlation:
Summary
In this lesson, you learned:
- Pandas is a powerful Python library for working with data
- DataFrames are like spreadsheets you can control with code
- You can access specific columns and rows of your data
- Statistical functions help you understand your data quickly
- Correlation shows relationships between different data columns
Practice with AI
Introduction to Pandas and DataFrames
Code with AI: Try using AI to work with Pandas DataFrames.
Prompts to try:
- "How do I create a Pandas DataFrame from a CSV file?"
- "Show me how to access specific columns and rows in a DataFrame."
- "Help me calculate the average of a column in my DataFrame."
- "How can I find the correlation between two columns in Pandas?"
Practice Exercise: Create a DataFrame with data about your favorite movies (title, year, rating) and use the functions you learned to analyze it!