Student starter code (30% baseline)
- `index.html` - Main HTML page
- `script.js` - JavaScript logic
- `styles.css` - Styling and layout
- `package.json` - Dependencies
- `setup.sh` - Setup script
- `README.md` - Instructions (below)

💡 Download the ZIP, extract it, and follow the instructions below to get started!
Estimated Time: 60 minutes
Course: AI-1 Data Analysis and Data Science
Activity: 15
Master the essential data manipulation library! This activity introduces Pandas DataFrames and Series - the backbone of data analysis workflows.
By the end of this activity, you will be able to:
Choose your development environment based on your needs:
Best for: Local development with cloud runtime consistency
Setup (first time only):
Open in VS Code:
code activity-15-pandas-basics.ipynb
Select Google Colab Kernel:
`Cmd+Shift+P` -> Type "Select Kernel" -> Choose "Google Colab"

Authenticate:
Run First Cell:
Advantages:
First Connection: ~30-60 seconds
Subsequent Connections: ~10-15 seconds
Best for: Quick access from any device, no setup required
Setup:
Open in Browser:
activity-15-pandas-basics.ipynb

Run Cells:
`Shift+Enter`

Advantages:
Note: Web Colab has the same free tier limits as VS Code integration
Best for: Offline work, quick edits, familiar environment
Setup:
# Install dependencies (one-time)
pip install -r requirements.txt
# Start Jupyter Notebook
jupyter notebook activity-15-pandas-basics.ipynb
# Or use Jupyter Lab (modern interface)
jupyter lab activity-15-pandas-basics.ipynb
Advantages:
Limitations:
Use When: Quick edits, no internet, testing local changes
Note: Large DataFrames (100k+ rows) may increase memory usage. Use chunking for datasets >1 GB.
Before starting, understand these session limits:
| Limit | Value | Impact |
|---|---|---|
| Session Duration | 12 hours max | Restart every 12 hours |
| Idle Timeout | 90 minutes | Disconnects if inactive |
| RAM | 12GB | Sufficient for this activity |
💡 Tips:
- `Cmd+S` (Mac) or `Ctrl+S` (Windows) to save your work
- `df.to_csv('checkpoint.csv', index=False)` to checkpoint a DataFrame to disk

65% of the code is implemented for you:
Before jumping into TODOs:
Estimated Time: 10 minutes
What You'll Build: Create Pandas DataFrames from dictionaries, lists, and NumPy arrays.
Success Criteria:
- `.info()` and `.describe()`

Hints:
- Dictionary: `pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})`
- List with column names: `pd.DataFrame(data, columns=['A', 'B', 'C'])`
- NumPy array: `pd.DataFrame(np.array([[1,2], [3,4]]), columns=['X', 'Y'])`
- `df.info()` shows types, `df.describe()` shows statistics
- `df.index = ['row1', 'row2']` for custom row labels

Expected Output:
DataFrame from dictionary:
Name Age City
0 Alice 25 New York
1 Bob 30 London
2 Carol 28 Paris
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 3 non-null object
1 Age 3 non-null int64
2 City 3 non-null object
Key Concepts:
Code Location: Cell 5 (search for # TODO 1:)
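Putting the hints for this TODO together, here is a minimal sketch (using the sample data from the expected output above; not the graded solution):

```python
import numpy as np
import pandas as pd

# From a dictionary: keys become column names
df_dict = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol'],
    'Age': [25, 30, 28],
    'City': ['New York', 'London', 'Paris'],
})

# From a NumPy array, with explicit column names and custom row labels
df_arr = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['X', 'Y'])
df_arr.index = ['row1', 'row2']

# Inspect structure and summary statistics
df_dict.info()
print(df_dict.describe())
```

`.describe()` only summarizes numeric columns by default, so here it reports on `Age` alone.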
Estimated Time: 10 minutes
What You'll Build: Master different ways to access DataFrame data: columns, rows, and specific cells.
Success Criteria:
- `.loc[]` (label-based)
- `.iloc[]` (integer-based)

Hints:
- Single column: `df['column_name']` or `df.column_name`
- Multiple columns: `df[['col1', 'col2']]` (note double brackets)
- Rows by label: `df.loc['row_label']` or `df.loc[0:2]` (inclusive)
- Rows by position: `df.iloc[0]` or `df.iloc[0:3]` (exclusive)
- Single cell: `df.loc['row', 'col']` or `df.iloc[0, 1]`

Expected Output:
Single column (Series):
0 25
1 30
2 28
Name: Age, dtype: int64
Multiple columns (DataFrame):
Name Age
0 Alice 25
1 Bob 30
2 Carol 28
Row by label (loc):
Name Bob
Age 30
City London
Name: 1, dtype: object
Row by position (iloc):
Name Bob
Age 30
City London
Name: 1, dtype: object
Key Concepts:
- `.loc[]`: Label-based indexing (uses row/column names)
- `.iloc[]`: Position-based indexing (uses integer positions)
- `loc` is inclusive, `iloc` is exclusive (like Python lists)

Code Location: Cell 7 (search for `# TODO 2:`)
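A minimal sketch of the selection patterns above, on the sample data from the expected output:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol'],
    'Age': [25, 30, 28],
    'City': ['New York', 'London', 'Paris'],
})

ages = df['Age']              # single column -> Series
pair = df[['Name', 'Age']]    # list of columns -> DataFrame
row_label = df.loc[1]         # label-based (default labels are 0, 1, 2)
row_pos = df.iloc[1]          # position-based; same row here
cell = df.loc[1, 'City']      # single cell by label
print(cell)
```

With the default `RangeIndex`, labels and positions coincide, which is why `df.loc[1]` and `df.iloc[1]` return the same row; the difference shows once you set custom row labels.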
Estimated Time: 15 minutes
What You'll Build: Filter DataFrames using conditional expressions (essential for data analysis).
Success Criteria:
- `&` (AND) and `|` (OR)
- `.isin()` to filter by a list of values
- `.isna()` / `.notna()`

Hints:
- Single condition: `df[df['Age'] > 25]`
- AND: `df[(df['Age'] > 25) & (df['City'] == 'Paris')]` (note parentheses!)
- OR: `df[(df['Age'] > 30) | (df['City'] == 'London')]`
- List membership: `df[df['City'].isin(['Paris', 'London'])]`
- Non-missing values: `df[df['Age'].notna()]`

Expected Output:
Age greater than 25:
Name Age City
1 Bob 30 London
2 Carol 28 Paris
Multiple conditions (Age > 25 AND City = 'Paris'):
Name Age City
2 Carol 28 Paris
Filter by list (City in ['Paris', 'London']):
Name Age City
1 Bob 30 London
2 Carol 28 Paris
Key Concepts:
- Comparison operators: `>`, `<`, `==`, `!=`, `>=`, `<=`
- Boolean operators: `&` (AND), `|` (OR), `~` (NOT) - use parentheses!
- `.isin()`: Check if values are in a list
- `.isna()` / `.notna()`: Detect missing values (NaN)

Code Location: Cell 9 (search for `# TODO 3:`)
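The filtering hints above, sketched on the same sample data (a reference sketch, not the graded cell):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol'],
    'Age': [25, 30, 28],
    'City': ['New York', 'London', 'Paris'],
})

older = df[df['Age'] > 25]                                # single condition
both = df[(df['Age'] > 25) & (df['City'] == 'Paris')]     # AND: parentheses required
either = df[(df['Age'] > 30) | (df['City'] == 'London')]  # OR
in_list = df[df['City'].isin(['Paris', 'London'])]        # membership in a list
print(older)
```

The parentheses matter because `&` and `|` bind tighter than comparison operators in Python; without them you get the "truth value of a Series is ambiguous" error covered in Troubleshooting.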
Estimated Time: 10 minutes
What You'll Build: Add, modify, and delete columns; sort data; handle duplicates.
Success Criteria:
- `.drop()`

Hints:
- New column: `df['new_col'] = df['col1'] + df['col2']`
- Modify in place: `df['Age'] = df['Age'] + 1` (increment all ages)
- Delete: `df.drop('col_name', axis=1, inplace=True)` (axis=1 for columns)
- Sort ascending: `df.sort_values('Age')`
- Sort descending: `df.sort_values('Age', ascending=False)`
- Multi-column sort: `df.sort_values(['City', 'Age'])`

Expected Output:
Added 'Age_Plus_5' column:
Name Age City Age_Plus_5
0 Alice 25 New York 30
1 Bob 30 London 35
2 Carol 28 Paris 33
Sorted by Age (ascending):
Name Age City
0 Alice 25 New York
2 Carol 28 Paris
1 Bob 30 London
Sorted by City then Age:
Name Age City
1 Bob 30 London
0 Alice 25 New York
2 Carol 28 Paris
Key Concepts:
- Add a column: `df['name'] = values`
- `.drop()`: Remove columns (`axis=1`) or rows (`axis=0`)
- `inplace=True`: Modify DataFrame directly (no return value)
- `.sort_values()`: Sort by column(s), default ascending
- Chain operations: `df.sort_values('Age').head(5)`

Code Location: Cell 11 (search for `# TODO 4:`)
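The column and sorting operations above can be sketched like this (note the derived column name `Age_Plus_5` matches the expected output; the rest is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol'],
    'Age': [25, 30, 28],
    'City': ['New York', 'London', 'Paris'],
})

df['Age_Plus_5'] = df['Age'] + 5              # derived column from existing one
by_age = df.sort_values('Age')                # ascending by default
by_city_age = df.sort_values(['City', 'Age']) # sort by City, then Age within City
df = df.drop('Age_Plus_5', axis=1)            # remove the column again (axis=1)
print(by_age)
```

`sort_values` returns a new DataFrame rather than sorting in place, so the original row order in `df` is untouched unless you reassign.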
Estimated Time: 10 minutes
What You'll Build: Detect and handle missing values (NaN) using Pandas methods.
Success Criteria:
- `.isna()` and `.sum()`
- `.dropna()`
- `.fillna()`
- `.replace()`

Hints:
- Count missing: `df.isna().sum()` (per column)
- Drop rows with any NaN: `df.dropna()`
- Drop only all-NaN rows: `df.dropna(how='all')`
- Drop columns with NaN: `df.dropna(axis=1)`
- Fill: `df.fillna(0)` or `df.fillna(df.mean())`
- Forward fill: `df.ffill()` (use previous value; `fillna(method='ffill')` is deprecated in recent Pandas)

Expected Output:
Missing values per column:
Name 0
Age 1
City 2
Salary 1
dtype: int64
After dropping rows with NaN:
Name Age City Salary
0 Alice 25 New York 50000
After filling NaN with 0:
Name Age City Salary
0 Alice 25.0 New York 50000
1 Bob 30.0 0 60000
2 Carol 0.0 Paris 0
Key Concepts:
- `.isna()` / `.isnull()`: Detect missing values (both equivalent)
- `.dropna()`: Remove rows/columns with missing values
- `.fillna()`: Replace missing values with a specified value or method

Code Location: Cell 13 (search for `# TODO 5:`)
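A minimal sketch of the missing-value workflow (a smaller frame than the expected output above, so counts differ; the methods are the same):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol'],
    'Age': [25.0, 30.0, np.nan],
    'City': ['New York', np.nan, 'Paris'],
    'Salary': [50000.0, 60000.0, np.nan],
})

missing = df.isna().sum()   # NaN count per column
complete = df.dropna()      # keep only fully populated rows
filled = df.fillna(0)       # replace every NaN with 0
print(missing)
```

Notice that a numeric column containing NaN is stored as float (`25.0`, not `25`), which is why the expected output above shows `Age` as `25.0` after filling.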
Estimated Time: 15 minutes
What You'll Build: Perform statistical aggregations and group-based calculations.
Success Criteria:
- `.groupby()`
- `.agg()` for custom aggregations

Hints:
- Basic stats: `df['Age'].mean()`, `df['Salary'].sum()`, `df['City'].value_counts()`
- Group and aggregate: `df.groupby('City')['Salary'].mean()`
- Multiple aggregations: `df.groupby('City').agg({'Salary': ['mean', 'sum'], 'Age': 'max'})`
- Group sizes: `df.groupby(['City', 'Department']).size().reset_index()`

Expected Output:
Average salary by city:
City
London 60000.0
New York 50000.0
Paris 55000.0
Name: Salary, dtype: float64
Multiple aggregations:
Salary Age
mean sum max
City
London 60000 60000 30
New York 50000 50000 25
Paris 55000 110000 28
Group size (count per city):
City
London 1
New York 1
Paris 2
dtype: int64
Key Concepts:
- `.groupby()`: Split data into groups based on column values
- `.agg()`: Apply multiple aggregation functions at once
- `.value_counts()`: Count occurrences of unique values

Code Location: Cell 15 (search for `# TODO 6:`)
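The grouping operations above can be sketched as follows (a hypothetical frame with a fourth row, `Dave`, added so one city has two members, matching the group sizes in the expected output):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol', 'Dave'],   # 'Dave' is illustrative only
    'Age': [25, 30, 28, 26],
    'City': ['New York', 'London', 'Paris', 'Paris'],
    'Salary': [50000, 60000, 55000, 55000],
})

avg_salary = df.groupby('City')['Salary'].mean()              # one stat per group
multi = df.groupby('City').agg({'Salary': ['mean', 'sum'],    # several stats at once
                                'Age': 'max'})
sizes = df.groupby('City').size()                             # rows per group
print(avg_salary)
```

The dict-style `.agg()` produces a MultiIndex on the columns, which is why the expected output shows a two-row header (`Salary`/`Age` over `mean sum max`).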
Estimated Time: 15 minutes
What You'll Build: Combine multiple DataFrames using merge, join, and concat operations.
Success Criteria:
- `.merge()` (SQL-like joins)
- `.join()`

Hints:
- Vertical stack: `pd.concat([df1, df2], axis=0)`
- Horizontal stack: `pd.concat([df1, df2], axis=1)`
- Inner join: `pd.merge(df1, df2, on='key', how='inner')`
- Left join: `pd.merge(df1, df2, on='key', how='left')`
- Outer join: `pd.merge(df1, df2, on='key', how='outer')`
- Join on index: `df1.join(df2, how='left')`

Expected Output:
Concatenated DataFrames (vertical):
Name Age
0 Alice 25
1 Bob 30
0 Carol 28
1 David 35
Inner merge (matching keys only):
Name Age City Department
0 Alice 25 New York HR
1 Bob 30 London IT
Left merge (all from left, NaN for missing):
Name Age City Department
0 Alice 25 New York HR
1 Bob 30 London IT
2 Carol 28 Paris NaN
Key Concepts:
- `pd.concat()`: Stack DataFrames vertically (axis=0) or horizontally (axis=1)
- `.merge()`: SQL-style joins using common column(s)
- `.join()`: Shortcut for merging on index

Code Location: Cell 17 (search for `# TODO 7:`)
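A minimal sketch of the merge and concat patterns above (joining on `Name` and the `'Berlin'` row are illustrative assumptions, not the notebook's exact data):

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol'],
    'Age': [25, 30, 28],
    'City': ['New York', 'London', 'Paris'],
})
departments = pd.DataFrame({
    'Name': ['Alice', 'Bob'],       # Carol has no department row
    'Department': ['HR', 'IT'],
})

inner = pd.merge(people, departments, on='Name', how='inner')  # matching keys only
left = pd.merge(people, departments, on='Name', how='left')    # NaN where unmatched
more = pd.DataFrame({'Name': ['David'], 'Age': [35], 'City': ['Berlin']})
stacked = pd.concat([people, more], axis=0)                    # vertical stack
print(left)
```

Note that `concat` keeps the original row labels by default (hence the duplicate `0`, `1` indices in the expected output); pass `ignore_index=True` to renumber.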
.query() (Hard)

Estimated Time: 10 minutes
What You'll Build:
Use the .query() method for cleaner, SQL-like filtering syntax.
Success Criteria:
- `.query()`
- `@` symbol

Hints:
- Simple: `df.query('Age > 25')`
- Combined: `df.query('Age > 25 and City == "Paris"')`
- External variable: `threshold = 30; df.query('Age > @threshold')`
- List membership: `df.query('City in ["Paris", "London"]')`
- Columns with spaces: ``df.query('`Column Name` > 10')`` (use backticks)

Expected Output:
Query: Age > 25
Name Age City
1 Bob 30 London
2 Carol 28 Paris
Query: Age > 25 and City == 'Paris'
Name Age City
2 Carol 28 Paris
Query with variable (threshold=27):
Name Age City
1 Bob 30 London
2 Carol 28 Paris
Key Concepts:
- `.query()`: SQL-like filtering with string expressions
- `@variable_name` to reference external variables
- `.query()` can be faster for large DataFrames

Code Location: Cell 19 (search for `# TODO 8:`)
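The query patterns above, sketched on the sample data from the expected output:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol'],
    'Age': [25, 30, 28],
    'City': ['New York', 'London', 'Paris'],
})

older = df.query('Age > 25')                          # same result as df[df['Age'] > 25]
combo = df.query('Age > 25 and City == "Paris"')      # 'and'/'or' are fine inside query strings
threshold = 27
via_var = df.query('Age > @threshold')                # @ references a Python variable
print(via_var)
```

Inside a query string you use plain `and`/`or`, unlike boolean-mask filtering where `&`/`|` are required; that is a common source of confusion between the two styles.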
Run All Cells:
Data Validation:
DataFrame Inspection:
- `df.shape` to verify dimensions
- `df.dtypes` to check column types
- `df.head()` to preview data
- `df.info()` to check for missing values

Issue: "ModuleNotFoundError: No module named 'pandas'"

Solution:
# In Colab, pandas is pre-installed
# If using local Jupyter, install dependencies:
pip install -r requirements.txt
# Verify installation:
import pandas as pd
print(pd.__version__) # Should print version number
# Then restart kernel: Kernel → Restart
Issue: "KeyError: 'column_name'"

Solution:
- Print the actual column names: `print(df.columns.tolist())`
- Watch for hidden whitespace: `'Age '` vs `'Age'`
- Check you're using the right DataFrame: `df1['col']` when you meant `df2['col']`
- Verify the name against `df.columns`

Issue: "SettingWithCopyWarning: A value is trying to be set on a copy of a slice"

Solution:
- Cause: `df_subset = df[df['Age'] > 25]; df_subset['New'] = 0` (modifying a slice)
- Fix: use `.loc[]` for assignment, or take an explicit `.copy()`:
df.loc[df['Age'] > 25, 'New'] = 0
df_subset = df[df['Age'] > 25].copy()
df_subset['New'] = 0
Issue: "TypeError: 'Series' object is not callable"

Solution:
- Cause: calling a column like a function: `df.Age()` or `df['Age']()`
- Fix: access without parentheses: `df['Age']` or `df.Age`

df['Age']    # ✅ Select column
df.head() # ✅ Call method
df['Age']() # ❌ Error: Series is not callable
Issue: Memory error with large DataFrames Solution:
- Check memory use first: `df.memory_usage(deep=True).sum() / 1024**2` (MB)

# Convert to smaller data types
df['Age'] = df['Age'].astype('int8')  # If values fit in int8 (-128 to 127)
df['Category'] = df['Category'].astype('category') # For repeated strings
# Load data in chunks
for chunk in pd.read_csv('large_file.csv', chunksize=10000):
process(chunk)
# Use specific columns only
df = pd.read_csv('file.csv', usecols=['Col1', 'Col2'])
Issue: "ValueError: The truth value of a Series is ambiguous" Solution:
- Cause: using Python's `and`/`or` instead of Pandas' `&`/`|`
- Wrong: `df[(df['Age'] > 25) and (df['City'] == 'Paris')]`
- Right: `df[(df['Age'] > 25) & (df['City'] == 'Paris')]`
- Pandas boolean operators: `&` (AND), `|` (OR), `~` (NOT)
- Python keywords: `and`, `or`, `not` (for single values only)

Issue: Notebook works in Colab but fails locally

Solution:
import pandas as pd, numpy as np
print(f"Pandas: {pd.__version__}, NumPy: {np.__version__}")
pip install pandas==2.1.4 numpy==1.26.4
Inspect DataFrames at Every Step:
print(f"Shape: {df.shape}") # (rows, columns)
print(f"Columns: {df.columns}") # Column names
print(df.head()) # First 5 rows
print(df.info()) # Types and missing values
print(df.describe()) # Statistics for numeric columns
Check Data Types:
print(df.dtypes) # Column data types
# If Age is 'object' instead of 'int64', conversion needed:
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
Find Missing Values:
print(df.isna().sum()) # Count NaN per column
print(df[df['Age'].isna()]) # Show rows with NaN in 'Age'
Test Filters Incrementally:
# Test one condition at a time
mask1 = df['Age'] > 25
print(f"Mask1 true count: {mask1.sum()}")
mask2 = df['City'] == 'Paris'
print(f"Mask2 true count: {mask2.sum()}")
combined = mask1 & mask2
print(f"Combined true count: {combined.sum()}")
print(df[combined])
Use .loc[] for Safe Assignment:
# ❌ May cause SettingWithCopyWarning:
df_subset = df[df['Age'] > 25]
df_subset['New'] = 0
# ✅ Safe approach:
df.loc[df['Age'] > 25, 'New'] = 0
Completed the main activity? Try these bonus challenges:
- `.duplicated()` and `.drop_duplicates()` [+15 min]
- `.apply()` with custom functions [+30 min]
- `.melt()` (wide to long) and `.pivot()` (long to wide) [+45 min]

Your activity is complete when:
- Saved your work (`Cmd+S` / `Ctrl+S`)

Once you complete this activity, you'll have:
This activity unlocks your ability to work with real datasets! Pandas is the industry standard for data manipulation in Python - you'll use these skills in every data analysis project.
Stuck on a TODO?
- `print(df.head())`, `print(df.info())`

Still stuck after 15 minutes?
Pro Tips:
- Use `.head()` to preview DataFrames before operations
- Check `.dtypes` (many errors come from wrong types)
- Use `.loc[]` for safe assignment (avoids SettingWithCopyWarning)

Remember: Learning Pandas is a journey, not a race. It's OK to:
Happy data wrangling! 🚀📊✨
This activity follows the 65-70% implementation methodology: core DataFrame operations provided, students complete targeted manipulation exercises