Apply your knowledge to build something amazing!
:information_source: Project Overview Difficulty Level: Intermediate
Estimated Time: 2-3 hours
Skills Practiced:
- Data exploration and cleaning
- Feature scaling and preparation
- Linear regression modeling
- Model evaluation and testing
- Python programming with pandas, numpy, and scikit-learn
In this exciting project, you'll become a social media data scientist! :rocket: You will apply your knowledge of Data Preparation and Regression to build a regression model that predicts the amount of engagement an Instagram post has gained.
graph LR
A[Phase 1: Setup & Import] --> B[Phase 2: Data Exploration]
B --> C[Phase 3: Data Preparation]
C --> D[Phase 4: Model Building]
D --> E[Bonus Challenges]
style A fill:#e1f5fe
style B fill:#fff9c4
style C fill:#f3e5f5
style D fill:#e8f5e9
style E fill:#ffe0b2
Before coding, make sure to create a new Google Colab notebook named "P1_InstagramReachPrediction.ipynb" and write your code in it.
Instagram is one of the most popular social media applications today. People use Instagram professionally to promote their businesses, build portfolios, blog, and create various kinds of content. For these people, it is important to know how well their Instagram posts are doing.
One way to measure how successful their Instagram posts are is through the amount of interactions / reach the posts have gained. These interactions can come in the form of:
- Likes
- Comments
- Shares
- Saves
- Profile Visits
- Follows
Based on these values, we can generate a number known as engagement rate to serve as an all-in-one measure of the reach the posts have gained.
The dataset for this project was collected by a data scientist named Aman Kharwal for Instagram reach prediction purposes. It contains information about 99 Instagram posts as well as their engagement rates.
Likes | Comments | Shares | Saves | Profile Visits | Follows | Engagement |
---|---|---|---|---|---|---|
162.0 | 9.0 | 5.0 | 98.0 | 35.0 | 2.0 | 3920.0 |
224.0 | 7.0 | 14.0 | 194.0 | 48.0 | 10.0 | 5394.0 |
131.0 | 11.0 | 1.0 | 41.0 | 62.0 | 12.0 | 4021.0 |
To learn more about the various metrics for measuring an Instagram post's success, you may go through this article.
:bulb: Before You Begin Make sure you have:
- A Google account to use Google Colab
- Basic understanding of Python programming
- Completed lessons on Data Preparation and Regression
Make a copy of the template file found here and rename the copied file as "P1: Instagram Reach Analysis.ipynb".
Run the code here to download the dataset into your Colab environment. Wait for the download to complete before proceeding to the next step.
# Download file
!wget --no-check-certificate 'https://drive.google.com/uc?export=download&id=185l3XDtStnSPdtvenICqoB_--Yqbf7SG' -O 'Instagram.csv'
:warning: Common Issue
If the download fails, check your internet connection and try again. The file should be named 'Instagram.csv' in your Colab environment.
You may start coding here:
Phase 1: Import Dependencies & Data Reading
:information_source: Milestone Checkpoint 1 By the end of this phase, you should have:
- Imported numpy, pandas, matplotlib, seaborn, and wordcloud
- Loaded "Instagram.csv" into instagram_data
Import the necessary libraries to complete the project. These include numpy, pandas, and matplotlib:
# Import basic libraries for data manipulation and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Run the next set of imports to ensure that all the necessary libraries are installed.
# Import additional visualization libraries
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
:bulb: Best Practice Always import libraries at the beginning of your notebook. This makes it easy to see all dependencies at a glance.
Read the CSV file "Instagram.csv" and save it as instagram_data. When reading the file, set the encoding to latin1. (Remarks: You may copy the following code.)
# Load the Instagram dataset with proper encoding
instagram_data = pd.read_csv("Instagram.csv", encoding='latin1')
:warning: Encoding Alert
The `encoding='latin1'` parameter is crucial here! Without it, you might get encoding errors because the dataset contains special characters.
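If you're ever unsure which encoding a CSV file uses, a common pattern is to try UTF-8 first and fall back to latin1. A minimal sketch:
# Try UTF-8 first; fall back to latin1 if the file contains non-UTF-8 bytes
try:
    instagram_data = pd.read_csv("Instagram.csv", encoding='utf-8')
except UnicodeDecodeError:
    instagram_data = pd.read_csv("Instagram.csv", encoding='latin1')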
Phase 2: Explore the Data
Time to become a data detective! :mag: Let's explore what's in our Instagram dataset.
Please refer to Chapter 5: Data Preparation to complete the following steps.
:information_source: Milestone Checkpoint 2 By the end of this phase, you should have:
- Inspected the first rows and overall structure of the data
- Removed any missing values from instagram_data
- Visualized the most common words and hashtags
- Checked how each feature correlates with Engagement
Print the first 5 rows of instagram_data.
# Display the first 5 rows to understand the data structure
instagram_data.head()
Print more information about instagram_data.
# Get detailed information about the dataset
instagram_data.info()
Check how many missing values are in the instagram_data.
# Check for missing values in each column
instagram_data.isnull().sum()
If there are any missing values in the dataset, remove them.
# Remove rows with missing values
instagram_data = instagram_data.dropna()
:warning: Data Cleaning Alert
Always check your data size before and after removing missing values. You don't want to accidentally delete too much data!
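One way to follow this advice is to record the row count around the dropna() call. A sketch of how the cleaning step above could be written instead:
# Compare dataset size before and after dropping missing values
rows_before = instagram_data.shape[0]
instagram_data = instagram_data.dropna()
rows_after = instagram_data.shape[0]
print(f"Removed {rows_before - rows_after} row(s) with missing values")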
Run the following code to show the most commonly used words in the Instagram posts given in the dataset:
# Most commonly used words in these Instagram posts
text = " ".join(i for i in instagram_data.Caption)
stopwords = set(STOPWORDS)
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)
plt.style.use('classic')
plt.figure(figsize=(12,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
:bulb: Understanding Word Clouds The bigger the word appears, the more frequently it's used in Instagram captions. This helps identify popular topics!
Run the following code to show the most commonly used hashtags in the Instagram posts given in the dataset.
# Most commonly used hashtags in the Instagram posts
text = " ".join(i for i in instagram_data.Hashtags)
stopwords = set(STOPWORDS)
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)
plt.figure(figsize=(12,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Run the following code to find the relationship between the features and the engagement rate of the Instagram posts.
# Find the relationship between engagement and the other features in the dataset
# numeric_only=True skips text columns like Caption and Hashtags (required in newer pandas)
correlation = instagram_data.corr(numeric_only=True)
print(correlation["Engagement"].sort_values(ascending=False))
:bulb: Correlation Insights Values close to 1 mean strong positive correlation (when one goes up, the other goes up too). Values close to -1 mean strong negative correlation. Values near 0 mean little to no correlation.
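Since seaborn was imported earlier, you can optionally visualize the whole correlation matrix as a heatmap. A short sketch:
# Optional: visualize the correlation matrix as a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Feature Correlation Matrix")
plt.show()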
Phase 3: Data Preparation
Now let's prepare our data for machine learning! This is a crucial step that can make or break your model. :dart:
Please refer to Chapter 5: Data Preparation to complete the following steps.
:information_source: Milestone Checkpoint 3 By the end of this phase, you should have:
- Separated the features (x) from the labels (y)
- Scaled the features to the 0-1 range with MinMaxScaler
- Split the data into training and test sets
Run the following code to generate the dataset x and its labels y.
# Generate the dataset (x) and its labels (y)
# x contains the features we'll use to predict engagement
# y contains the engagement values we want to predict
x = np.array(instagram_data[['Likes', 'Comments', 'Shares', 'Profile Visits', 'Follows']])
y = np.array(instagram_data['Engagement'])
:bulb: Understanding Features vs Labels
- Features (x): The information we use to make predictions (likes, comments, etc.)
- Labels (y): What we're trying to predict (engagement rate)
Type in the following code here:
# Import the MinMaxScaler from sklearn
from sklearn.preprocessing import MinMaxScaler
# Create a scaler object
scaler = MinMaxScaler()
# Fit the scaler to our data (learns the min and max values)
scaler.fit(x)
# Transform the data to scale it between 0 and 1
x_scaled = scaler.transform(x)
:warning: Why Scale Data?
Machine learning algorithms work better when all features are on the same scale. Without scaling, features with larger values (like Likes) might dominate features with smaller values (like Comments).
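For intuition, MinMaxScaler (with its default 0-1 range) applies x_scaled = (x - min) / (max - min) to each column. You can verify this with plain numpy; a quick sketch, assuming x and x_scaled are the arrays defined above:
# Reproduce MinMaxScaler's formula manually and compare
manual_scaled = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
print(np.allclose(manual_scaled, x_scaled))  # Should print True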
Print the values of x_scaled.
# Check that our data is now scaled between 0 and 1
print("Scaled data sample:")
print(x_scaled[:5])  # Show first 5 rows
Split the dataset x_scaled and its labels y into training and test sets. Set the test size to be 0.33 and the random state to be 42.
# Import train_test_split function
from sklearn.model_selection import train_test_split

# Split the data: 67% for training, 33% for testing
x_train, x_test, y_train, y_test = train_test_split(
    x_scaled, y, test_size=0.33, random_state=42
)
:bulb: Random State Explained Setting `random_state=42` ensures everyone gets the same random split. It's like setting a seed for reproducibility!
Use the shape attribute to check your answer.
print("Dataset shapes:")
print(f"x_train shape: {x_train.shape}")
print(f"x_test shape: {x_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")
Expected output:
x_train shape: (66, 5)
x_test shape: (33, 5)
y_train shape: (66,)
y_test shape: (33,)
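Because every run uses random_state=42, the split is reproducible. If you want to convince yourself, you can split a second time and compare; a quick sketch:
# Splitting again with the same random_state yields an identical split
a_train, a_test, b_train, b_test = train_test_split(
    x_scaled, y, test_size=0.33, random_state=42
)
print(np.array_equal(a_train, x_train))  # Should print True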
:warning: Debugging Tip If your shapes don't match, check:
- Did you remove missing values in Phase 2?
- Did you use the correct test_size (0.33)?
- Did you scale the data before splitting?
Phase 4: Instagram Reach Prediction Model
Time to build your AI model! This is where the magic happens. :muscle:
Please refer to Chapter 6: Regression to complete the following steps.
:information_source: Milestone Checkpoint 4 By the end of this phase, you should have:
- Trained a LinearRegression model on the training data
- Evaluated its score on the test data
- Inspected the model's coefficients and intercept
- Tested the model on a custom post
Create and evaluate the Instagram Reach Prediction Model with the following code. (Remarks: Type in your code for this step here.)
# Import LinearRegression from sklearn
from sklearn.linear_model import LinearRegression
# Create a linear regression model
model = LinearRegression()
# Train the model with our training data
model.fit(x_train, y_train)
# Evaluate the model's performance on test data (score() returns the R² value)
accuracy = model.score(x_test, y_test)
print(f"Model Accuracy: {accuracy:.4f}")
Expected output:
Model Accuracy: 0.8461
:bulb: Understanding Model Accuracy For regression, score() returns the R² value. An R² of 0.8461 means our model explains about 84.61% of the variation in engagement. That's pretty good! In real-world projects, anything above 80% is often considered successful.
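For reference, score() on a regression model is the R² (coefficient of determination), so you can reproduce the same number with sklearn's r2_score; a quick sketch:
# score() for regressors is the R² value; verify it explicitly
from sklearn.metrics import r2_score
print(r2_score(y_test, model.predict(x_test)))  # Should match model.score(x_test, y_test)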
Retrieve and print the gradient / slope of the model.
# Get the coefficients (slopes) for each feature
print("Model coefficients:")
print(model.coef_)
Expected output:
[115.67681478 1898.72802154 -394.46756815 -495.96781512 748.19781634]
:bulb: Interpreting Coefficients Each coefficient tells us how much engagement changes when that feature increases by 1 unit:
- Positive values = feature increases engagement
- Negative values = feature decreases engagement
- Larger absolute values = stronger impact
Note: because we scaled the features to the 0-1 range, "1 unit" here means going from a feature's minimum to its maximum value.
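To make these easier to read, you can pair each coefficient with its feature name (the same five columns used to build x); a short sketch:
# Pair each coefficient with its feature name for readability
feature_names = ['Likes', 'Comments', 'Shares', 'Profile Visits', 'Follows']
for name, coef in zip(feature_names, model.coef_):
    print(f"{name}: {coef:.2f}")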
Retrieve and print the y-intercept of the model.
# Get the y-intercept (base engagement when all features are 0)
print(f"Model intercept: {model.intercept_}")
Expected output:
Model intercept: 2358.9711885315516
Test the model with the custom Instagram post data shown below:
:warning: Data Order Alert
Make sure your features are in the correct order: [Likes, Comments, Shares, Profile Visits, Follows]
# Create test data for a new Instagram post
# Note: Order must match our training data features
features = np.array([[282.0, 4.0, 9.0, 165.0, 54.0]])

# Scale the features using our fitted scaler
features_scaled = scaler.transform(features)

# Make a prediction
predicted_engagement = model.predict(features_scaled)
print(f"Predicted engagement: {predicted_engagement[0]:.2f}")
The output should be around 9300 - 9500.
:bulb: Real-World Application You've just built a tool that Instagram influencers could use to predict how well their posts will perform! :tada:
:star2: Extension Challenges
Ready to level up? Here are some bonus challenges to push your skills further!
Advanced Challenge 1: User Interface
Create an interactive tool that anyone can use!
:information_source: Challenge Goal Build a user-friendly interface that allows anyone to predict their Instagram post engagement without knowing how to code.
Prompt the users to key in the following information about their Instagram posts.
# Create an interactive engagement predictor
print("=== Instagram Engagement Predictor ===")
print("Enter your post statistics below:\n")
# Collect user inputs with validation
try:
likes = float(input("Number of Likes: "))
comments = float(input("Number of Comments: "))
shares = float(input("Number of Shares: "))
profile_visits = float(input("Profile Visits from this post: "))
follows = float(input("New Follows from this post: "))
except ValueError:
print("Please enter valid numbers!")
:warning: Input Validation
Always validate user inputs! Real users might enter text instead of numbers, so handle errors gracefully.
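If you want the program to recover instead of stopping, one common pattern is to re-prompt until a valid number is entered. A minimal sketch, using a hypothetical ask_number helper (not part of the original template):
# Hypothetical helper: re-prompt until the user enters a valid number
def ask_number(prompt):
    while True:
        try:
            return float(input(prompt))
        except ValueError:
            print("Please enter a valid number!")

# Example usage for one of the inputs above
likes = ask_number("Number of Likes: ")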
Save all the user input as a numpy array input_data.
# Save all user inputs into a single numpy array
# Note: 'Saves' is not collected because the model was not trained on it
input_data = np.array([[likes, comments, shares, profile_visits, follows]])
Scale the data and make predictions.
# Scale the input data
input_data_scaled = scaler.transform(input_data)

# Make prediction
predicted_engagement = model.predict(input_data_scaled)

# Display result in a user-friendly way
print(f"\nPredicted Engagement: {predicted_engagement[0]:.0f}")
print("Engagement Level: ", end="")
if predicted_engagement[0] > 10000:
    print("Viral potential!")
elif predicted_engagement[0] > 5000:
    print("Great engagement!")
else:
    print("Keep creating!")
Advanced Challenge 2: Model Evaluation
Let's dive deeper into understanding how well our model performs!
:information_source: Challenge Goal Learn to evaluate your model using different metrics to understand its strengths and weaknesses.
Find the mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE) of the model:
# Import metrics from scikit-learn
from sklearn import metrics
# Make predictions on test data
y_pred = model.predict(x_test)
# Calculate different error metrics
mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
# Display results with explanations
print("Model Evaluation Metrics:")
print(f"MAE (Mean Absolute Error): {mae:.2f}")
print(f"MSE (Mean Squared Error): {mse:.2f}")
print(f"RMSE (Root Mean Squared Error): {rmse:.2f}")
Expected output (approximately):
MAE (Mean Absolute Error): 741.67
MSE (Mean Squared Error): 1171979.70
RMSE (Root Mean Squared Error): 1082.59
:bulb: Understanding Error Metrics
- MAE: Average prediction error (in engagement units)
- MSE: Penalizes large errors more heavily
- RMSE: Same units as engagement, easier to interpret
Lower values = better model performance!
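These metrics are simple enough to compute by hand with numpy, which can help demystify what scikit-learn is doing; a quick sketch:
# Compute the same metrics manually with numpy
errors = y_test - y_pred
print("MAE: ", np.mean(np.abs(errors)))
print("MSE: ", np.mean(errors ** 2))
print("RMSE:", np.sqrt(np.mean(errors ** 2)))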
Having trouble? Here are common issues and solutions:
Import Errors
# If you get "No module named 'sklearn'"
!pip install scikit-learn
Shape Mismatch Errors
Re-run Phase 2 and Phase 3 in order: make sure missing values were removed and that x uses exactly the five features listed above.
Low Model Accuracy
Check that you scaled the features with MinMaxScaler and split with test_size=0.33 and random_state=42 so your results match the expected output.
Congratulations! You've successfully built an Instagram Engagement Predictor using machine learning. You've learned:
- How to explore and clean a real-world dataset
- How to scale features with MinMaxScaler
- How to train and evaluate a linear regression model
- How to use a trained model to make predictions on new posts