:information_source: Project Overview
:bar_chart: Difficulty Level: Intermediate
⏱️ Estimated Time: 6-8 hours
:hammer_and_wrench: Skills Practiced:
Data cleaning and preprocessing with Pandas
Data analysis and grouping operations
Data visualization with Matplotlib
Report writing and presentation skills
Video communication of technical findings
This project focuses on analyzing Google Play Store data to extract meaningful insights and generate a comprehensive report. You'll work with real-world data to identify top-performing apps and categories, practicing essential data science skills along the way.
📥 Data Collection → 🧹 Data Cleaning → 📊 Analysis → 📈 Visualization → 📝 Report → 🎥 Video
(Week 1) (Week 1) (Week 1) (Week 1) (Week 2) (Week 2)
Week | Focus
L18 | Dataset Analysis, Initial Findings
L19 | Report Writing, Video Submission
Platform: Google Colab Notebook
Requirements:
Analyze Google Play Store data from the provided CSV file
Identify the top 10 most installed free apps with the highest number of reviews
Determine the top 5 categories with the highest number of installed apps
Perform necessary data cleaning and manipulation
Create visualizations where applicable to support the findings
:warning: Critical Setup Instructions
:warning: Before you start coding:
Download the googleplaystore.csv file from Kaggle
Make a copy of the provided Colab template - do NOT work in the original file
Upload the CSV file to your Colab environment
Ensure all required libraries (Pandas, NumPy, Matplotlib) are imported
Format: Google Docs (or equivalent)
Requirements:
Cover/Title: Include the title of the project
Overview/Introduction:
Background of the Google Play Store dataset
Clear definition of project objectives
Results (minimum 2 distinct results):
At least one figure showing the top 10 apps
At least one figure showing the top 5 categories
Observations and insights based on results
Supporting interpretation and analysis
Conclusion:
Summarize the findings
Offer recommendations based on the analysis
Discuss potential improvements and areas for further investigation
References: Include credible sources for data interpretation and additional context
Duration: 2-3 minutes
Requirements:
Record a short video explaining your project and report
Highlight key findings and visualizations
Provide context for the analysis, focusing on why these results matter
Demonstrate your understanding of the data and the methodology used in the project
Code Quality (25%):
Code is clean, well-organized, and correctly implements required techniques
Efficient use of functions, loops, and libraries for data manipulation
Analysis Report (30%):
Clear structure with an introduction, results, and conclusion
Proper use of visualizations to support findings
Depth of insights derived from the data
Results & Insights (20%):
Relevance and significance of results
Clear interpretation of the data
Video Submission (25%):
Clear explanation of the project and results
Well-organized, concise presentation
Demonstrates understanding of the data analysis process
Final submission should include:
Google Colab notebook(s) with all code and analysis
Complete report document
2-3 minute video explaining and showing the project
Any supplementary materials referenced
:bulb: Before Submitting
:white_check_mark: Double-check that your Colab notebook runs without errors
:white_check_mark: Ensure all visualizations are clearly labeled
:white_check_mark: Verify your report includes all required sections
:white_check_mark: Test your video for audio and visual clarity
Double-check your project before submitting it. Make sure you have included the Colab notebook with your Python code, the report, and the video explanation.
:link: Submit your project here
Complete the following project.
Google Playstore Colab File
:warning: Important Reminders
Critical Steps:
Make a copy of the Colab file before you start coding
Make sure to do the project within the copy you just made, NOT in the template file
Save your work frequently to avoid losing progress
In this project, you will analyze the Google Play Store dataset (googleplaystore.csv) to meet specific data cleaning, data analysis, and data visualization objectives.
You will perform data wrangling, find top apps and categories based on installs and reviews, and create basic graphs.
Find the Top 10 most installed free apps with the highest number of reviews.
Find the Top 5 categories with the highest number of installed apps.
Download googleplaystore.csv from Kaggle.
Upload the file into your notebook using the provided code.
Read the uploaded CSV into a Pandas DataFrame called all_app.
Perform basic checks:
View the first 5 rows.
Check .info() to understand column types and missing data.
Check column names.
13 columns are present.
Several columns are of object type even if they should be numeric.
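The loading and inspection steps above might look like the following minimal sketch. It assumes a tiny made-up CSV held in memory (`sample_csv`, with only a few of the real columns) so that it runs without the Kaggle file; in Colab you would pass the uploaded file name to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for googleplaystore.csv. The values are
# made up and only mimic the layout of the real file.
sample_csv = io.StringIO(
    "App,Category,Rating,Reviews,Installs,Type\n"
    'App A,GAME,4.5,1200,"1,000,000+",Free\n'
    'App B,TOOLS,,85,"10,000+",Paid\n'
    'App C,FAMILY,4.1,430,"500,000+",Free\n'
)

# In Colab, use the uploaded file instead: pd.read_csv("googleplaystore.csv")
all_app = pd.read_csv(sample_csv)

print(all_app.head())            # first 5 rows (here, all 3)
all_app.info()                   # column types and non-null counts
print(all_app.columns.tolist())  # column names
```

In this sample, Installs is read as text because of the commas and plus signs, which is the kind of column the later cleaning steps convert to numbers.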
:information_source: Milestone Checkpoint 1
:white_check_mark: Progress Check: You should now have:
Successfully imported the dataset
Viewed the data structure and column types
Identified columns that need type conversion
Remove columns: Size, Price, Content Rating, Genres, Last Updated, Current Ver, and Android Ver.
Save the changes using inplace=True.
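As a rough illustration of the column removal, the sketch below builds a one-row DataFrame with made-up values under the 13 column names used in this brief, then drops the seven unused ones. The variable name `unused_columns` is only illustrative.

```python
import pandas as pd

# Made-up one-row frame using the column names mentioned in this brief.
all_app = pd.DataFrame({
    "App": ["App A"], "Category": ["GAME"], "Rating": [4.5],
    "Reviews": ["1200"], "Size": ["19M"], "Installs": ["1,000,000+"],
    "Type": ["Free"], "Price": ["0"], "Content Rating": ["Everyone"],
    "Genres": ["Action"], "Last Updated": ["January 1, 2018"],
    "Current Ver": ["1.0"], "Android Ver": ["4.0 and up"],
})

unused_columns = ["Size", "Price", "Content Rating", "Genres",
                  "Last Updated", "Current Ver", "Android Ver"]

# inplace=True modifies all_app directly; the call itself returns None.
all_app.drop(columns=unused_columns, inplace=True)

print(all_app.columns.tolist())
```

Because `inplace=True` returns `None`, avoid writing `all_app = all_app.drop(..., inplace=True)`, which would replace the DataFrame with `None`.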
Check if the App column has duplicates.
Drop duplicated rows based on the App column.
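One possible way to check for and drop duplicates is sketched below on made-up rows. It keeps the first occurrence of each app name, which is the pandas default; the brief does not say which copy to keep, so treat that choice as an assumption.

```python
import pandas as pd

# Made-up rows: "App A" appears twice.
all_app = pd.DataFrame({
    "App": ["App A", "App B", "App A"],
    "Reviews": [1200, 85, 1250],
})

# Count rows whose App name already appeared earlier in the frame.
n_duplicates = all_app.duplicated(subset="App").sum()
print(n_duplicates)

# Keep the first occurrence of each App (keep="first" is the default).
all_app.drop_duplicates(subset="App", inplace=True)
print(all_app)
```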
Check missing values across all columns.
Remove rows with missing values in the Rating column.
Validate that ratings are between 1 and 5.
Special check: Remove any row where the rating value is not valid (e.g., rating = 19).
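The missing-value and rating checks could be sketched as follows, using made-up rows that include one missing rating and one out-of-range rating of 19. `Series.between(1, 5)` includes both endpoints by default.

```python
import pandas as pd

# Made-up rows: one missing rating and one invalid rating (19).
all_app = pd.DataFrame({
    "App": ["App A", "App B", "App C", "App D"],
    "Rating": [4.5, None, 19.0, 3.8],
})

# Missing values per column.
print(all_app.isnull().sum())

# Drop rows where Rating is missing.
all_app.dropna(subset=["Rating"], inplace=True)

# True for ratings from 1 to 5 inclusive.
valid_rating = all_app["Rating"].between(1, 5)

# Inspect the invalid rows before removing them.
print(all_app[~valid_rating])

all_app = all_app[valid_rating]
```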
Convert the Reviews column to integer type.
Clean the Installs column:
Remove commas (,) and plus signs (+).
Convert the cleaned values to integers.
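A minimal sketch of these conversions, on made-up values, is shown below. It passes `regex=False` so that `+` is treated as a literal character rather than a regular-expression operator.

```python
import pandas as pd

# Made-up rows with text-typed Reviews and Installs.
all_app = pd.DataFrame({
    "App": ["App A", "App B"],
    "Reviews": ["1200", "85"],
    "Installs": ["1,000,000+", "10,000+"],
})

# Reviews: plain digit strings convert directly.
all_app["Reviews"] = all_app["Reviews"].astype(int)

# Installs: strip "," and "+" first, then convert.
all_app["Installs"] = (
    all_app["Installs"]
    .str.replace(",", "", regex=False)
    .str.replace("+", "", regex=False)
    .astype(int)
)

all_app.info()
```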
:bulb: Debugging Tip
If you encounter errors when converting data types:
Check for non-numeric values using pd.to_numeric(column, errors='coerce'); values that cannot be parsed become NaN
Use .str.replace() to remove special characters before conversion
Handle NaN values appropriately before type conversion
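To make the first tip concrete, here is a small sketch that uses `pd.to_numeric` with `errors="coerce"` to locate values that would break an integer conversion. The series contents are made up for illustration.

```python
import pandas as pd

# Made-up values: "3.0M" is not a valid integer string.
reviews = pd.Series(["1200", "3.0M", "85"], name="Reviews")

# Unparseable entries become NaN instead of raising an error.
as_number = pd.to_numeric(reviews, errors="coerce")

# Show the original text of every entry that failed to parse.
bad_values = reviews[as_number.isna()]
print(bad_values)
```

Once the offending rows are identified, you can decide whether to fix or drop them before calling `.astype(int)`.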
Save the cleaned dataset as cleaned_GPS_data.csv.
Download the cleaned CSV for future use.
:information_source: Milestone Checkpoint 2
:white_check_mark: Progress Check: You should now have:
Removed unnecessary columns
Cleaned duplicate entries
Handled missing values
Converted data types correctly
Exported a clean dataset ready for analysis
Re-upload and read cleaned_GPS_data.csv into a DataFrame named cleaned_data.
View first 5 rows.
Check .info() and confirm column types.
Remove the unnecessary column Unnamed: 0.
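The Unnamed: 0 column typically appears when a DataFrame is saved with its index and then read back. The sketch below reproduces that round trip with an in-memory buffer standing in for cleaned_GPS_data.csv (an assumption made so the example runs without files) and made-up rows.

```python
import io

import pandas as pd

# Made-up cleaned data; the index has a gap, as it would after dropping rows.
cleaned = pd.DataFrame(
    {"App": ["App A", "App B"], "Installs": [1000000, 10000]},
    index=[0, 2],
)

# Stand-in for the CSV file: to_csv writes the index by default.
buffer = io.StringIO()
cleaned.to_csv(buffer)
buffer.seek(0)

cleaned_data = pd.read_csv(buffer)
print(cleaned_data.columns.tolist())  # the saved index shows up as "Unnamed: 0"

cleaned_data.drop(columns=["Unnamed: 0"], inplace=True)
cleaned_data.info()
```

Saving with `to_csv(..., index=False)` would avoid the extra column in the first place, but the steps here assume the file was saved with the default settings.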
Filter the data to Free apps only.
Sort the data in descending order by Installs and Reviews.
Display the top 10 rows.
A table of the top 10 free apps based on installs and reviews.
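A minimal sketch of this filter-and-sort step, on made-up rows, is shown below. Sorting by both columns means Reviews only breaks ties between apps with equal Installs.

```python
import pandas as pd

# Made-up rows; App A and App C tie on Installs.
cleaned_data = pd.DataFrame({
    "App": ["App A", "App B", "App C", "App D"],
    "Type": ["Free", "Paid", "Free", "Free"],
    "Installs": [1000000, 5000000, 1000000, 50000],
    "Reviews": [1200, 9000, 4300, 85],
})

# Keep free apps only.
free_apps = cleaned_data[cleaned_data["Type"] == "Free"]

# Sort by Installs first, then by Reviews, both descending.
top_free = free_apps.sort_values(
    by=["Installs", "Reviews"], ascending=False
).head(10)

print(top_free)
```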
Group the data by Category.
Sum the Installs per category.
Sort the results in descending order.
Display the top 5 categories.
A table listing the 5 most installed app categories.
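The grouping steps above could be sketched like this, again with made-up rows. The result is a Series indexed by category, named `top_categories` here purely for illustration.

```python
import pandas as pd

# Made-up rows with two GAME apps.
cleaned_data = pd.DataFrame({
    "Category": ["GAME", "TOOLS", "GAME", "FAMILY"],
    "Installs": [1000000, 10000, 500000, 50000],
})

# Total installs per category, largest first, top 5 only.
top_categories = (
    cleaned_data.groupby("Category")["Installs"]
    .sum()
    .sort_values(ascending=False)
    .head(5)
)

print(top_categories)
```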
:bulb: Common Errors & Solutions
Error: "KeyError: 'Category'"
Solution: Check column names with df.columns; there might be extra spaces
Error: "ValueError: invalid literal for int()"
Solution: Ensure all data cleaning steps were completed before analysis
Error: Empty results after filtering
Solution: Check filter conditions and ensure 'Type' column contains 'Free' values
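As a small illustration of the first and third checks, the sketch below uses a made-up DataFrame whose Category header has stray spaces, strips the column names, and lists the distinct values in Type before filtering.

```python
import pandas as pd

# Made-up frame: note the spaces around " Category ".
df = pd.DataFrame({" Category ": ["GAME", "TOOLS"], "Type": ["Free", "Paid"]})

print(df.columns.tolist())  # reveals the stray spaces

# Remove leading and trailing whitespace from every column name.
df.columns = df.columns.str.strip()

# Confirm which values the filter column actually contains.
type_values = df["Type"].unique().tolist()
print(type_values)
```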
Import the matplotlib.pyplot library.
Plot a bar chart showing:
X-axis: Categories (top 5)
Y-axis: Total Installs
Add:
Title: "The Number of Apps Installed According to Categories"
X-axis label: "Categories"
Y-axis label: "Number of Apps"
A simple bar graph displaying categories and their install counts.
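A bar chart with the title and labels listed above might be drawn as in this sketch. The category names and install totals are placeholders, not results from the dataset; in the notebook you would plot your own top 5 Series.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder totals standing in for the real top 5 categories.
top_categories = pd.Series(
    [500, 400, 300, 200, 100],
    index=["CAT_1", "CAT_2", "CAT_3", "CAT_4", "CAT_5"],
    name="Installs",
)

fig, ax = plt.subplots(figsize=(8, 5))

# One bar per category, height equal to total installs.
ax.bar(top_categories.index, top_categories.values)

ax.set_title("The Number of Apps Installed According to Categories")
ax.set_xlabel("Categories")
ax.set_ylabel("Number of Apps")

plt.show()
```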
:information_source: Milestone Checkpoint 3
:white_check_mark: Progress Check: You've completed the coding portion! You should now have:
Identified top 10 free apps with most installs and reviews
Found top 5 categories by install count
Created a visualization showing category distribution
All results ready for your report
Use the results from "P4: Google Play Store (Part 2)" to write this report.
Make a copy of the "Report Template" and name it "[Your name] Google Play Store Analysis Report" in your Colab Notebooks folder.
The report is divided into 4 components:
Component 1: Cover
Component 2: Overview
Component 3: Results
Component 4: Conclusion
Cover of the report consists of:
Logo: Already in template.
Title: Already in template.
Author information: Fill in your name and class slot.
The overview consists of 2 paragraphs:
Background: Describe where you got the data and how you cleaned it.
Objectives: State what you want to study in this report (refer to your results).
This report consists of 2 or more results:
Result 1: Top 10 most installed free apps with the highest number of reviews.
Result 2: Top 5 categories with the highest number of installed apps.
Each result consists of:
Table/Graph: Screenshot any useful tables and graphs from the Google Play Store (Part 2) project.
Observations: Write the observations that you can see in the table/graph.
Reasoning: Explain the likely reason for what you observed.
Support: Cite sources that support your reasoning.
:bulb: Report Writing Tips
Use specific numbers from your analysis (e.g., "Facebook has 974 million installs")
Connect your findings to real-world trends
Cite credible sources to support your reasoning
Keep observations objective and data-driven
The conclusion consists of 2 paragraphs:
Summary: Briefly summarize the results.
Plan for next action: Describe what action should be taken based on the findings in this report.
Write the title and link of each reference, e.g.:
Google Play, https://en.wikipedia.org/wiki/Google_Play
Make 1 copy of the Report Template
Move the copy to your Colab Notebooks in Google Drive
The sample answer is for REFERENCE only. Write the report in your own words. Do NOT copy the sample answer.
Report Sample Answer
Download googleplaystore.csv from Kaggle before you start coding. Refer to the video for instructions on downloading the CSV file.
Code with AI: Try using AI to assist with various aspects of the Google Play Store data analysis project.
Data Loading and Exploration:
"Write Python code using Pandas to load the Google Play Store dataset from a CSV file named 'googleplaystore.csv'."
"How do I display the first 5 rows of the Google Play Store DataFrame using Pandas?"
"Generate Python code to get summary statistics (mean, median, etc.) for the 'Rating' column in the Google Play Store dataset."
Data Cleaning and Preprocessing:
"Write Python code to identify and remove any missing values in the 'Rating' column of the Google Play Store dataset."
"How do I convert the 'Installs' column to a numeric data type in Pandas, removing characters like '+' and ','?"
"Generate Python code to remove duplicate entries from the Google Play Store dataset based on the 'App' column."
Data Analysis and Manipulation:
"Write Python code to find the top 10 most installed free apps in the Google Play Store dataset."
"How can I filter the dataset to only include apps in the 'FAMILY' category?"
"Generate Python code to calculate the average rating for each category in the Google Play Store dataset."
Data Visualization:
"Create a bar chart in Matplotlib showing the number of apps in each category in the Google Play Store dataset."
"Generate Python code to create a scatter plot showing the relationship between 'Rating' and 'Reviews' in the Google Play Store dataset."
"How can I customize the appearance (labels, titles, colors) of a Matplotlib chart?"
Code with AI: Try using AI to generate insights and refine your report.
Generating Insights:
"Based on this data [DataFrame or summary statistics], what are some key trends or patterns in the Google Play Store market?"
"What are the most popular app categories based on the number of installs?"
"Generate a summary of the key findings from the Google Play Store data analysis."
Refining the Report:
"Suggest ways to improve the clarity and conciseness of this paragraph: [paste paragraph text]."
"Generate a concluding paragraph for my Google Play Store analysis report."
"Proofread and edit the following text for grammar and style: [paste text]."
Exploring Further Questions:
"What other factors might influence app ratings or installs, besides those analyzed in this project?"
"Suggest further research questions related to the Google Play Store dataset."
Ready to take your analysis further? Try these advanced challenges:
Create a scatter plot showing the relationship between app ratings and number of reviews
Build a heatmap showing install counts across different categories and content ratings
Design an interactive dashboard using Plotly, or a set of polished static charts using Seaborn
Calculate correlation coefficients between different variables
Perform hypothesis testing on app ratings across categories
Implement regression analysis to predict app success
Build a classification model to predict if an app will be successful
Create clusters of similar apps using K-means clustering
Develop a recommendation system for apps
Analyze seasonal trends in app updates
Compare free vs paid app performance
Identify market gaps and opportunities for new apps
:memo: Pro Tip
Document your extension work in a separate notebook section. This demonstrates initiative and advanced skills that can set your project apart!