Apply your knowledge to build something amazing!
:information_source: Project Overview
Difficulty Level: Intermediate
Estimated Time: 2-3 hours
Skills Practiced:
- K-Means clustering algorithm
- Data exploration and visualization
- Finding optimal cluster numbers
- Customer segmentation analysis
- Python libraries (NumPy, Pandas, Matplotlib, Scikit-learn)
In this project, you will apply your knowledge of the machine learning process and clustering to build a model that groups customers based on selected features. We have gone through all of the steps shown below; try to code them out yourself.
graph LR
A[Phase 1: Setup & Import] --> B[Phase 2: Data Exploration]
B --> C[Phase 3: Build K-Means Model]
C --> D[Phase 4: Visualize Clusters]
D --> E[Phase 5: Find Optimal K]
E --> F[Challenges: Apply Learning]
style A fill:#e8f4f8
style B fill:#d4e8f4
style C fill:#c0dcf0
style D fill:#acd0ec
style E fill:#98c4e8
style F fill:#84b8e4
You own a supermarket mall and, through membership cards, you have some basic data about your customers: Customer ID, gender, age, annual income, and spending score.
Spending Score is a value you assign to each customer based on parameters you define, such as customer behavior and purchasing data.
As the owner, you want to understand your customers. One way to do that is to group them according to their characteristics and purchasing behavior. This information will be given to your marketing team to plan their marketing strategies.
:bulb: Setup Success Tip Before starting, ensure you have a stable internet connection for downloading the dataset and accessing Google Colab. If the download fails, try refreshing the page and running the download cell again.
Make a copy of the template file found here and rename the copy "P3: MallCustomerSegmentation".
Run the code below to download the dataset into your Colab notebook. Wait for the download to complete before proceeding to the next step.
# Download the dataset from Google Drive
import gdown
gdown.download('https://drive.google.com/uc?id=1BfvB7FuGVA5KbEnbINV5QmJOTGkF2YzE', 'Mall_Customers.csv', quiet=False)
You may start to do your coding here.
:white_check_mark: Success
Milestone Checkpoint 1
- :white_check_mark: Google Colab file created and renamed
- :white_check_mark: Dataset downloaded successfully
- :white_check_mark: Ready to start coding
:warning: Common Import Error If you encounter "ModuleNotFoundError", it means the library isn't installed. In Google Colab, most libraries come pre-installed. If needed, install missing packages with:
!pip install library_name
# Import statements with explanations
import numpy as np # For numerical operations
import pandas as pd # For data manipulation
import matplotlib.pyplot as plt # For creating visualizations
Run the imports above to confirm that all the necessary libraries are available.
Read the csv file "Mall_Customers.csv" and save it as customer_data.
# Read the customer data
customer_data = pd.read_csv('Mall_Customers.csv')
:bulb: Best Practice Always check if your data loaded correctly by printing the first few rows immediately after reading the file!
Great job loading the data! Now let's explore it to understand what we're working with. :mag:
Check the first 5 rows of customer_data.
# Display the first 5 rows to understand the data structure
customer_data.head()
Expected output:
CustomerID Gender Age Annual Income (k$) Spending Score (1-100)
0 1 Male 19 15 39
1 2 Male 21 15 81
2 3 Female 20 16 6
3 4 Female 23 16 77
4 5 Female 31 17 40
Check the length and shape of customer_data.
# Check dimensions of our dataset
print(f"Shape: {customer_data.shape}")
print(f"Length: {len(customer_data)}")
Expected output:
Shape: (200, 5)
Length: 200
Check the summary information of customer_data.
# Get detailed information about the dataset
customer_data.info()
Expected output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CustomerID 200 non-null int64
1 Gender 200 non-null object
2 Age 200 non-null int64
3 Annual Income (k$) 200 non-null int64
4 Spending Score (1-100) 200 non-null int64
dtypes: int64(4), object(1)
memory usage: 7.9+ KB
Check how many missing values there are in customer_data.
# Check for missing values - important for data quality!
customer_data.isnull().sum()
Expected output:
CustomerID 0
Gender 0
Age 0
Annual Income (k$) 0
Spending Score (1-100) 0
dtype: int64
:white_check_mark: Success
Milestone Checkpoint 2
- :white_check_mark: Data loaded successfully
- :white_check_mark: No missing values found
- :white_check_mark: Dataset has 200 customers with 5 features
- :white_check_mark: Ready for clustering analysis
Now for the exciting part - let's build our clustering model! :rocket:
:warning: Common Pitfall Remember to use .values when extracting data from a pandas DataFrame for sklearn. This converts the data to a NumPy array, which sklearn expects.
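As a quick illustration of the difference, you can run the short check below (the variable name subset is just for this example):
# Column selection returns a DataFrame; .values gives the NumPy array sklearn expects
subset = customer_data[['Age', 'Spending Score (1-100)']]
print(type(subset))         # <class 'pandas.core.frame.DataFrame'>
print(type(subset.values))  # <class 'numpy.ndarray'>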
Train a model to cluster customers according to their age and spending score.
Obtain the "Age" and "Spending Score (1-100)" columns from customer_data. (You may run the following code.)
# Obtain the age & spending score of the customers
X = customer_data[['Age', 'Spending Score (1-100)']].values
Import KMeans from sklearn.
# Import KMeans from sklearn's cluster module
from sklearn.cluster import KMeans
Create a KMeans clustering model named model, with n_clusters = 5 and random_state = 42.
# Create the KMeans model
# n_clusters: number of groups we want to create
# random_state: ensures reproducible results
model = KMeans(n_clusters=5, random_state=42)
Fit model with dataset X.
# Train the model on our data
model.fit(X)
Find the inertia of the model.
# Inertia: sum of squared distances to closest cluster center
# Lower inertia = tighter clusters
print(f"Inertia of Model with K=5: {model.inertia_}")
Expected output:
Inertia of Model with K=5: 23838.24882164186
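If you want to see where this number comes from, here is an optional sanity check (a small sketch) that recomputes inertia from the fitted centroids and labels:
# Recompute inertia by hand: the squared distance from each point to its
# assigned centroid, summed over all points, should match model.inertia_
manual_inertia = ((X - model.cluster_centers_[model.labels_]) ** 2).sum()
print(manual_inertia)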
Find the centroids of the model.
# Centroids are the center points of each cluster
print(model.cluster_centers_)
Expected output:
array([[58.44444444, 50.52777778],
[41.48484848, 37. ],
[30.1754386 , 82.35087719],
[25.4 , 52.68571429],
[43.28205128, 11.84615385]])
Using the model, identify which cluster each of the following customers belongs to:
# Predict cluster for new customers
# Format: [[age, spending_score]]
customers = [[20, 42], [65, 81], [44, 100], [59, 23]]
predictions = model.predict(customers)
for i, (age, score) in enumerate(customers):
    print(f"Customer {chr(65+i)} (Age: {age}, Spending Score: {score}) -> Cluster: {predictions[i]}")
Expected clusters:
:bulb: Understanding Clusters Each cluster represents a different customer segment. For example:
- Cluster with high age & high spending = Premium older customers
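To put names on your own clusters, a small sketch like the one below prints each centroid in plain terms (the segment wording is up to you; these numbers come from the fitted model):
# Describe each centroid: average age and spending score per cluster
for i, (age, score) in enumerate(model.cluster_centers_):
    print(f"Cluster {i}: average age ~{age:.0f}, average spending score ~{score:.0f}")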
Time to visualize our clusters! A picture is worth a thousand data points. :bar_chart:
:warning: Visualization Pitfall Make sure to use different colors for each cluster and include a legend. Without proper labeling, your graph won't be interpretable!
Obtain the predicted labels of X and store them as labels.
# Get cluster labels for each data point
labels = model.labels_
Filter data into different clusters according to their cluster numbers, from cluster 0 to cluster 4.
# Create figure with good size
plt.figure(figsize=(10, 8))
# Define colors for each cluster
colors = ['red', 'blue', 'green', 'orange', 'purple']
# Plot each cluster
for i in range(5):
    # Filter data points belonging to cluster i
    cluster_data = X[labels == i]
    plt.scatter(cluster_data[:, 0], cluster_data[:, 1],
                c=colors[i], label=f'Cluster {i}', s=50, alpha=0.7)
Plot the clustering graph with a title, x-label, y-label, and legend, as shown below.
# Add cluster centers
plt.scatter(model.cluster_centers_[:, 0], model.cluster_centers_[:, 1],
            c='black', marker='x', s=200, linewidths=3, label='Centroids')
# Add labels and title
plt.xlabel('Age', fontsize=12)
plt.ylabel('Spending Score (1-100)', fontsize=12)
plt.title('Customer Segmentation - Age vs Spending Score', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
:white_check_mark: Success
Milestone Checkpoint 3
- :white_check_mark: Model trained successfully
- :white_check_mark: Clusters identified
- :white_check_mark: Visualization created
- :white_check_mark: Customer segments are now visible!
How do we know we chose the right number of clusters? Let's use the Elbow Method! :bulb:
:bulb: The Elbow Method The "elbow" is where the line starts to flatten out. This indicates diminishing returns - adding more clusters doesn't significantly improve the model.
Declare a list to store the inertia values called list_of_inertia.
# Initialize empty list for storing inertia values
list_of_inertia = []
Looping through K values from 1 to 10, build up the list of inertia values by completing the following steps:
# Test different numbers of clusters
for k in range(1, 11):
    # Create model with k clusters
    model = KMeans(n_clusters=k, random_state=42)
    # Fit the model
    model.fit(X)
    # Store the inertia value
    list_of_inertia.append(model.inertia_)
    # Optional: print progress
    print(f"K={k}, Inertia={model.inertia_:.2f}")
Plot the graph of inertia against cluster numbers, k=1 to k=10.
# Create the elbow plot
plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), list_of_inertia, 'bo-', linewidth=2, markersize=8)
plt.xlabel('Number of Clusters (K)', fontsize=12)
plt.ylabel('Inertia', fontsize=12)
plt.title('Elbow Method For Optimal K', fontsize=14)
plt.grid(True, alpha=0.3)
# Add annotations for better understanding
for i, inertia in enumerate(list_of_inertia[:5]):
    plt.annotate(f'{inertia:.0f}', (i+1, inertia), textcoords="offset points",
                 xytext=(0,10), ha='center')
plt.show()
Based on the graph, identify the ideal number of clusters. (No code needed.)
:information_source: Finding the Elbow Look for the point where the curve starts to flatten. In this case, it appears to be around K=3 or K=4. This is where we get the best balance between model complexity and performance.
:white_check_mark: Success
Milestone Checkpoint 4
- :white_check_mark: Completed all phases successfully
- :white_check_mark: Found optimal number of clusters
- :white_check_mark: Ready for challenges!
Using the K value identified in Phase 5, generate another clustering model based on the customers' age and spending score. You may follow the steps below.
:bulb: Challenge Approach Think about what you learned from the elbow method. The optimal K is likely 3 or 4. Try both and see which gives better-defined clusters!
You may start your code here.
# Your code here:
# 1. Create new model with optimal K
# 2. Fit the model
# 3. Get labels
# 4. Create visualization
Expected graph:
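If you want to compare notes afterwards, here is one possible sketch. It assumes the elbow suggested K=4 and uses new variable names (model_opt, labels_opt); swap in the K you actually identified.
# One possible solution sketch; K=4 is an assumption based on the elbow plot
optimal_k = 4
model_opt = KMeans(n_clusters=optimal_k, random_state=42)
labels_opt = model_opt.fit_predict(X)

plt.figure(figsize=(10, 8))
for i in range(optimal_k):
    cluster_data = X[labels_opt == i]
    plt.scatter(cluster_data[:, 0], cluster_data[:, 1], label=f'Cluster {i}', s=50, alpha=0.7)
plt.scatter(model_opt.cluster_centers_[:, 0], model_opt.cluster_centers_[:, 1],
            c='black', marker='x', s=200, label='Centroids')
plt.xlabel('Age')
plt.ylabel('Spending Score (1-100)')
plt.title(f'Customer Segmentation with K={optimal_k}')
plt.legend()
plt.show()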
Generate a clustering model based on the customers' annual income (in k$) and spending score. You may follow the steps below:
:warning: Different Features Alert Notice we're using different features now - Annual Income instead of Age. This will reveal different customer patterns!
You may start your code here.
# Hint: Start by extracting the right features
# X_income = customer_data[['Annual Income (k$)', 'Spending Score (1-100)']].values
:bulb: Business Insight This analysis helps identify:
- High income, high spenders (VIP customers)
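If you get stuck on this challenge, here is a minimal sketch. The K=5 below is an assumption (run your own elbow loop on X_income to confirm it), and the variable names are illustrative.
# A possible sketch for income vs spending clustering; verify K yourself first
X_income = customer_data[['Annual Income (k$)', 'Spending Score (1-100)']].values
model_income = KMeans(n_clusters=5, random_state=42)
labels_income = model_income.fit_predict(X_income)

plt.figure(figsize=(10, 8))
for i in range(5):
    pts = X_income[labels_income == i]
    plt.scatter(pts[:, 0], pts[:, 1], label=f'Cluster {i}', s=50, alpha=0.7)
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Customer Segmentation - Income vs Spending Score')
plt.legend()
plt.show()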
The Kneedle algorithm is designed to find the knee and elbow points of a curve. A Python implementation is available in the kneed package. Using kneed, identify the ideal K value for the model.
:information_source: Advanced Challenge This challenge introduces an automated way to find the optimal K value. The kneed library mathematically identifies the "elbow point" for you!
You may refer to this Kaggle project to complete the challenge. You can start your code here.
# First, install the kneed package if needed:
# !pip install kneed
# Then import and use:
# from kneed import KneeLocator
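# A minimal usage sketch (assumes list_of_inertia from Phase 5 is still in scope):
from kneed import KneeLocator

kl = KneeLocator(range(1, 11), list_of_inertia,
                 curve='convex', direction='decreasing')
print(f"Optimal K according to Kneedle: {kl.elbow}")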
If you run into problems, here are some common issues and fixes:
- Import Error: "No module named 'sklearn'". Run !pip install scikit-learn in a cell.
- Shape Mismatch Error when predicting clusters: use print(X.shape) to verify dimensions.
- Empty Cluster Warning (some clusters have no points): try a smaller K or a different random_state.
- Memory Error (dataset too large): work with a sample, e.g. customer_data.sample(n=100).
The quick checks below also help diagnose these issues:
# Check data shape
print(f"Data shape: {X.shape}")
print(f"Number of features: {X.shape[1]}")
# Verify model is fitted
print(f"Model fitted: {hasattr(model, 'labels_')}")
# Check cluster distribution
unique, counts = np.unique(labels, return_counts=True)
print(f"Cluster distribution: {dict(zip(unique, counts))}")
Ready to go beyond? Try these advanced challenges:
Create a 3D cluster visualization using Age, Annual Income, and Spending Score:
# Hint: Use matplotlib's 3D plotting
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
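# A possible continuation (sketch): cluster on all three features and color by label.
# The K=5 below is an assumption; check it with the elbow method first.
X3 = customer_data[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']].values
model3d = KMeans(n_clusters=5, random_state=42)
labels3d = model3d.fit_predict(X3)
ax.scatter(X3[:, 0], X3[:, 1], X3[:, 2], c=labels3d, cmap='viridis', s=50)
ax.set_xlabel('Age')
ax.set_ylabel('Annual Income (k$)')
ax.set_zlabel('Spending Score (1-100)')
plt.show()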
For each cluster, create a detailed profile:
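A minimal profiling sketch to get you started (it assumes the labels3d array from the 3D example above; substitute whichever cluster labels you want to profile):
# Average feature values per cluster
profile = customer_data.assign(Cluster=labels3d).groupby('Cluster')[
    ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']].mean()
print(profile)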
Evaluate clustering quality using silhouette score:
from sklearn.metrics import silhouette_score
# Calculate silhouette score for different K values
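# A minimal sketch (the K range 2-10 is an assumption; silhouette needs K >= 2)
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=42)
    cluster_labels = km.fit_predict(X)
    print(f"K={k}, Silhouette Score={silhouette_score(X, cluster_labels):.3f}")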
Write a function that takes a cluster number and returns marketing strategies:
def marketing_strategy(cluster_num):
    # Return personalized marketing recommendations
    pass
Create a function that instantly assigns new customers to clusters:
def assign_customer_segment(age, income, spending_score):
    # Predict and return cluster with description
    pass
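# A minimal sketch of one possible implementation. It assumes a model
# (here called model3d) fitted on [Age, Annual Income, Spending Score];
# both the name and the K behind it are assumptions, not a given solution.
def assign_customer_segment(age, income, spending_score):
    cluster = int(model3d.predict([[age, income, spending_score]])[0])
    return f"This customer belongs to cluster {cluster}"

print(assign_customer_segment(30, 60, 70))  # example usage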
:bulb: Project Completion Checklist
- All phases completed with expected outputs matching
- Visualizations are clear and labeled
Remember: Customer segmentation is a powerful tool for businesses. You're learning skills that real data scientists use every day! Keep experimenting and don't be afraid to try different approaches.
You've got this! :muscle: