Apply your knowledge to build something amazing!
:information_source: Project Overview
Difficulty Level: Intermediate
Estimated Time: 2-3 hours
Skills Practiced:
- K-Means clustering algorithm
- Data exploration and visualization
- Finding optimal cluster numbers
- Customer segmentation analysis
- Python libraries (NumPy, Pandas, Matplotlib, Scikit-learn)
In this project, you will apply your knowledge of the machine learning process and clustering to build a model that groups customers based on selected features. We have gone through all of the steps shown below; try to code them out yourself.
graph LR
A[Phase 1: Setup & Import] --> B[Phase 2: Data Exploration]
B --> C[Phase 3: Build K-Means Model]
C --> D[Phase 4: Visualize Clusters]
D --> E[Phase 5: Find Optimal K]
E --> F[Challenges: Apply Learning]
style A fill:#e8f4f8
style B fill:#d4e8f4
style C fill:#c0dcf0
style D fill:#acd0ec
style E fill:#98c4e8
style F fill:#84b8e4
You own a supermarket mall and, through membership cards, you have some basic data about your customers: Customer ID, gender, age, annual income, and spending score.
Spending Score is a value you assign to each customer based on parameters you define, such as customer behavior and purchasing data.
As the owner, you want to understand your customers. One way to do that is to group them according to their characteristics and purchasing behavior. This information will be given to your marketing team to plan their marketing strategies.
:bulb: Setup Success Tip Before starting, ensure you have a stable internet connection for downloading the dataset and accessing Google Colab. If the download fails, try refreshing the page and running the download cell again.
Make a copy of the template file found here and rename the copy "P3: MallCustomerSegmentation".
Run the code below to download the dataset into your Colab notebook. Wait for the download to complete before proceeding to the next step.
# Download the dataset from Google Drive
import gdown
gdown.download('https://drive.google.com/uc?id=1BfvB7FuGVA5KbEnbINV5QmJOTGkF2YzE', 'Mall_Customers.csv', quiet=False)
You may start to do your coding here.
:white_check_mark: Success
Milestone Checkpoint 1
- :white_check_mark: Google Colab file created and renamed
- :white_check_mark: Dataset downloaded successfully
- :white_check_mark: Ready to start coding
:warning: Common Import Error If you encounter "ModuleNotFoundError", it means the library isn't installed. In Google Colab, most libraries come pre-installed. If needed, install missing packages with:
!pip install library_name
# Import statements with explanations
import numpy as np # For numerical operations
import pandas as pd # For data manipulation
import matplotlib.pyplot as plt # For creating visualizations
Run the imports above to confirm that all the necessary libraries are available.
Read the csv file "Mall_Customers.csv" and save it as customer_data.
# Read the customer data
customer_data = pd.read_csv('Mall_Customers.csv')
:bulb: Best Practice Always check if your data loaded correctly by printing the first few rows immediately after reading the file!
Great job loading the data! Now let's explore it to understand what we're working with. :mag:
Check the first 5 rows of customer_data.
# Display the first 5 rows to understand the data structure
customer_data.head()
Expected output:
CustomerID Gender Age Annual Income (k$) Spending Score (1-100)
0 1 Male 19 15 39
1 2 Male 21 15 81
2 3 Female 20 16 6
3 4 Female 23 16 77
4 5 Female 31 17 40
Check the length and shape of customer_data.
# Check dimensions of our dataset
print(f"Shape: {customer_data.shape}")
print(f"Length: {len(customer_data)}")
Expected output:
Shape: (200, 5)
Length: 200
Check the summary information of customer_data.
# Get detailed information about the dataset
customer_data.info()
Expected output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CustomerID 200 non-null int64
1 Gender 200 non-null object
2 Age 200 non-null int64
3 Annual Income (k$) 200 non-null int64
4 Spending Score (1-100) 200 non-null int64
dtypes: int64(4), object(1)
memory usage: 7.9+ KB
Check how many missing values there are in customer_data.
# Check for missing values - important for data quality!
customer_data.isnull().sum()
Expected output:
CustomerID 0
Gender 0
Age 0
Annual Income (k$) 0
Spending Score (1-100) 0
dtype: int64
:white_check_mark: Success
Milestone Checkpoint 2
- :white_check_mark: Data loaded successfully
- :white_check_mark: No missing values found
- :white_check_mark: Dataset has 200 customers with 5 features
- :white_check_mark: Ready for clustering analysis
Now for the exciting part - let's build our clustering model! :rocket:
:warning: Common Pitfall Remember to use .values when extracting data from a pandas DataFrame for sklearn. This converts the data to a NumPy array, which sklearn expects.
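As a quick illustration of the difference, you can run the short check below (the variable name subset is just for this example):
# Column selection returns a DataFrame; .values gives the NumPy array sklearn expects
subset = customer_data[['Age', 'Spending Score (1-100)']]
print(type(subset))         # <class 'pandas.core.frame.DataFrame'>
print(type(subset.values))  # <class 'numpy.ndarray'>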
Train a model to cluster customers according to their age and spending score.
Obtain the "Age" and "Spending Score (1-100)" columns from customer_data. (You may run the following code.)
# Obtain the age & spending score of the customers
X = customer_data[['Age', 'Spending Score (1-100)']].values
Import KMeans from sklearn.
# Import KMeans from sklearn's cluster module
from sklearn.cluster import KMeans
Create a KMeans clustering model named model, with n_clusters = 5 and random_state = 42.
# Create the KMeans model
# n_clusters: number of groups we want to create
# random_state: ensures reproducible results
model = KMeans(n_clusters=5, random_state=42)
Fit model with dataset X.
# Train the model on our data
model.fit(X)
Find the inertia of the model.
# Inertia: sum of squared distances to closest cluster center
# Lower inertia = tighter clusters
print(f"Inertia of Model with K=5: {model.inertia_}")
Expected output:
Inertia of Model with K=5: 23838.24882164186
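If you want to see where this number comes from, here is an optional sanity check (a small sketch) that recomputes inertia from the fitted centroids and labels:
# Recompute inertia by hand: the squared distance from each point to its
# assigned centroid, summed over all points, should match model.inertia_
manual_inertia = ((X - model.cluster_centers_[model.labels_]) ** 2).sum()
print(manual_inertia)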
Find the centroids of the model.
# Centroids are the center points of each cluster
print(model.cluster_centers_)
Expected output:
array([[58.44444444, 50.52777778],
[41.48484848, 37. ],
[30.1754386 , 82.35087719],
[25.4 , 52.68571429],
[43.28205128, 11.84615385]])
Using the model, identify which cluster each of the following customers belongs to:
# Predict cluster for new customers
# Format: [[age, spending_score]]
customers = [[20, 42], [65, 81], [44, 100], [59, 23]]
predictions = model.predict(customers)
for i, (age, score) in enumerate(customers):
    print(f"Customer {chr(65+i)} (Age: {age}, Spending Score: {score}) -> Cluster: {predictions[i]}")
Expected clusters:
:bulb: Understanding Clusters Each cluster represents a different customer segment. For example:
- Cluster with high age & high spending = Premium older customers
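To put names on your own clusters, a small sketch like the one below prints each centroid in plain terms (the segment wording is up to you; these numbers come from the fitted model):
# Describe each centroid: average age and spending score per cluster
for i, (age, score) in enumerate(model.cluster_centers_):
    print(f"Cluster {i}: average age ~{age:.0f}, average spending score ~{score:.0f}")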
Time to visualize our clusters! A picture is worth a thousand data points. :bar_chart:
:warning: Visualization Pitfall Make sure to use different colors for each cluster and include a legend. Without proper labeling, your graph won't be interpretable!
Obtain the predicted labels of X and store them as labels.
# Get cluster labels for each data point
labels = model.labels_
Filter data into different clusters according to their cluster numbers, from cluster 0 to cluster 4.
# Create figure with good size
plt.figure(figsize=(10, 8))
# Define colors for each cluster
colors = ['red', 'blue', 'green', 'orange', 'purple']
# Plot each cluster
for i in range(5):
    # Filter data points belonging to cluster i
    cluster_data = X[labels == i]
    plt.scatter(cluster_data[:, 0], cluster_data[:, 1],
                c=colors[i], label=f'Cluster {i}', s=50, alpha=0.7)
Plot the clustering graph with a title, x-label, y-label, and legend, as shown below.
# Add cluster centers
plt.scatter(model.cluster_centers_[:, 0], model.cluster_centers_[:, 1],
            c='black', marker='x', s=200, linewidths=3, label='Centroids')
# Add labels and title
plt.xlabel('Age', fontsize=12)
plt.ylabel('Spending Score (1-100)', fontsize=12)
plt.title('Customer Segmentation - Age vs Spending Score', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
:white_check_mark: Success
Milestone Checkpoint 3
- :white_check_mark: Model trained successfully
- :white_check_mark: Clusters identified
- :white_check_mark: Visualization created
- :white_check_mark: Customer segments are now visible!
How do we know we chose the right number of clusters? Let's use the Elbow Method! :bulb:
:bulb: The Elbow Method The "elbow" is where the line starts to flatten out. This indicates diminishing returns - adding more clusters doesn't significantly improve the model.
Declare a list to store the inertia values called list_of_inertia.
# Initialize empty list for storing inertia values
list_of_inertia = []
Looping through K values from 1 to 10, build up the list of inertia values by completing the following steps:
# Test different numbers of clusters
for k in range(1, 11):
    # Create model with k clusters
    model = KMeans(n_clusters=k, random_state=42)
    # Fit the model
    model.fit(X)
    # Store the inertia value
    list_of_inertia.append(model.inertia_)
    # Optional: print progress
    print(f"K={k}, Inertia={model.inertia_:.2f}")
Plot the graph of inertia against cluster numbers, k=1 to k=10.
# Create the elbow plot
plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), list_of_inertia, 'bo-', linewidth=2, markersize=8)
plt.xlabel('Number of Clusters (K)', fontsize=12)
plt.ylabel('Inertia', fontsize=12)
plt.title('Elbow Method For Optimal K', fontsize=14)
plt.grid(True, alpha=0.3)
# Add annotations for better understanding
for i, inertia in enumerate(list_of_inertia[:5]):
    plt.annotate(f'{inertia:.0f}', (i+1, inertia), textcoords="offset points",
                 xytext=(0,10), ha='center')
plt.show()
Based on the graph, identify the ideal number of clusters. (No code needed.)
:information_source: Finding the Elbow Look for the point where the curve starts to flatten. In this case, it appears to be around K=3 or K=4. This is where we get the best balance between model complexity and performance.
:white_check_mark: Success
Milestone Checkpoint 4
- :white_check_mark: Completed all phases successfully
- :white_check_mark: Found optimal number of clusters
- :white_check_mark: Ready for challenges!
Using the K value identified in Phase 5, generate another clustering model based on the customers' age and spending score. You may follow the steps below.
:bulb: Challenge Approach Think about what you learned from the elbow method. The optimal K is likely 3 or 4. Try both and see which gives better-defined clusters!
You may start your code here.
# Your code here:
# 1. Create new model with optimal K
# 2. Fit the model
# 3. Get labels
# 4. Create visualization
Expected graph:
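If you want to compare notes afterwards, here is one possible sketch. It assumes the elbow suggested K=4 and uses new variable names (model_opt, labels_opt); swap in the K you actually identified.
# One possible solution sketch; K=4 is an assumption based on the elbow plot
optimal_k = 4
model_opt = KMeans(n_clusters=optimal_k, random_state=42)
labels_opt = model_opt.fit_predict(X)

plt.figure(figsize=(10, 8))
for i in range(optimal_k):
    cluster_data = X[labels_opt == i]
    plt.scatter(cluster_data[:, 0], cluster_data[:, 1], label=f'Cluster {i}', s=50, alpha=0.7)
plt.scatter(model_opt.cluster_centers_[:, 0], model_opt.cluster_centers_[:, 1],
            c='black', marker='x', s=200, label='Centroids')
plt.xlabel('Age')
plt.ylabel('Spending Score (1-100)')
plt.title(f'Customer Segmentation with K={optimal_k}')
plt.legend()
plt.show()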
Generate a clustering model based on the customers' annual income (in k$) and spending score. You may follow the steps below:
:warning: Different Features Alert Notice we're using different features now - Annual Income instead of Age. This will reveal different customer patterns!
You may start your code here.
# Hint: Start by extracting the right features
# X_income = customer_data[['Annual Income (k$)', 'Spending Score (1-100)']].values
:bulb: Business Insight This analysis helps identify:
- High income, high spenders (VIP customers)
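If you get stuck on this challenge, here is a minimal sketch. The K=5 below is an assumption (run your own elbow loop on X_income to confirm it), and the variable names are illustrative.
# A possible sketch for income vs spending clustering; verify K yourself first
X_income = customer_data[['Annual Income (k$)', 'Spending Score (1-100)']].values
model_income = KMeans(n_clusters=5, random_state=42)
labels_income = model_income.fit_predict(X_income)

plt.figure(figsize=(10, 8))
for i in range(5):
    pts = X_income[labels_income == i]
    plt.scatter(pts[:, 0], pts[:, 1], label=f'Cluster {i}', s=50, alpha=0.7)
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Customer Segmentation - Income vs Spending Score')
plt.legend()
plt.show()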
The Kneedle algorithm is designed to find the knee and elbow points of a curve. A Python implementation is available in the kneed package. Using kneed, identify the ideal K value for the model.
:information_source: Advanced Challenge This challenge introduces an automated way to find the optimal K value. The kneed library mathematically identifies the "elbow point" for you!
You may refer to this Kaggle project to complete the challenge. You can start your code here.
# First, install the kneed package if needed:
# !pip install kneed
# Then import and use:
# from kneed import KneeLocator
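# A minimal usage sketch (assumes list_of_inertia from Phase 5 is still in scope):
from kneed import KneeLocator

kl = KneeLocator(range(1, 11), list_of_inertia,
                 curve='convex', direction='decreasing')
print(f"Optimal K according to Kneedle: {kl.elbow}")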
If you run into problems, here are some common issues and fixes:
- Import Error: "No module named 'sklearn'". Run !pip install scikit-learn in a cell.
- Shape Mismatch Error when predicting clusters: use print(X.shape) to verify dimensions.
- Empty Cluster Warning (some clusters have no points): try a smaller K or a different random_state.
- Memory Error (dataset too large): work with a sample, e.g. customer_data.sample(n=100).
The quick checks below also help diagnose these issues:
# Check data shape
print(f"Data shape: {X.shape}")
print(f"Number of features: {X.shape[1]}")
# Verify model is fitted
print(f"Model fitted: {hasattr(model, 'labels_')}")
# Check cluster distribution
unique, counts = np.unique(labels, return_counts=True)
print(f"Cluster distribution: {dict(zip(unique, counts))}")
Ready to go beyond? Try these advanced challenges:
Create a 3D cluster visualization using Age, Annual Income, and Spending Score:
# Hint: Use matplotlib's 3D plotting
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
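# A possible continuation (sketch): cluster on all three features and color by label.
# The K=5 below is an assumption; check it with the elbow method first.
X3 = customer_data[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']].values
model3d = KMeans(n_clusters=5, random_state=42)
labels3d = model3d.fit_predict(X3)
ax.scatter(X3[:, 0], X3[:, 1], X3[:, 2], c=labels3d, cmap='viridis', s=50)
ax.set_xlabel('Age')
ax.set_ylabel('Annual Income (k$)')
ax.set_zlabel('Spending Score (1-100)')
plt.show()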
For each cluster, create a detailed profile:
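A minimal profiling sketch to get you started (it assumes the labels3d array from the 3D example above; substitute whichever cluster labels you want to profile):
# Average feature values per cluster
profile = customer_data.assign(Cluster=labels3d).groupby('Cluster')[
    ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']].mean()
print(profile)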
Evaluate clustering quality using silhouette score:
from sklearn.metrics import silhouette_score
# Calculate silhouette score for different K values
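# A minimal sketch (the K range 2-10 is an assumption; silhouette needs K >= 2)
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=42)
    cluster_labels = km.fit_predict(X)
    print(f"K={k}, Silhouette Score={silhouette_score(X, cluster_labels):.3f}")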
Write a function that takes a cluster number and returns marketing strategies:
def marketing_strategy(cluster_num):
    # Return personalized marketing recommendations
    pass
Create a function that instantly assigns new customers to clusters:
def assign_customer_segment(age, income, spending_score):
    # Predict and return cluster with description
    pass
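# A minimal sketch of one possible implementation. It assumes a model
# (here called model3d) fitted on [Age, Annual Income, Spending Score];
# both the name and the K behind it are assumptions, not a given solution.
def assign_customer_segment(age, income, spending_score):
    cluster = int(model3d.predict([[age, income, spending_score]])[0])
    return f"This customer belongs to cluster {cluster}"

print(assign_customer_segment(30, 60, 70))  # example usage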
:bulb: Project Completion Checklist
- All phases completed with expected outputs matching
- Visualizations are clear and labeled
Remember: Customer segmentation is a powerful tool for businesses. You're learning skills that real data scientists use every day! Keep experimenting and don't be afraid to try different approaches.
You've got this! :muscle: