Practice and reinforce the concepts from Lesson 8
:computer: Activity Type: Hands-on coding exercise ⏱️ Total Time: 45-60 minutes
Access the exercise notebook here: KMeans Clustering Colab Notebook
:bulb: Before you begin: Make sure you're signed in to your Google account to save a copy of the notebook!
Exercise Overview
Part One: KMeans with a Simple Dataset
⏱️ Time: 15-20 minutes
In this part, you will practice basic KMeans clustering on a small dataset.
Step-by-step instructions:
- Import the necessary libraries
  - numpy for data handling
  - sklearn.cluster for KMeans
  - matplotlib.pyplot for visualization
- Create and explore the sample dataset
  - Review the provided data points
  - Understand the 2D structure
- Apply KMeans clustering
  - Initialize KMeans with n_clusters=2
  - Set random_state=42 for reproducibility
  - Fit the model to your data
- Evaluate the model
  - Check the cluster centroids
  - Calculate the inertia (sum of squared distances)
- Make predictions
  - Predict the cluster for point [10, 5]
  - Understand which cluster it belongs to
- Visualize the results
  - Plot the data points colored by cluster
  - Mark the centroids clearly
  - Add the predicted point

:bulb: Visualization tip: Use different colors for each cluster and mark centroids with a special symbol (like 'X' or '*') to make them stand out!
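The Part One steps can be sketched roughly as follows. This is a minimal illustration, not the notebook's reference solution: the six data points are hypothetical stand-ins for the ones the notebook provides, and n_init=10 is an added assumption to keep the behavior consistent across scikit-learn versions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Hypothetical stand-in data; the notebook supplies its own 2D points.
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Initialize and fit KMeans with the parameters named in the exercise.
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans.fit(X)

# Evaluate: centroids and inertia (sum of squared distances of each
# point to its assigned centroid).
centroids = kmeans.cluster_centers_
inertia = kmeans.inertia_
print("Centroids:\n", centroids)
print("Inertia:", inertia)

# Predict the cluster for the point [10, 5]; predict() expects a 2D array.
new_point = np.array([[10, 5]])
predicted = int(kmeans.predict(new_point)[0])
print("Cluster for [10, 5]:", predicted)

# Visualize: points colored by cluster, centroids as 'X', new point as '*'.
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap="viridis", label="data")
plt.scatter(centroids[:, 0], centroids[:, 1], c="red", marker="X", s=200,
            label="centroids")
plt.scatter(new_point[:, 0], new_point[:, 1], c="black", marker="*", s=200,
            label="predicted point")
plt.legend()
plt.title("KMeans on a small 2D dataset")
plt.show()
```

Cluster numbers (0 or 1) are arbitrary labels, so compare the predicted label with the labels of nearby points rather than expecting a particular number.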
Part Two: KMeans with a Real-World Dataset
⏱️ Time: 25-30 minutes
Now you'll work with a real-world dataset to see how clustering performs on complex data.
Step-by-step instructions:
- Download and load the dataset
  - Use gdown to download Live.csv
- Explore the dataset
- Select features for clustering
- Apply KMeans clustering
  - Use n_clusters=4
  - Set random_state=42 for consistency
- Analyze the results
- Make a prediction
  - Predict the cluster for the sample [529, 200, 80, 430, 100, 20, 50, 6, 3]
- Filter and analyze by cluster
- Visualize the clusters
:bulb: Data tip: If you have many features, consider using the first two principal components for visualization, or choose the two most meaningful features for your analysis.
:warning: Common challenge: Make sure your feature array matches the order of features used during training when making predictions!
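One way to keep the feature order consistent is to build the new sample as a one-row DataFrame with the same column list used for training. The sketch below uses synthetic random data and placeholder column names (feature_1 to feature_9) in place of Live.csv, whose actual columns are not listed here; substitute the nine features you selected. n_init=10 is an added assumption.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the real dataset (assumption: nine numeric features).
rng = np.random.default_rng(42)
feature_cols = [f"feature_{i}" for i in range(1, 10)]  # placeholder names
df = pd.DataFrame(rng.integers(0, 600, size=(200, 9)), columns=feature_cols)

# Select the features and fit KMeans with the exercise's parameters.
X = df[feature_cols]
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
df["cluster"] = kmeans.fit_predict(X)

# Build the new sample with the same columns, in the same order as training.
sample_values = [529, 200, 80, 430, 100, 20, 50, 6, 3]
new_sample = pd.DataFrame([sample_values], columns=feature_cols)
predicted_cluster = int(kmeans.predict(new_sample)[0])
print("Predicted cluster:", predicted_cluster)

# Filter the rows assigned to that cluster and summarize them.
same_cluster = df[df["cluster"] == predicted_cluster]
print("Rows in that cluster:", len(same_cluster))
print(same_cluster[feature_cols].mean())
```

Reusing a single feature_cols list for both training and prediction means the order cannot drift between the two steps.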
Part Three: Advanced Challenges (Optional)
⏱️ Time: 10-15 minutes
For those who finish early or want extra practice:
- Optimal cluster selection
  - Implement the elbow method
  - Plot inertia vs. number of clusters
  - Find the optimal n_clusters
- Dimensionality reduction
  - Apply PCA before clustering
  - Compare results with and without PCA
  - Visualize in the reduced space
- Alternative clustering
  - Try different initialization methods
  - Experiment with MiniBatchKMeans for larger datasets

:bulb: Elbow method hint: Look for the "elbow" point where adding more clusters doesn't significantly reduce inertia.
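The three challenges above might be approached along these lines. This is a sketch under stated assumptions: the data comes from make_blobs (four synthetic blobs with nine features) rather than the exercise dataset, the range of k values and batch_size=100 are arbitrary choices, and the adjusted Rand index is just one possible way to compare two clusterings.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in data (assumption): 500 samples, 9 features, 4 blobs.
X, _ = make_blobs(n_samples=500, centers=4, n_features=9, random_state=42)

# Elbow method: fit KMeans for several k and record the inertia of each fit.
k_values = list(range(1, 9))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X)
    inertias.append(model.inertia_)

plt.figure()
plt.plot(k_values, inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow method")
plt.show()

# PCA before clustering: reduce to 2 components, then cluster both versions.
X_2d = PCA(n_components=2, random_state=42).fit_transform(X)
labels_full = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X)
labels_pca = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X_2d)

# Compare the two clusterings; 1.0 means identical groupings.
agreement = adjusted_rand_score(labels_full, labels_pca)
print("Agreement with vs. without PCA (ARI):", agreement)

# Visualize the clusters in the reduced space.
plt.figure()
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels_pca, cmap="viridis")
plt.xlabel("Principal component 1")
plt.ylabel("Principal component 2")
plt.title("Clusters in PCA space")
plt.show()

# MiniBatchKMeans: fits on small random batches, useful for larger datasets.
mbk = MiniBatchKMeans(n_clusters=4, random_state=42, n_init=10,
                      batch_size=100).fit(X)
print("MiniBatchKMeans inertia:", mbk.inertia_)
```

Because this synthetic data is generated from four blobs, the bend in the inertia curve would be expected near k=4; on real data the elbow is often less sharp and calls for judgment.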
:warning: Important: Save your completed notebook before submitting!
Please submit your work through this link: Exercise Submission Form
- Complete all required sections
- Verify your code
- Use consistent parameters
  - random_state=42
  - n_clusters values as specified in each part
- Submit your work
:information_source: Submission deadline: Check with your instructor for the due date. Late submissions may not receive full credit.
Common issues and solutions:
- Import errors
  - Run !pip install scikit-learn matplotlib pandas if needed
- Data loading issues
  - Run !pip install gdown
- Visualization problems
  - Call plt.figure() before plotting
  - Call plt.show() to display plots
- Prediction errors
  - Pass the point as a 2D array, e.g. [[10, 5]]
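The prediction error usually comes from passing a flat list where scikit-learn expects a 2D array of shape (n_samples, n_features). A small sketch, using hypothetical data points and an assumed n_init=10, shows the failure and two equivalent fixes:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical training data; any 2D dataset behaves the same way.
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X)

point = [10, 5]

# A 1D input is rejected with a ValueError.
try:
    kmeans.predict(point)
    raised = False
except ValueError as err:
    raised = True
    print("1D input failed:", type(err).__name__)

# Fix 1: wrap the point in an outer list to get shape (1, 2).
label_a = int(kmeans.predict([point])[0])

# Fix 2: reshape a NumPy array to one row and as many columns as needed.
label_b = int(kmeans.predict(np.array(point).reshape(1, -1))[0])

print("Predicted cluster:", label_a, label_b)
```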
:bulb: Need help? Post questions in the course discussion forum or attend office hours!