Apply your knowledge to build something amazing!
:information_source: Project Overview
Difficulty Level: Intermediate
Estimated Time: 3-4 hours
Skills Practiced:
- Data cleaning and preprocessing
- Feature engineering with LabelEncoder
- Data scaling with MinMaxScaler
- Classification algorithms (KNN, Decision Tree, Naive Bayes)
- Model evaluation and comparison
- Python programming with pandas, numpy, and scikit-learn
In this project, you will apply what you have learned about the machine learning process and classification to build a model that predicts whether a passenger survived the Titanic disaster, based on a few selected features. We have gone through all of the steps shown below; try to code them out manually.
On April 15, 1912, the Titanic, one of the largest passenger liners of its time, sank after colliding with an iceberg during its maiden voyage. 1,502 of the 2,224 passengers and crew on board did not survive.
This dataset contains information about 891 different passengers that were onboard the ship during the incident.
The following information about the passengers is in the dataset: PassengerId, Survived (the label, present only in the labelled dataset), Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked.
Using what you have learnt, try to build an ML model that can predict whether a passenger survived the incident based on the above information.
```mermaid
graph TD
A[Phase 1: Import & Load Data] --> B[Phase 2: Explore Data]
B --> C[Phase 3: Clean Data]
C --> D[Phase 4: Preprocess Data]
D --> E[Phase 5: Split Data]
E --> F[Phase 6: Build Models]
F --> G[Phase 7: Evaluate Models]
G --> H[Advanced Challenges]
style A fill:#e1f5fe
style B fill:#b3e5fc
style C fill:#81d4fa
style D fill:#4fc3f7
style E fill:#29b6f6
style F fill:#03a9f4
style G fill:#039be5
style H fill:#0288d1
```
Make a copy of the template file found here and rename the copy to "P2: TitanicSurvivorsClassification".
Run the code here to download the dataset into your Colab notebook. Wait for the download to complete before proceeding to the next step.
```python
# Download the Titanic dataset
!wget https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv -O titanic_data.csv
!wget https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic_test.csv -O titanic_test.csv
```
You may start to do your coding here.
:bulb: Best Practice Always import all required libraries at the beginning of your notebook. This makes it easier to track dependencies and ensures your code runs smoothly from top to bottom.
Import the necessary libraries to complete the project. These include pandas, numpy, seaborn, and the relevant scikit-learn modules.
Run the next set of imports to ensure that all the necessary libraries are available.
```python
import seaborn as sns
```
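Beyond seaborn, a complete import cell for this project might look like the following sketch (the scikit-learn classes listed match the encoders, scaler, and models used later in this project):

```python
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
```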
Read the csv file "titanic_data.csv" and save it as labelled_dataset.
Read the csv file "titanic_test.csv" and save it as non_labelled_dataset.
:warning: Common Pitfall Make sure both CSV files have been downloaded successfully before trying to read them. If you get a "File not found" error, re-run the download cell above.
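A minimal sketch of the loading step, assuming the download cell above saved the files under the names shown:

```python
# Load the labelled training data and the unlabelled test data
labelled_dataset = pd.read_csv('titanic_data.csv')
non_labelled_dataset = pd.read_csv('titanic_test.csv')
```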
Before proceeding to Phase 2, ensure that labelled_dataset and non_labelled_dataset both contain data.
Please complete the following steps by referring to Chapter 5: Data Preparation.
:bulb: Understanding Your Data Data exploration is crucial! Take time to understand what each column represents and identify patterns. This will help you make better decisions during data cleaning and feature selection.
Inspect the shape of labelled_dataset and non_labelled_dataset.
Check the first 5 rows of labelled_dataset.
Check the first 5 rows of non_labelled_dataset.
Check the summary information of labelled_dataset.
Check the summary information of non_labelled_dataset.
Check how many missing values there are in labelled_dataset.
Check how many missing values there are in non_labelled_dataset.
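If you get stuck, the exploration steps above map onto standard pandas calls; a sketch:

```python
# Shape: (rows, columns) of each dataset
print(labelled_dataset.shape)
print(non_labelled_dataset.shape)

# First 5 rows of each dataset
print(labelled_dataset.head())
print(non_labelled_dataset.head())

# Column types and non-null counts
labelled_dataset.info()
non_labelled_dataset.info()

# Number of missing values per column
print(labelled_dataset.isnull().sum())
print(non_labelled_dataset.isnull().sum())
```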
After exploring the data, you should have discovered:
Please complete the following steps by referring to Chapter 5: Data Preparation.
:warning: Important Decision We're dropping the 'Cabin' column because it has too many missing values (>75%). In real projects, always consider whether missing data might contain valuable patterns before dropping it!
Drop the column 'Cabin' from both labelled_dataset and non_labelled_dataset.
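A one-line sketch per dataset (note that the column is capitalised as 'Cabin' in the standard Titanic CSV):

```python
# Drop the 'Cabin' column; it is missing for most passengers
labelled_dataset = labelled_dataset.drop(columns=['Cabin'])
non_labelled_dataset = non_labelled_dataset.drop(columns=['Cabin'])
```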
To fill in the missing values in the 'Age' column:
Run the following cell to generate the mean age of passengers for each of the 4 honorifics.
Define a function that automatically fills in any missing passenger ages according to their honorifics. (Remark: run the code cell shown below.)
Run the following cell to fill in the missing ages in labelled_dataset and non_labelled_dataset.
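The provided cells are not reproduced here, but one possible implementation of this pattern looks like the sketch below. The four honorifics ('Mr.', 'Mrs.', 'Miss.', 'Master.') are assumed from the standard Titanic Name format:

```python
honorifics = ['Mr.', 'Mrs.', 'Miss.', 'Master.']

def mean_age_for(df, honorific):
    # Mean age of passengers whose name contains this honorific
    return df.loc[df['Name'].str.contains(honorific, regex=False), 'Age'].mean()

# Mean age per honorific, computed from the labelled data
mean_ages = {h: mean_age_for(labelled_dataset, h) for h in honorifics}

def fill_missing_ages(df):
    # Fill each missing age with the mean age for that passenger's honorific
    for honorific, mean_age in mean_ages.items():
        mask = df['Name'].str.contains(honorific, regex=False) & df['Age'].isnull()
        df.loc[mask, 'Age'] = mean_age
    return df

labelled_dataset = fill_missing_ages(labelled_dataset)
non_labelled_dataset = fill_missing_ages(non_labelled_dataset)
```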
Drop the rows with missing values in the "Fare" and "Embarked" columns from both labelled_dataset and non_labelled_dataset.
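A sketch using dropna() with the subset argument:

```python
# Drop rows where 'Fare' or 'Embarked' is still missing
labelled_dataset = labelled_dataset.dropna(subset=['Fare', 'Embarked'])
non_labelled_dataset = non_labelled_dataset.dropna(subset=['Fare', 'Embarked'])
```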
Finally, do a final check on how many missing values remain in labelled_dataset and non_labelled_dataset. If data cleaning was done properly, there should be no missing values left in either dataset.
:bulb: Debugging Tip If you still have missing values after this step, check:
- Did you apply the age-filling function to both datasets?
- Did you use `dropna()` on the correct columns?
- Try using `.isnull().sum()` to see which columns still have missing data
Before moving to preprocessing:
Please complete the following steps by referring to Chapter 5: Data Preparation.
:information_source: Why Preprocessing? Machine learning algorithms work with numbers, not text. We need to:
- Encode categorical data (Sex, Embarked) into numbers
- Scale numerical features so they have similar ranges
This ensures all features contribute equally to the model's predictions.
In the cell provided, encode the columns "Sex" and "Embarked" by following the steps stated below:
```python
# Print out the encoded classes along with each feature
print(f"Feature: {feature} Encoded Classes: {le.classes_}")
```
:warning: Common Encoding Error Make sure to fit the encoder on the labelled dataset first, then transform both datasets. Never fit the encoder separately on each dataset - this could result in different encodings!
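One way to follow that advice, fitting each encoder on the labelled data and then transforming both datasets:

```python
from sklearn.preprocessing import LabelEncoder

for feature in ['Sex', 'Embarked']:
    le = LabelEncoder()
    # Fit on the labelled data only, then apply the same mapping to both datasets
    le.fit(labelled_dataset[feature])
    labelled_dataset[feature] = le.transform(labelled_dataset[feature])
    non_labelled_dataset[feature] = le.transform(non_labelled_dataset[feature])
    # Print out the encoded classes along with the feature name
    print(f"Feature: {feature} Encoded Classes: {le.classes_}")
```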
Check to see whether the features have been properly encoded by printing out the first 5 rows of data in labelled_dataset.
In the cell provided, scale the columns ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Embarked'], which are stored in the list features_to_be_scaled:
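A sketch with MinMaxScaler, fitted on the labelled data and applied to both datasets (mirroring the encoding advice above; the provided cell may differ):

```python
from sklearn.preprocessing import MinMaxScaler

features_to_be_scaled = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Embarked']

scaler = MinMaxScaler()
# Fit on the labelled data, then rescale the same columns in both datasets
labelled_dataset[features_to_be_scaled] = scaler.fit_transform(labelled_dataset[features_to_be_scaled])
non_labelled_dataset[features_to_be_scaled] = scaler.transform(non_labelled_dataset[features_to_be_scaled])
```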
To check that the features have been properly scaled, print out the first 5 rows of data from labelled_dataset.
After preprocessing, verify that:
Please complete the following steps by referring to Chapter 5: Data Preparation.
Run the following cell to obtain the features (x) and labels (y) from labelled_dataset; a sketch of what it might contain appears after the note below.
:bulb: Understanding Train-Test Split The code automatically splits your data into training (80%) and testing (20%) sets. This allows you to:
- Train models on one portion of data
- Test their performance on unseen data
- Get a realistic estimate of how well they'll perform in the real world
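The provided cell likely resembles this sketch (random_state is an assumption added for reproducibility):

```python
from sklearn.model_selection import train_test_split

# Features (x) and labels (y) from the labelled data
x = labelled_dataset[features_to_be_scaled]
y = labelled_dataset['Survived']

# 80/20 train-test split; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
```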
Please complete the following steps by referring to Chapter 7: Classification.
:information_source: Model Selection We're training three different types of classifiers:
- KNN: Makes predictions based on the 'k' nearest neighbors
- Decision Tree: Creates a tree of if-then rules
- Naive Bayes: Uses probability and assumes feature independence
Each has different strengths, so we'll compare their performance!
Train a KNN Classification Model by following these steps:
If this step is done properly, the accuracy of clf_knn should be around 0.77.
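If you need a starting point, a minimal sketch (k defaults to 5 in scikit-learn; the steps above may specify a different k):

```python
from sklearn.neighbors import KNeighborsClassifier

clf_knn = KNeighborsClassifier()
clf_knn.fit(x_train, y_train)
# Accuracy on the held-out test set; compare against the ~0.77 target above
print(clf_knn.score(x_test, y_test))
```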
Train a Decision Tree Classification Model by following these steps:
If this step is done properly, the accuracy of clf_dt should be around 0.78.
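A matching sketch for the Decision Tree (random_state is an assumption to keep the tree deterministic):

```python
from sklearn.tree import DecisionTreeClassifier

clf_dt = DecisionTreeClassifier(random_state=42)
clf_dt.fit(x_train, y_train)
# Compare against the ~0.78 target above
print(clf_dt.score(x_test, y_test))
```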
Train a Naive Bayes Classification Model by following these steps:
If this step is done properly, the accuracy of clf_nb should be around 0.79.
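And for Naive Bayes, assuming the Gaussian variant covered in the chapter:

```python
from sklearn.naive_bayes import GaussianNB

clf_nb = GaussianNB()
clf_nb.fit(x_train, y_train)
# Compare against the ~0.79 target above
print(clf_nb.score(x_test, y_test))
```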
:bulb: Debugging Model Training If your accuracy scores are significantly different:
- Check that data preprocessing was done correctly
- Ensure you're using the right features for X
- Verify that train_test_split was called properly
- Try setting random_state=42 in train_test_split for reproducible results
You should now have three trained models with accuracies around:
:warning: Challenge Ahead! This section is more challenging! You'll use your best model to make predictions on completely new data. Take your time and refer back to your model training code if needed.
In this advanced challenge, we will use the best of the three models (clf_knn, clf_dt, or clf_nb) to predict which of the passengers stored in non_labelled_dataset would and would not have survived the Titanic incident.
Predict the labels of input using the selected (best) model and store the result as predicted_data.
Assign the values of predicted_data to the column 'Survived' of the non_labelled_dataset.
Check to see whether the column "Survived" has been successfully added to the non_labelled_dataset by printing out the first 5 rows of data.
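A sketch of these three steps, using clf_nb as the best model (substitute whichever of your models scored highest):

```python
# Features from the unlabelled data, in the same columns used for training
input = non_labelled_dataset[features_to_be_scaled]

# Predict survival labels with the selected model
predicted_data = clf_nb.predict(input)

# Attach the predictions as a new 'Survived' column
non_labelled_dataset['Survived'] = predicted_data

print(non_labelled_dataset.head())
```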
Please refer to Extra Chapter: Classification Model Evaluation to complete the challenge.
In this step, construct a table using the PrettyTable package to display the precision, recall, and F1 scores of all 3 models: clf_knn, clf_dt, and clf_nb.
```python
table.field_names = ['Model', 'Precision', 'Recall', 'F1 Score']
```
In the for loop provided, for every model stored in models:
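The provided loop likely follows this pattern; a sketch, assuming the models dict maps display names to the three trained classifiers:

```python
from prettytable import PrettyTable
from sklearn.metrics import precision_score, recall_score, f1_score

table = PrettyTable()
table.field_names = ['Model', 'Precision', 'Recall', 'F1 Score']

models = {'KNN': clf_knn, 'Decision Tree': clf_dt, 'Naive Bayes': clf_nb}

for name, model in models.items():
    y_pred = model.predict(x_test)
    # Score each model against the held-out test labels
    table.add_row([
        name,
        round(precision_score(y_test, y_pred), 3),
        round(recall_score(y_test, y_pred), 3),
        round(f1_score(y_test, y_pred), 3),
    ])

print(table)
```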
Ready to take your project further? Try these advanced challenges:
Congratulations on completing the Titanic Survivors Classification project! You've successfully:
Remember: In real-world projects, the process is often iterative. Don't be discouraged if your first attempt doesn't yield perfect results - keep experimenting and learning!