Apply your knowledge to build something amazing!
:information_source: Project Overview
Difficulty Level: Intermediate
Estimated Time: 3-4 hours
Skills Practiced:
- Data cleaning and preprocessing
- Feature engineering with LabelEncoder
- Data scaling with MinMaxScaler
- Classification algorithms (KNN, Decision Tree, Naive Bayes)
- Model evaluation and comparison
- Python programming with pandas, numpy, and scikit-learn
In this project, you will apply what you have learned about the machine learning process and classification to build a model that predicts whether a passenger survived the Titanic disaster, based on a few selected features. We have gone through all of the steps shown below; try to code them out manually.
On April 15, 1912, the Titanic, one of the largest passenger liners of its time, sank after colliding with an iceberg during its maiden voyage. 1,502 of the 2,224 passengers and crew on board did not survive.
This dataset contains information about 891 different passengers that were onboard the ship during the incident.
The following information about the passengers is in the dataset: PassengerId, Survived (the label, present only in the labelled dataset), Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked.
Using what you have learnt, try to build an ML model that can predict whether a passenger survived the incident based on the above information.
```mermaid
graph TD
A[Phase 1: Import & Load Data] --> B[Phase 2: Explore Data]
B --> C[Phase 3: Clean Data]
C --> D[Phase 4: Preprocess Data]
D --> E[Phase 5: Split Data]
E --> F[Phase 6: Build Models]
F --> G[Phase 7: Evaluate Models]
G --> H[Advanced Challenges]
style A fill:#e1f5fe
style B fill:#b3e5fc
style C fill:#81d4fa
style D fill:#4fc3f7
style E fill:#29b6f6
style F fill:#03a9f4
style G fill:#039be5
style H fill:#0288d1
```
Make a copy of the template file found here and rename the copy to "P2: TitanicSurvivorsClassification".
Run the code here to download the dataset into your Colab notebook. Wait for the download to complete before proceeding to the next step.
```python
# Download the Titanic dataset
!wget https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv -O titanic_data.csv
!wget https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic_test.csv -O titanic_test.csv
```
You may start to do your coding here.
:bulb: Best Practice Always import all required libraries at the beginning of your notebook. This makes it easier to track dependencies and ensures your code runs smoothly from top to bottom.
Import the necessary libraries to complete the project. These include pandas, numpy, seaborn, and the relevant scikit-learn modules.
Run the next set of imports to ensure that all the necessary libraries are available.
```python
import seaborn as sns
```
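Beyond seaborn, a complete import cell for this project might look like the following sketch (the scikit-learn classes listed match the encoders, scaler, and models used later in this project):

```python
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
```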
Read the csv file "titanic_data.csv" and save it as labelled_dataset.
Read the csv file "titanic_test.csv" and save it as non_labelled_dataset.
:warning: Common Pitfall Make sure both CSV files have been downloaded successfully before trying to read them. If you get a "File not found" error, re-run the download cell above.
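A minimal sketch of the loading step, assuming the download cell above saved the files under the names shown:

```python
# Load the labelled training data and the unlabelled test data
labelled_dataset = pd.read_csv('titanic_data.csv')
non_labelled_dataset = pd.read_csv('titanic_test.csv')
```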
Before proceeding to Phase 2, ensure that labelled_dataset and non_labelled_dataset both contain data.
Please complete the following steps by referring to Chapter 5: Data Preparation.
:bulb: Understanding Your Data Data exploration is crucial! Take time to understand what each column represents and identify patterns. This will help you make better decisions during data cleaning and feature selection.
Inspect the shape of labelled_dataset and non_labelled_dataset.
Check the first 5 rows of labelled_dataset.
Check the first 5 rows of non_labelled_dataset.
Check the summary information of labelled_dataset.
Check the summary information of non_labelled_dataset.
Check how many missing values there are in labelled_dataset.
Check how many missing values there are in non_labelled_dataset.
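If you get stuck, the exploration steps above map onto standard pandas calls; a sketch:

```python
# Shape: (rows, columns) of each dataset
print(labelled_dataset.shape)
print(non_labelled_dataset.shape)

# First 5 rows of each dataset
print(labelled_dataset.head())
print(non_labelled_dataset.head())

# Column types and non-null counts
labelled_dataset.info()
non_labelled_dataset.info()

# Number of missing values per column
print(labelled_dataset.isnull().sum())
print(non_labelled_dataset.isnull().sum())
```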
After exploring the data, you should have discovered:
Please complete the following steps by referring to Chapter 5: Data Preparation.
:warning: Important Decision We're dropping the 'Cabin' column because it has too many missing values (>75%). In real projects, always consider whether missing data might contain valuable patterns before dropping it!
Drop the column 'Cabin' from both labelled_dataset and non_labelled_dataset.
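A one-line sketch per dataset (note that the column is capitalised as 'Cabin' in the standard Titanic CSV):

```python
# Drop the 'Cabin' column; it is missing for most passengers
labelled_dataset = labelled_dataset.drop(columns=['Cabin'])
non_labelled_dataset = non_labelled_dataset.drop(columns=['Cabin'])
```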
To fill in the missing values in the 'Age' column:
Run the following cell to generate the mean age of passengers for each of the 4 honorifics.
Define a function that automatically fills in any missing passenger ages according to their honorifics. (Remark: run the code cell shown below.)
Run the following cell to fill in the missing ages in labelled_dataset and non_labelled_dataset.
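The provided cells are not reproduced here, but one possible implementation of this pattern looks like the sketch below. The four honorifics ('Mr.', 'Mrs.', 'Miss.', 'Master.') are assumed from the standard Titanic Name format:

```python
honorifics = ['Mr.', 'Mrs.', 'Miss.', 'Master.']

def mean_age_for(df, honorific):
    # Mean age of passengers whose name contains this honorific
    return df.loc[df['Name'].str.contains(honorific, regex=False), 'Age'].mean()

# Mean age per honorific, computed from the labelled data
mean_ages = {h: mean_age_for(labelled_dataset, h) for h in honorifics}

def fill_missing_ages(df):
    # Fill each missing age with the mean age for that passenger's honorific
    for honorific, mean_age in mean_ages.items():
        mask = df['Name'].str.contains(honorific, regex=False) & df['Age'].isnull()
        df.loc[mask, 'Age'] = mean_age
    return df

labelled_dataset = fill_missing_ages(labelled_dataset)
non_labelled_dataset = fill_missing_ages(non_labelled_dataset)
```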
Drop the rows with missing values in the "Fare" and "Embarked" columns from both labelled_dataset and non_labelled_dataset.
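A sketch using dropna() with the subset argument:

```python
# Drop rows where 'Fare' or 'Embarked' is still missing
labelled_dataset = labelled_dataset.dropna(subset=['Fare', 'Embarked'])
non_labelled_dataset = non_labelled_dataset.dropna(subset=['Fare', 'Embarked'])
```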
Finally, do a final check on how many missing values remain in labelled_dataset and non_labelled_dataset. If data cleaning was done properly, there should be no missing values left in either dataset.
:bulb: Debugging Tip If you still have missing values after this step, check:
- Did you apply the age-filling function to both datasets?
- Did you use `dropna()` on the correct columns?
- Try using `.isnull().sum()` to see which columns still have missing data
Before moving to preprocessing:
Please complete the following steps by referring to Chapter 5: Data Preparation.
:information_source: Why Preprocessing? Machine learning algorithms work with numbers, not text. We need to:
- Encode categorical data (Sex, Embarked) into numbers
- Scale numerical features so they have similar ranges
This ensures all features contribute equally to the model's predictions.
In the cell provided, encode the columns "Sex" and "Embarked" by following the steps stated below:
```python
# Print out the encoded classes along with each feature
print(f"Feature: {feature} Encoded Classes: {le.classes_}")
```
:warning: Common Encoding Error Make sure to fit the encoder on the labelled dataset first, then transform both datasets. Never fit the encoder separately on each dataset - this could result in different encodings!
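One way to follow that advice, fitting each encoder on the labelled data and then transforming both datasets:

```python
from sklearn.preprocessing import LabelEncoder

for feature in ['Sex', 'Embarked']:
    le = LabelEncoder()
    # Fit on the labelled data only, then apply the same mapping to both datasets
    le.fit(labelled_dataset[feature])
    labelled_dataset[feature] = le.transform(labelled_dataset[feature])
    non_labelled_dataset[feature] = le.transform(non_labelled_dataset[feature])
    # Print out the encoded classes along with the feature name
    print(f"Feature: {feature} Encoded Classes: {le.classes_}")
```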
Check to see whether the features have been properly encoded by printing out the first 5 rows of data in labelled_dataset.
In the cell provided, scale the columns ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Embarked'], which are stored in the list features_to_be_scaled:
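A sketch with MinMaxScaler, fitted on the labelled data and applied to both datasets (mirroring the encoding advice above; the provided cell may differ):

```python
from sklearn.preprocessing import MinMaxScaler

features_to_be_scaled = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Embarked']

scaler = MinMaxScaler()
# Fit on the labelled data, then rescale the same columns in both datasets
labelled_dataset[features_to_be_scaled] = scaler.fit_transform(labelled_dataset[features_to_be_scaled])
non_labelled_dataset[features_to_be_scaled] = scaler.transform(non_labelled_dataset[features_to_be_scaled])
```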
To check that the features have been properly scaled, print out the first 5 rows of data from labelled_dataset.
After preprocessing, verify that:
Please complete the following steps by referring to Chapter 5: Data Preparation.
Run the following cell to obtain the features (x) and labels (y) from labelled_dataset; a sketch of what it might contain appears after the note below.
:bulb: Understanding Train-Test Split The code automatically splits your data into training (80%) and testing (20%) sets. This allows you to:
- Train models on one portion of data
- Test their performance on unseen data
- Get a realistic estimate of how well they'll perform in the real world
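The provided cell likely resembles this sketch (random_state is an assumption added for reproducibility):

```python
from sklearn.model_selection import train_test_split

# Features (x) and labels (y) from the labelled data
x = labelled_dataset[features_to_be_scaled]
y = labelled_dataset['Survived']

# 80/20 train-test split; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
```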
Please complete the following steps by referring to Chapter 7: Classification.
:information_source: Model Selection We're training three different types of classifiers:
- KNN: Makes predictions based on the 'k' nearest neighbors
- Decision Tree: Creates a tree of if-then rules
- Naive Bayes: Uses probability and assumes feature independence
Each has different strengths, so we'll compare their performance!
Train a KNN Classification Model by following these steps:
If this step is done properly, the accuracy of clf_knn should be around 0.77.
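If you need a starting point, a minimal sketch (k defaults to 5 in scikit-learn; the steps above may specify a different k):

```python
from sklearn.neighbors import KNeighborsClassifier

clf_knn = KNeighborsClassifier()
clf_knn.fit(x_train, y_train)
# Accuracy on the held-out test set; compare against the ~0.77 target above
print(clf_knn.score(x_test, y_test))
```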
Train a Decision Tree Classification Model by following these steps:
If this step is done properly, the accuracy of clf_dt should be around 0.78.
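A matching sketch for the Decision Tree (random_state is an assumption to keep the tree deterministic):

```python
from sklearn.tree import DecisionTreeClassifier

clf_dt = DecisionTreeClassifier(random_state=42)
clf_dt.fit(x_train, y_train)
# Compare against the ~0.78 target above
print(clf_dt.score(x_test, y_test))
```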
Train a Naive Bayes Classification Model by following these steps:
If this step is done properly, the accuracy of clf_nb should be around 0.79.
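And for Naive Bayes, assuming the Gaussian variant covered in the chapter:

```python
from sklearn.naive_bayes import GaussianNB

clf_nb = GaussianNB()
clf_nb.fit(x_train, y_train)
# Compare against the ~0.79 target above
print(clf_nb.score(x_test, y_test))
```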
:bulb: Debugging Model Training If your accuracy scores are significantly different:
- Check that data preprocessing was done correctly
- Ensure you're using the right features for X
- Verify that train_test_split was called properly
- Try setting random_state=42 in train_test_split for reproducible results
You should now have three trained models with accuracies around:
:warning: Challenge Ahead! This section is more challenging! You'll use your best model to make predictions on completely new data. Take your time and refer back to your model training code if needed.
In this advanced challenge, we will use the best of the three models (clf_knn, clf_dt, or clf_nb) to predict which of the passengers stored in non_labelled_dataset would and would not have survived the Titanic incident.
Predict the labels of input using the selected (best) model and store the result as predicted_data.
Assign the values of predicted_data to the column 'Survived' of the non_labelled_dataset.
Check to see whether the column "Survived" has been successfully added to the non_labelled_dataset by printing out the first 5 rows of data.
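A sketch of these three steps, using clf_nb as the best model (substitute whichever of your models scored highest):

```python
# Features from the unlabelled data, in the same columns used for training
input = non_labelled_dataset[features_to_be_scaled]

# Predict survival labels with the selected model
predicted_data = clf_nb.predict(input)

# Attach the predictions as a new 'Survived' column
non_labelled_dataset['Survived'] = predicted_data

print(non_labelled_dataset.head())
```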
Please refer to Extra Chapter: Classification Model Evaluation to complete the challenge.
In this step, construct a table using the PrettyTable package to display the precision, recall, and F1 scores of all 3 models: clf_knn, clf_dt, and clf_nb.
```python
table.field_names = ['Model', 'Precision', 'Recall', 'F1 Score']
```
In the for loop provided, for every model stored in models:
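The provided loop likely follows this pattern; a sketch, assuming the models dict maps display names to the three trained classifiers:

```python
from prettytable import PrettyTable
from sklearn.metrics import precision_score, recall_score, f1_score

table = PrettyTable()
table.field_names = ['Model', 'Precision', 'Recall', 'F1 Score']

models = {'KNN': clf_knn, 'Decision Tree': clf_dt, 'Naive Bayes': clf_nb}

for name, model in models.items():
    y_pred = model.predict(x_test)
    # Score each model against the held-out test labels
    table.add_row([
        name,
        round(precision_score(y_test, y_pred), 3),
        round(recall_score(y_test, y_pred), 3),
        round(f1_score(y_test, y_pred), 3),
    ])

print(table)
```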
Ready to take your project further? Try these advanced challenges:
Congratulations on completing the Titanic Survivors Classification project! You've successfully:
Remember: In real-world projects, the process is often iterative. Don't be discouraged if your first attempt doesn't yield perfect results - keep experimenting and learning!