Apply your knowledge to build something amazing!
:information_source: Project Overview
Difficulty Level: Intermediate
Estimated Time: 3-4 hours (Part 1) + 2-3 hours (Part 2)
Skills Practiced:
- Natural Language Processing (NLP)
- Text preprocessing and feature extraction
- Machine Learning model training
- JSON data handling
- User interface development with Streamlit
In this project, you will use your knowledge of Natural Language Processing and Machine Learning to create a chatbot. You will need to create a new Google Colab notebook named "P4_Chatbot.ipynb" before coding.
For Part 1, we focus on text preprocessing and model training, using the techniques covered in previous lessons.
```mermaid
graph TD
    A[Start: Download Dataset] --> B[Phase 1: Setup Environment]
    B --> C[Phase 2: Text Preprocessing]
    C --> D[Phase 3: Process Dataset]
    D --> E[Phase 4: Feature Extraction]
    E --> F[Phase 5: Train ML Models]
    F --> G[Phase 6: Test Models]
    G --> H[Phase 7: Build UI]
    H --> I[Part 2: Streamlit Web App]
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style I fill:#9f9,stroke:#333,stroke-width:2px
```
In daily conversation, each sentence and the words we choose carry an intention. Sometimes the sentences are different, but the meaning we want to convey is the same. For example:
Sentence | Intent |
---|---|
Good Morning! | Greeting |
Hi! How are you today? | Greeting |
I am very sorry for troubling you. | Apology |
I apologize for my mistake. | Apology |
The chatbot is then programmed to provide an appropriate response according to the intent, without considering how the sentences are formed. For example:

User Query | Chatbot Response |
---|---|
Good Morning! | Hi Human. How are you today? |
Hello there | Hi Human. How are you today? |
Hi! | Hi Human. How are you today? |
By training a model to recognize the intent behind each sentence, the chatbot can use the predicted intent to provide an appropriate response to the user's query.
In our dataset, the intents and chatbot response are provided in a json format as shown below:
text - the list of user / text inputs
intent - the intent of the user inputs
responses - the list of responses that the chatbot can provide based on identified intent
```json
{
    "intent": "Greeting",
    "text": [
        "Hi",
        "Hi there",
        "Hola",
        "Hello"
    ],
    "responses": [
        "Hi human, please tell me your Alex user",
        "Hello human, please tell me your Alex user",
        "Hola human, please tell me your Alex user"
    ]
}
```
You may learn more about how a chatbot is created using this video: Link to Video
To learn more about the dataset, feel free to visit this link: https://www.kaggle.com/elvinagammed/chatbots-intent-recognition-dataset
Click on this link to download the dataset file in the form of json "intents.json": https://drive.google.com/file/d/1kd1J5KX5v6FEjr6sahivHFY38BwRJ-r-/view?usp=sharing
Click on the download icon to download the dataset:
For Google Colab users, upload your dataset into your colab file.
:warning: Common Pitfall Make sure the "intents.json" file is uploaded to your Colab environment before proceeding. If you skip this step, you'll get a "FileNotFoundError" when trying to read the JSON file!
Make a copy of the template file found here and rename the copied file as "P4: Chatbot".
Run the code here to download the dataset onto your colab file. Wait for the download process to complete before proceeding to the next step.
If the download fails, you may import the dataset "intents.json" into your file manually. For Colab users, run the following code to upload the JSON file:
```python
from google.colab import files
upload = files.upload()
```
Import the necessary libraries which are numpy and pandas.
:bulb: Best Practice Always import your libraries at the beginning of your notebook! This makes it easy to see all dependencies at a glance and helps avoid "NameError" issues later.
:white_check_mark: Milestone One: Environment setup complete! You should now have the dataset loaded and libraries imported.
Please refer to Chapter 10: Text Preprocessing to complete the following code.
Import nltk package
Download punkt tokenizer from nltk.
Import the snowball module from nltk.stem.
Declare snowballStemmer using the imported snowball module.
Within the function text_preprocessing, code out the following steps:
:bulb: Understanding Text Preprocessing Text preprocessing is like cleaning and organizing your words before analysis:
- Tokenization: Breaking sentences into individual words
- Stemming: Reducing words to their root form (e.g., "running" -> "run")
- Cleaning: Removing punctuation and special characters
Test the function text_preprocessing by running the function with the following sentence:
'We all agreed, it was a magnificent evening.'
Expected Output:
we all agre it was a magnific even
:warning: Debugging Tip If your output doesn't match, check:
- Did you convert all words to lowercase?
- Did you remove ALL punctuation (including commas and periods)?
- Is your stemmer initialized correctly?
:white_check_mark: Milestone 2: Text preprocessing function working correctly!
Run the following code to extract the dataset from "intents.json".

```python
# Import the JSON package and extract all the data from the dataset.
import json
with open("intents.json") as f:
    data = json.load(f)
```
Declare multiple lists as shown below. (Remarks: You may copy the following code)
```python
intent_list = []
train_data = []
train_label = []
responses = {}
```
Within the following loop, write the code:
:bulb: Loop Structure The loop provided in the template iterates through each intent in the dataset. For each intent, you need to:
- Process all the example texts
- Store the processed texts and their labels
- Build the responses dictionary
Print out the following values:
The dictionary responses contains all the corresponding replies that the chatbot can provide according to the intent in the user's query. Print out the list of responses for the intent "Thanks" by indexing the dictionary responses.
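A self-contained sketch of how the loop body might fill these lists (the tiny inline dataset and the simplified preprocessing stand-in below are just for illustration; in your notebook you would use the loaded `data` and your own `text_preprocessing`):

```python
# stand-in for the text_preprocessing function built earlier;
# here it only lowercases so the sketch runs on its own
def text_preprocessing(sentence):
    return sentence.lower()

# miniature version of the intents.json structure
data = {"intents": [
    {"intent": "Greeting", "text": ["Hi", "Hello"],
     "responses": ["Hi human"]},
    {"intent": "Thanks", "text": ["Thank you"],
     "responses": ["You're welcome", "Any time"]},
]}

intent_list, train_data, train_label = [], [], []
responses = {}

for intent in data['intents']:
    intent_list.append(intent['intent'])
    # map the intent name to its list of possible replies
    responses[intent['intent']] = intent['responses']
    for text in intent['text']:
        # store each preprocessed sentence together with its label
        train_data.append(text_preprocessing(text))
        train_label.append(intent['intent'])

print(responses["Thanks"])
```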
:white_check_mark: Milestone 3: Dataset loaded and preprocessed successfully!
Please refer to Chapter 10: Text Preprocessing to complete the following code.
Import the class CountVectorizer from Scikit-Learn library.
Declare the vectorizer object using the imported CountVectorizer.
Create a vocabulary for vectorizer by using the function fit with train_data as one of its parameters.
Assign vectorizer.get_feature_names_out() to a new array list_of_words.
Print out list_of_words.
By using the vectorizer, convert train_data into a bag of words train_data_bow.
:bulb: Understanding Bag of Words The Bag of Words (BoW) representation converts text into numbers:
- Each word in the vocabulary gets a position
- Each text is represented as a count of how many times each word appears
- This creates a numerical representation that ML models can understand
Now you can compare train_data with train_data_bow. Print the value at index 1 in train_data, the corresponding row in train_data_bow, and its label in train_label:

```python
print(train_data[1])
print(train_data_bow[1])
print(train_label[1])
```
Note that in the list_of_words, the word "hi" is positioned at index position 52 while the word "there" is positioned at index position 112. You can find out the words at each position with the following code.
```python
print(list_of_words[52])
print(list_of_words[112])
```
:white_check_mark: Milestone 4: Feature extraction complete! Your text data is now in numerical format.
Please refer to Chapter 7: Classification to complete the following code.
:bulb: Why Three Different Models? We're training three different classifiers to compare their performance:
- KNN: Simple and intuitive, but can be slow with large datasets
- Decision Tree: Fast and interpretable, good for understanding decisions
- Naive Bayes: Excellent for text classification, often the best choice for chatbots
Import KNN classifier from Scikit-Learn library.
Declare a KNN classifier clf_knn. Set the value of n_neighbors to 5.
Fit the classifier using train_data_bow and train_label.
Import Decision Tree classifier from Scikit-Learn library.
Declare a Decision Tree classifier clf_dt. Set the value of random_state to 33.
Fit the classifier using train_data_bow and train_label.
Import Multinomial Naive Bayes classifier from Scikit-Learn library
Declare a Naive Bayes classifier clf_nb.
Fit the classifier using train_data_bow and train_label.
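Taken together, the three training steps might look like the sketch below (the toy sentences and labels are invented for illustration; in the project you fit on your real `train_data_bow` and `train_label`):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB

toy_data = ["hi", "hello there", "good morning",
            "thank you", "thanks a lot", "many thanks"]
toy_label = ["Greeting", "Greeting", "Greeting",
             "Thanks", "Thanks", "Thanks"]

vectorizer = CountVectorizer()
toy_bow = vectorizer.fit_transform(toy_data)

clf_knn = KNeighborsClassifier(n_neighbors=5)     # KNN with 5 neighbours
clf_dt = DecisionTreeClassifier(random_state=33)  # fixed seed for reproducibility
clf_nb = MultinomialNB()                          # Naive Bayes suits count features

# fit all three on the same bag-of-words matrix and labels
for clf in (clf_knn, clf_dt, clf_nb):
    clf.fit(toy_bow, toy_label)
```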
:white_check_mark: Milestone 5: All three models trained successfully!
Please refer to Chapter 10: Text Preprocessing and Chapter 7: Classification to complete the following code.
Test all 3 models using the example sentence "Hello there".

```python
test_sentence = "Hello there"
```
Before doing prediction, remember to:
:warning: Important Processing Steps The test sentence must go through the EXACT same preprocessing as the training data:
- Apply text preprocessing (tokenize, stem, clean)
- Convert to Bag of Words using the SAME vectorizer
- The input to transform() must be a list!
:bulb: Testing Your Models After preprocessing, use each classifier's
predict()
method to see which intent each model predicts. Compare their results!
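The full test pipeline, sketched end-to-end with a stand-in preprocessing step (in your notebook you would reuse your own `text_preprocessing`, `vectorizer`, and the classifiers you already trained):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy training set, for illustration only
train_data = ["hi there", "good morning", "thank you", "thanks a lot"]
train_label = ["Greeting", "Greeting", "Thanks", "Thanks"]

vectorizer = CountVectorizer()
train_data_bow = vectorizer.fit_transform(train_data)
clf_nb = MultinomialNB().fit(train_data_bow, train_label)

test_sentence = "Hello there"
processed = test_sentence.lower()       # stand-in for text_preprocessing(test_sentence)
# transform() expects a list of documents, even for a single sentence
test_bow = vectorizer.transform([processed])
print(clf_nb.predict(test_bow))
```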
:white_check_mark: Milestone 6: Models tested and ready for integration!
In this phase, we will develop a simple interface for our chatbot.
Import package random and package datetime from datetime.
```python
import random
from datetime import datetime
```
In general, the flow of the bot_respond function is as follows:
With this flow in mind, complete the code stated in step 2.
Declare a function bot_respond that receives a parameter named user_query. In the function bot_respond:
Use function text_preprocessing to tokenize and stem user_query.
After stemming, transform user_query into a bag of words named user_query_bow. Remember to store user_query as a list before applying the transform function.
From the three classifiers, clf_knn, clf_dt and clf_nb, select and assign clf_nb as clf.
Use the selected classifier clf to predict the intent of user_query_bow. Store the predicted intent in a variable predicted. Note that the predicted result is in the form of a Numpy Array.
Insert the following code into the function. It returns a default response when the chatbot does not recognize the intent behind the user input. (Remarks: The code is already available in the template)

```python
# Return a default response if the chatbot does not know what intent the user_query is about
max_proba = max(clf.predict_proba(user_query_bow)[0])
if max_proba < 0.08 and clf == clf_nb:
    predicted = ['noanswer']
elif max_proba < 0.3 and not clf == clf_nb:
    predicted = ['noanswer']
```
Declare an empty string bot_response.
For each intent, there are several responses the chatbot can choose from. Randomly generate a number chosenResponse in the range from 0 to the number of responses for the intent minus 1. (Remarks: The code is already available in the template)

```python
# Randomly generate a number chosenResponse within the range 0 to (number of responses - 1)
numOfResponses = len(responses[predicted[0]])
chosenResponse = random.randint(0, numOfResponses-1)
```
Based on chosenResponse, select the response from responses and assign it to bot_response. (Remark: The codes are already available in the template)
```python
# Select the response from responses and assign it to bot_response
if predicted[0] == "TimeQuery":
    bot_response = eval(responses[predicted[0]][chosenResponse])
else:
    bot_response = responses[predicted[0]][chosenResponse]
```
Return bot_response.
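Putting the steps above together, here is a self-contained sketch of the whole bot_respond flow (toy data, Naive Bayes only, and a lowercasing stand-in for text_preprocessing; your notebook version uses the real globals built in the earlier phases):

```python
import random
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy stand-ins for the notebook's globals
responses = {
    "Greeting": ["Hi human"],
    "Thanks": ["You're welcome", "Any time"],
    "noanswer": ["Sorry, I don't understand"],
}
train_data = ["hi there", "good morning", "thank you", "thanks a lot"]
train_label = ["Greeting", "Greeting", "Thanks", "Thanks"]

vectorizer = CountVectorizer()
train_data_bow = vectorizer.fit_transform(train_data)
clf_nb = MultinomialNB().fit(train_data_bow, train_label)

def bot_respond(user_query):
    user_query = user_query.lower()              # stand-in for text_preprocessing
    user_query_bow = vectorizer.transform([user_query])
    clf = clf_nb
    predicted = clf.predict(user_query_bow)
    # fall back to a default answer when the model is unsure
    if max(clf.predict_proba(user_query_bow)[0]) < 0.08:
        predicted = ['noanswer']
    # pick one of the possible replies for the predicted intent at random
    options = responses[predicted[0]]
    return options[random.randint(0, len(options) - 1)]

print(bot_respond("thank you"))
```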
Create a simple interface that accepts user input. (Remarks: The codes are already available in the template.)
```python
# Simple interface for chatbot
print("This is Alex the chatbot. Say something!!")
while True:
    try:
        bot_input = input("You : ")
        print("Alex :", bot_respond(bot_input))
    except KeyboardInterrupt:
        print("Alex : Thank you for choosing us. See you again soon!!")
        break
```
Now it's time to test out your chatbot! Run the code and try to type in your query:
:bulb: Testing Your Chatbot Try these test phrases:
- Greetings: "Hello", "Hi there", "Good morning"
- Questions: "What time is it?", "What's your name?"
- Thanks: "Thank you", "Thanks a lot"
- Unknown: Try something not in the training data!
:white_check_mark: Milestone 7: Basic chatbot complete and working!
We have successfully implemented a simple chatbot that can identify the intention behind a user query and provide an appropriate answer. However, there are still other features we can add. Using everything you have learnt so far, try to implement a feature that prompts the user for a username. Then modify the function bot_respond to replace the placeholder <HUMAN> with the username.
:bulb: Implementation Hints
- Ask for the username before the main chat loop starts
- Store the username in a variable
- Use the string .replace() method in the bot_respond function
- Test with responses that contain the <HUMAN> placeholder
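One way to sketch the replacement step (the prompt text, function name, and sample response below are just suggestions, not the template's own code):

```python
# a response containing the "<HUMAN>" placeholder, as in the dataset
sample_response = "Hi <HUMAN>, how are you today?"

def personalize(response, username):
    # swap the placeholder for the name the user gave us
    return response.replace("<HUMAN>", username)

# in the real chatbot, collect this once before the chat loop starts:
# username = input("Please tell me your name: ")
username = "Alex"
print(personalize(sample_response, username))
```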
As it stands, the chatbot can only recognize the intents stored in the provided dataset. As a bonus challenge, try to modify intents.json to include more queries with different intentions. One example of a new intent is shown below. (Remarks: make sure that the JSON syntax is correct)
```json
{
    "intent": "FavouriteFood",
    "text": [
        "What is your favourite food ?",
        "What do you like to eat ?",
        "What food do you like ?",
        "What do you eat ?"
    ],
    "responses": [
        "My favourite food is Nasi Lemak",
        "I love Nasi Lemak",
        "I can eat Nasi Lemak everyday"
    ]
}
```
To access intents.json, you may open the file using the left navigation bar as shown below.
The file will be opened up and you may then modify the intents.json file.
After modifications, rerun the codes from phase 2 onwards to renew the chatbot model with the newly modified data.
:bulb: Research Challenge Create a comparison table showing:
- Accuracy of each model (KNN, Decision Tree, Naive Bayes)
- Response time for each model
- Which model works best for your chatbot and why?
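A possible starting point for the comparison (toy data for illustration; swap in your real training data and, ideally, a held-out test split rather than the training set used here):

```python
import time
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

texts = ["hi", "hello there", "good morning", "hey you",
         "thank you", "thanks a lot", "many thanks", "thank you so much"]
labels = ["Greeting"] * 4 + ["Thanks"] * 4

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(texts)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Decision Tree": DecisionTreeClassifier(random_state=33),
    "Naive Bayes": MultinomialNB(),
}

results = {}
for name, clf in models.items():
    clf.fit(bow, labels)
    start = time.perf_counter()
    predictions = clf.predict(bow)     # note: evaluating on the training set here
    elapsed = time.perf_counter() - start
    results[name] = (accuracy_score(labels, predictions), elapsed)
    print(f"{name}: accuracy={results[name][0]:.2f}, time={elapsed:.4f}s")
```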
Try to implement a simple context system where the chatbot remembers the last intent and can respond accordingly.
In this project, you will use your knowledge of Natural Language Processing and Scikit-Learn to create a chatbot.
For Part 2, we focus on building a UI webpage for our chatbot using the streamlit package.
:bulb: What is Streamlit? Streamlit is a Python library that makes it super easy to create web applications! Think of it as turning your Python code into an interactive website with just a few lines of code. Perfect for showcasing your chatbot to friends and family!
Note: You may skip this section if you have already downloaded the dataset in Part 1.
Click on this link to download the dataset file in the form of json "intents.json": https://drive.google.com/file/d/1kd1J5KX5v6FEjr6sahivHFY38BwRJ-r-/view?usp=sharing
Click on the download icon to download the dataset:
Make sure you have downloaded and installed Visual Studio Code.
Watch this video to set up your VSCode.
On your laptop, create a new folder and name it "P4: Chatbot (Part 2)".
Move your intents.json file into that folder.
Create a new file and name it "chatbot.py".
Open the folder using VSCode.
Install all the libraries needed for this project using the Visual Studio Code command.
:warning: Installation Tips
Make sure you're in the correct directory in your terminal before installing! Use
cd
to navigate to your project folder first.
Install each library by running the following commands in the terminal (note that the sklearn library is installed via the scikit-learn package on pip):

```shell
py -m pip install streamlit
py -m pip install pandas
py -m pip install numpy
py -m pip install nltk
py -m pip install scikit-learn
```
This project will have 2 parts:
- Creating the chatbot model.
- Designing the UI using streamlit.
Phase One: Import Packages
- Import streamlit library as st
- Import pandas library as pd
- Import the package numpy as np
- Import all the libraries needed to perform text processing that we learned in Part 1. Copy and paste the code below in VS Code.
```python
import json
import nltk
from nltk.stem import snowball
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
import random
from datetime import datetime
```
Phase 2: Declaration of Text Preprocessing Function
- You may copy the code below to declare list and dictionaries.
```python
intent_list = []
train_data = []
train_label = []
responses = {}
list_of_words = []
```
- You may copy the code below to declare the text preprocessing function.

```python
nltk.download('punkt')
snowballStemmer = snowball.SnowballStemmer("english")

def text_preprocessing(sentence):
    # tokenize the sentence
    tokens = nltk.word_tokenize(sentence)
    # keep only alphanumeric tokens; note that removing items from a list
    # while iterating over it skips elements, so we filter instead
    tokens = [token for token in tokens if token.isalnum()]
    stem_tokens = []
    for token in tokens:
        stem_tokens.append(snowballStemmer.stem(token.lower()))
    return " ".join(stem_tokens)
```
:white_check_mark: Milestone 1 (Part 2): Environment set up with all necessary imports!
Phase 3: Feature Extraction and Decision Tree Model
- Declare the vectorizer object using the imported CountVectorizer and declare a Decision Tree classifier clf_dt. Set the value of random_state to 33. You may copy the code below.
```python
# Feature Extraction
vectorizer = CountVectorizer()
# Build NLP Model
clf_dt = DecisionTreeClassifier(random_state=33)
```
Phase 4: Generate response
- Create a function bot_respond that receives a parameter named user_query. The function will preprocess the text, predict the intent, and return a response to the user. You may copy the code below.

```python
def bot_respond(user_query):
    # preprocess what the user says
    user_query = text_preprocessing(user_query)
    user_query_bow = vectorizer.transform([user_query])
    clf = clf_dt
    predicted = clf.predict(user_query_bow)  # predict the intent

    # When the model does not know the intent
    # (Part 2 only uses the Decision Tree, so the Naive Bayes branch
    # from Part 1 is dropped here)
    max_proba = max(clf.predict_proba(user_query_bow)[0])
    if max_proba < 0.3:
        predicted = ['noanswer']

    bot_response = ""
    numOfResponses = len(responses[predicted[0]])
    chosenResponse = random.randint(0, numOfResponses-1)
    if predicted[0] == "TimeQuery":
        bot_response = eval(responses[predicted[0]][chosenResponse])
    else:
        bot_response = responses[predicted[0]][chosenResponse]
    return bot_response
```
:white_check_mark: Milestone 2 (Part 2): Core functions ready!
Phase 5: Function to load model
Create another function load_model() that loads the training data from the intents.json file, extracts the features, and trains the model. We will call this function to process the user input later.
:bulb: Understanding load_model() This function does all the heavy lifting:
- Loads the intents from JSON
- Preprocesses all training data
- Creates the vocabulary (vectorizer)
- Trains the model
It's like preparing your chatbot's brain before it starts chatting!
```python
def load_model():
    # tell Python we are assigning to the module-level list_of_words
    global list_of_words
    # import training data
    with open("intents.json") as f:
        data = json.load(f)
    # load training data
    for intent in data['intents']:
        for text in intent['text']:
            # Save the preprocessed sentences
            preprocessed_text = text_preprocessing(text)
            train_data.append(preprocessed_text)
            # Save the intent label
            train_label.append(intent['intent'])
        intent_list.append(intent['intent'])
        responses[intent['intent']] = intent["responses"]
    # Feature Extraction
    vectorizer.fit(train_data)
    list_of_words = vectorizer.get_feature_names_out()
    train_data_bow = vectorizer.transform(train_data)
    # Train the model
    clf_dt.fit(train_data_bow, train_label)
```
:white_check_mark: Milestone 3 (Part 2): Model loading function complete!
:bulb: Running Streamlit To run your Streamlit app, use this command in the terminal:

```shell
streamlit run chatbot.py
```
Your browser will automatically open with your chatbot!
```python
# "text" comes from a Streamlit text input widget; the label below is an
# assumption, so adjust it to match your template
text = st.text_input("You : ")
if text:
    st.write('Chatbot:')
    with st.spinner('Loading...'):
        st.write(bot_respond(text))
```
:warning: Common Streamlit Issues
- If you get "No module named streamlit", make sure you installed it with pip
- If the page refreshes when you type, that's normal! Streamlit reruns the entire script
- Use st.session_state to maintain conversation history if needed
:white_check_mark: Milestone 4 (Part 2): Congratulations! Your chatbot now has a web interface!
Try to create one more page in the sidebar. You can give the page any name and display information about the programmer of the webpage. You can insert your own image and add any elements that can make your webpage look pretty and interesting.
:bulb: Creative Ideas
- Add a chat history that shows previous conversations
- Include fun animations or GIFs
- Add sound effects when the bot responds
- Create a theme switcher (light/dark mode)
- Add a feedback system where users can rate responses
Before submitting your project, make sure you have completed all the milestones above.
Great job completing this project! You've learned how to combine NLP, Machine Learning, and web development to create a real AI application. Keep experimenting and adding new features to make your chatbot even more impressive! :rocket: