Practice and reinforce the concepts from Lesson 10
💻 Activity Type: Coding Exercise
⏱️ Total Time: 90 minutes
Access the exercise notebook here: https://colab.research.google.com/drive/1ZeGjs9Lkn1jd9762NealGxv7mi9Hkx-n
💡 Make sure you're signed in to your Google account to access the Colab notebook. Create a copy of the notebook before starting.
Question One: Processing a List of Words
⏱️ Time: 15 minutes
Text Preprocessing
- Import the necessary NLTK components
  - Import `nltk` and `SnowballStemmer`
- Download the Punkt tokenizer
  - Use `nltk.download('punkt')`
- Initialize a Snowball stemmer for English
  - Create a stemmer instance: `stemmer = SnowballStemmer('english')`
- Stem each word in the provided list
  - Apply the stemmer to each word using a loop or list comprehension
- Store and print the stemmed words
  - Save the results in a new list and display them
💡 Tip: The Snowball stemmer reduces words to their root form. For example, "running" becomes "run".
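If you get stuck, the steps above might look something like this sketch (`word_list` is a hypothetical stand-in for the list provided in the notebook):

```python
import nltk
from nltk.stem import SnowballStemmer

# Download the Punkt tokenizer models (needed once per environment)
nltk.download('punkt')

# Initialize a Snowball stemmer for English
stemmer = SnowballStemmer('english')

# word_list stands in for the list provided in the notebook
word_list = ['running', 'jumps', 'easily', 'cats']

# Stem each word with a list comprehension
stemmed_words = [stemmer.stem(word) for word in word_list]
print(stemmed_words)  # expected: ['run', 'jump', 'easili', 'cat']
```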
Feature Extraction
- Import CountVectorizer from sklearn
  - `from sklearn.feature_extraction.text import CountVectorizer`
- Initialize the vectorizer
  - `vectorizer = CountVectorizer()`
- Fit the vectorizer with the word list
  - `vectorizer.fit(word_list)`
- Extract and print the vocabulary
  - `vectorizer.vocabulary_`
- Transform the words into a bag-of-words matrix
  - `vectorizer.transform(word_list)`
- Print the resulting matrix
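Put together, the vectorization steps could look like this sketch (continuing from the `word_list` above):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Learn the vocabulary from the word list (each word acts as a tiny document)
vectorizer = CountVectorizer()
vectorizer.fit(word_list)

# The vocabulary maps each token to a column index
print(vectorizer.vocabulary_)

# Transform the words into a sparse bag-of-words matrix
bow_matrix = vectorizer.transform(word_list)
print(bow_matrix)
```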
Question Two: Processing a Sentence
⏱️ Time: 15 minutes
Text Preprocessing
- Tokenize the provided sentence
  - Use `nltk.word_tokenize(sentence)`
- Stem each token using the Snowball stemmer
- Join the stemmed tokens back into a sentence
  - Use `' '.join(stemmed_tokens)`
- Print the processed sentence
Feature Extraction
- Initialize CountVectorizer
- Fit with the processed sentence
  - Wrap it in a list: `[processed_sentence]`
- Extract and print the vocabulary
- Transform the sentence to bag-of-words
- Print the resulting matrix
💡 CountVectorizer expects a list of documents, so even for a single sentence, wrap it in a list!
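As a rough sketch, the whole Question Two pipeline might look like this (`sentence` is a placeholder for the sentence provided in the notebook, and Punkt is assumed to be downloaded already):

```python
import nltk
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = SnowballStemmer('english')

# sentence stands in for the sentence provided in the notebook
sentence = "The children are running and jumping happily"

# Tokenize, stem each token, and rejoin into one string
tokens = nltk.word_tokenize(sentence)
stemmed_tokens = [stemmer.stem(token) for token in tokens]
processed_sentence = ' '.join(stemmed_tokens)
print(processed_sentence)

# CountVectorizer expects a list of documents, hence the wrapping list
vectorizer = CountVectorizer()
vectorizer.fit([processed_sentence])
print(vectorizer.vocabulary_)
print(vectorizer.transform([processed_sentence]))
```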
Question Three: Processing a Sentence with Punctuation
⏱️ Time: 20 minutes
Text Preprocessing
- Tokenize the sentence
  - Use NLTK's word tokenizer
- Stem each token
  - Apply the stemmer to all tokens
- Remove non-alphanumeric tokens
  - Filter tokens using `token.isalnum()`
  - This removes punctuation and special characters
- Join the tokens back into a sentence
  - Use a space as the separator
- Print the cleaned sentence
  - Compare with the original to see what changed (a sketch of these steps follows the warning below)
⚠️ Warning Removing punctuation can change meaning! Consider if this is appropriate for your specific use case.
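A minimal sketch of the cleaning steps, assuming a sample sentence with punctuation (the actual sentence is provided in the notebook):

```python
import nltk
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')

# Placeholder for the sentence with punctuation provided in the notebook
sentence = "Wait, what happened?! The quick brown fox jumped."

tokens = nltk.word_tokenize(sentence)
stemmed = [stemmer.stem(token) for token in tokens]

# isalnum() is False for tokens like ',' and '?', so they are dropped
cleaned_tokens = [token for token in stemmed if token.isalnum()]

cleaned_sentence = ' '.join(cleaned_tokens)
print(cleaned_sentence)
```

Note that contractions can lose meaning here: the tokenizer splits "isn't" into "is" and "n't", and "n't" is filtered out, which is exactly the caveat in the warning above.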
Feature Extraction
- Initialize CountVectorizer
  - Create a fresh vectorizer instance
- Fit with the cleaned sentence
  - Remember to use the list format
- Extract the vocabulary
  - Print the vocabulary items
- Transform to bag-of-words
  - Apply the transformation
- Print the matrix
  - Display the result
Advanced Challenges
Challenge One: Processing a List of Sentences
⏱️ Time: 20 minutes
- Create an empty list for processed sentences: `processed_sentences = []`
- For each sentence:
  - Tokenize using `word_tokenize()`
  - Stem the tokens with your stemmer
  - Remove non-alphanumeric tokens using `isalnum()`
  - Join back into a sentence with spaces
  - Append to the results list
- Print the processed sentences
  - Display original vs. processed for comparison
💡 Tip: Use a helper function to avoid repeating the same preprocessing steps! (A sketch of such a helper follows below.)
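One way to structure that helper, sketched under the assumption that `sentences` is the list provided in the notebook:

```python
import nltk
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')

def preprocess(sentence):
    # Tokenize, stem, drop punctuation tokens, and rejoin
    tokens = nltk.word_tokenize(sentence)
    stemmed = [stemmer.stem(token) for token in tokens]
    return ' '.join(token for token in stemmed if token.isalnum())

# sentences is a placeholder for the list provided in the notebook
sentences = ["The cats are sleeping.", "Dogs barked loudly!"]

processed_sentences = [preprocess(s) for s in sentences]
for original, processed in zip(sentences, processed_sentences):
    print(original, '->', processed)
```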
Feature Extraction
- Initialize CountVectorizer
  - Consider the `max_features` parameter to cap the vocabulary size
- Fit with all processed sentences
- Extract the vocabulary
- Transform all sentences to bag-of-words
- Print the matrix
  - Use `.toarray()` to see the dense representation
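Vectorizing the whole list might look like this (continuing from `processed_sentences` above; the `max_features` value is only an illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# max_features=1000 is an arbitrary cap chosen for illustration
vectorizer = CountVectorizer(max_features=1000)
vectorizer.fit(processed_sentences)
print(vectorizer.vocabulary_)

bow = vectorizer.transform(processed_sentences)
print(bow.toarray())  # dense matrix: one row per sentence
```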
Challenge Two: Processing the Emotion Dataset
⏱️ Time: 20 minutes
- Download and load the emotion dataset
  - Inspect the first rows with `.head()`
- Create a text preprocessing function that:
```python
def preprocess_text(text):
    # Tokenize the text into words
    tokens = nltk.word_tokenize(text)
    # Stem each token
    tokens = [stemmer.stem(token) for token in tokens]
    # Remove non-alphanumeric tokens (punctuation, symbols)
    tokens = [token for token in tokens if token.isalnum()]
    # Join back into a single string
    return ' '.join(tokens)
```
- Apply the function to the entire dataset
  - Use the `.apply()` method on the text column
- Print the first 5 original and processed sentences
⚠️ Warning Large datasets may take time to process. Consider using a subset for testing first!
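For instance, assuming the dataset loads into a pandas DataFrame `df` with a `text` column (hypothetical names; match them to the actual dataset in the notebook):

```python
# df and the 'text' column are assumed names; adjust to the real dataset
df['processed_text'] = df['text'].apply(preprocess_text)

# Show the first 5 original and processed sentences side by side
print(df[['text', 'processed_text']].head())

# Tip: per the warning above, test on a subset first, e.g. df.head(100).copy()
```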
Feature Extraction
- Initialize CountVectorizer
- Fit with all processed text
- Extract the vocabulary
- Transform the dataset to bag-of-words
- Print the first sample's sparse representation
- Convert and print the first sample as an array
  - Use `.toarray()[0]` to see the full vector
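A sketch of that final step, continuing with the assumed `df['processed_text']` column (`fit_transform` combines the fit and transform steps):

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(df['processed_text'])

print(bow[0])             # sparse representation of the first sample
print(bow.toarray()[0])   # the same sample as a full dense vector
# For large datasets, densify only one row instead: bow[0].toarray()
```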
Troubleshooting
NLTK Download Errors
If `nltk.download('punkt')` fails, try:
```python
import ssl
import nltk

# Use an unverified SSL context as a workaround for certificate errors
ssl._create_default_https_context = ssl._create_unverified_context
nltk.download('punkt')
```
Memory Issues with Large Datasets
- Use the `max_features` parameter in CountVectorizer
- Try `TfidfVectorizer` for better memory efficiency
Empty Vocabulary
ℹ️ Info: An "empty vocabulary" error from CountVectorizer usually means your preprocessing filtered out every token; print a few processed samples to check before vectorizing.
📝 Submission Instructions
Once you've completed all questions and challenges:
- Save your Colab notebook (File -> Save a copy in Drive)
- Generate a shareable link (Share -> Copy link)
- Submit your work through the form below
- Include any observations or challenges you faced
Submission Deadline: Check with your instructor