Practice and reinforce the concepts from Lesson 10
:computer: Activity Type: Coding Exercise
⏱️ Total Time: 90 minutes
Access the exercise notebook here: Text Processing Colab Notebook
:bulb: Make sure you're signed in to your Google account to access the Colab notebook. Create a copy of the notebook before starting.
Question One: Processing a List of Words
⏱️ Time: 15 minutes
Text Preprocessing
- Import the necessary NLTK components
  - Import `nltk` and `SnowballStemmer`
- Download the Punkt tokenizer
  - Use `nltk.download('punkt')`
- Initialize a Snowball stemmer for English
  - Create the stemmer instance: `stemmer = SnowballStemmer('english')`
- Stem each word in the provided list
  - Apply the stemmer to each word using a loop or a list comprehension
- Store and print the stemmed words
  - Save the results in a new list and display them (see the sketch below)
:bulb: Tip: The Snowball stemmer reduces words to their root form. For example, "running" becomes "run".
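A minimal sketch of these steps; the word list here is a stand-in for the one provided in the notebook:

```python
import nltk
from nltk.stem import SnowballStemmer

nltk.download('punkt')  # tokenizer models, also used in later questions

# Placeholder list -- substitute the word list from the notebook
word_list = ['running', 'jumps', 'easily', 'fairly']

stemmer = SnowballStemmer('english')
stemmed_words = [stemmer.stem(word) for word in word_list]
print(stemmed_words)
```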
Feature Extraction
- Import CountVectorizer from sklearn: `from sklearn.feature_extraction.text import CountVectorizer`
- Initialize the vectorizer: `vectorizer = CountVectorizer()`
- Fit the vectorizer with the word list: `vectorizer.fit(word_list)`
- Extract and print the vocabulary: `vectorizer.vocabulary_`
- Transform the words into a bag-of-words matrix: `vectorizer.transform(word_list)`
- Print the resulting matrix (see the sketch below)
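A sketch of the vectorization steps. It fits on `word_list` as the steps state; swap in the stemmed list if that is what the notebook intends:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Each word in the list is treated as its own one-word "document"
vectorizer = CountVectorizer()
vectorizer.fit(word_list)       # or stemmed_words, per the notebook
print(vectorizer.vocabulary_)   # mapping of term -> column index

bow = vectorizer.transform(word_list)
print(bow.toarray())            # dense bag-of-words matrix
```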
Question Two: Processing a Sentence
⏱️ Time: 15 minutes
Text Preprocessing
- Tokenize the provided sentence: `nltk.word_tokenize(sentence)`
- Stem each token using the Snowball stemmer
- Join the stemmed tokens back into a sentence: `' '.join(stemmed_tokens)`
- Print the processed sentence
Feature Extraction
- Initialize CountVectorizer
- Fit with the processed sentence, wrapped in a list: `[processed_sentence]`
- Extract and print the vocabulary
- Transform the sentence to bag-of-words
- Print the resulting matrix (see the sketch below)
:bulb: CountVectorizer expects a list of documents, so even for a single sentence, wrap it in a list!
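Putting Question Two together, a sketch with a placeholder sentence; `nltk`, `stemmer`, and `CountVectorizer` carry over from Question One:

```python
# Placeholder -- substitute the sentence provided in the notebook
sentence = "The children were running quickly through the fallen leaves"

tokens = nltk.word_tokenize(sentence)
stemmed_tokens = [stemmer.stem(t) for t in tokens]
processed_sentence = ' '.join(stemmed_tokens)
print(processed_sentence)

vectorizer = CountVectorizer()
vectorizer.fit([processed_sentence])   # wrapped in a list, as the tip notes
print(vectorizer.vocabulary_)
print(vectorizer.transform([processed_sentence]).toarray())
```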
Question Three: Processing a Sentence with Punctuation
⏱️ Time: 20 minutes
Text Processing
- Tokenize the sentence
  - Use NLTK's word tokenizer
- Stem each token
  - Apply the stemmer to all tokens
- Remove non-alphanumeric characters
  - Filter tokens using `token.isalnum()`
  - This removes punctuation and special characters
- Join the tokens back into a sentence
  - Use a space as the separator
- Print the cleaned sentence
  - Compare with the original to see what changed (see the sketch below)
:warning: Warning: Removing punctuation can change meaning! Consider whether this is appropriate for your specific use case.
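A sketch of the cleaning steps with a placeholder punctuated sentence; `stemmer` is the instance created earlier:

```python
# Placeholder -- substitute the sentence provided in the notebook
sentence = "Wait... is this really working?! Yes, it is."

tokens = nltk.word_tokenize(sentence)
stemmed = [stemmer.stem(t) for t in tokens]
cleaned_tokens = [t for t in stemmed if t.isalnum()]  # drops '...', '?', ',', etc.
cleaned_sentence = ' '.join(cleaned_tokens)
print(sentence)
print(cleaned_sentence)
```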
Feature Extraction
- Initialize CountVectorizer
  - Create a fresh vectorizer instance
- Fit with the cleaned sentence
  - Remember to use list format
- Extract the vocabulary
  - Print the vocabulary items
- Transform to bag-of-words
  - Apply the transformation
- Print the matrix
  - Display the result (see the sketch below)
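Vectorization mirrors Question Two; a short sketch using `cleaned_sentence` from the step above:

```python
vectorizer = CountVectorizer()        # fresh instance
vectorizer.fit([cleaned_sentence])    # list format again
print(vectorizer.vocabulary_)
print(vectorizer.transform([cleaned_sentence]).toarray())
```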
Advanced Challenges
Challenge One: Processing a List of Sentences
⏱️ Time: 20 minutes
- Create an empty list for the processed sentences: `processed_sentences = []`
- For each sentence:
  - Tokenize using `word_tokenize()`
  - Stem the tokens with your stemmer
  - Remove non-alphanumeric characters using `isalnum()`
  - Join back into a sentence with spaces
  - Append to the results list
- Print the processed sentences
  - Display original vs. processed for comparison
:bulb: Tip: Use a helper function to avoid repeating the same preprocessing steps (see the sketch below).
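One way to structure this with a helper function, as the tip suggests; the function name and sample sentences are placeholders:

```python
def preprocess(sentence):
    """Tokenize, stem, drop non-alphanumeric tokens, and rejoin."""
    tokens = nltk.word_tokenize(sentence)
    stemmed = [stemmer.stem(t) for t in tokens]
    return ' '.join(t for t in stemmed if t.isalnum())

# Placeholder -- substitute the sentence list from the notebook
sentences = ["The cats are sleeping.", "She runs faster than him!"]

processed_sentences = [preprocess(s) for s in sentences]
for original, processed in zip(sentences, processed_sentences):
    print(original, '->', processed)
```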
Feature Extraction
- Initialize CountVectorizer, optionally capping the vocabulary with `max_features`
- Fit with all processed sentences
- Extract the vocabulary
- Transform all sentences to bag-of-words
- Print the matrix; call `.toarray()` to see the dense representation (see the sketch below)
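A sketch of the vectorization; the `max_features` value here is illustrative, not prescribed:

```python
# max_features keeps only the N most frequent terms (value is illustrative)
vectorizer = CountVectorizer(max_features=50)
vectorizer.fit(processed_sentences)
print(vectorizer.vocabulary_)

bow = vectorizer.transform(processed_sentences)
print(bow.toarray())   # one row per sentence
```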
Challenge Two: Processing the Emotion Dataset
⏱️ Time: 20 minutes
- Download and load the emotion dataset
  - Preview the first rows with `.head()`
- Create a text preprocessing function that tokenizes, stems, removes non-alphanumeric tokens, and joins the text back together:
```python
def preprocess_text(text):
    # Tokenize
    tokens = nltk.word_tokenize(text)
    # Stem the tokens
    stemmed = [stemmer.stem(t) for t in tokens]
    # Remove non-alphanumeric tokens
    cleaned = [t for t in stemmed if t.isalnum()]
    # Join back into text
    return ' '.join(cleaned)
```
- Apply the function to the entire dataset using the `.apply()` method on the text column
- Print the first 5 original and processed sentences
:warning: Warning: Large datasets may take time to process. Consider using a subset for testing first!
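A sketch of the loading and apply steps. The filename and the `text` column name are assumptions; match them to the dataset actually used in the notebook:

```python
import pandas as pd

# Assumed filename and column name -- adjust to the notebook's dataset
df = pd.read_csv('emotion_dataset.csv')
print(df.head())

# Optional: test on a subset first, e.g. df = df.head(1000)
df['processed'] = df['text'].apply(preprocess_text)

# First 5 originals alongside their processed versions
print(df[['text', 'processed']].head())
```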
Feature Extraction
- Initialize CountVectorizer
- Fit with all the processed text
- Extract the vocabulary
- Transform the dataset to bag-of-words
- Print the first sample's sparse representation
- Convert and print the first sample as a dense array: use `.toarray()[0]` to see the full vector (see the sketch below)
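A sketch of the final vectorization, continuing from the (assumed) `df['processed']` column above:

```python
vectorizer = CountVectorizer()
vectorizer.fit(df['processed'])
print(len(vectorizer.vocabulary_))   # vocabulary size

bow = vectorizer.transform(df['processed'])
print(bow[0])                # sparse representation of the first sample
print(bow[0].toarray()[0])   # dense vector; avoids densifying the whole matrix
```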
Troubleshooting
NLTK Download Errors
If `nltk.download('punkt')` fails, try:
```python
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
nltk.download('punkt')
```
Memory Issues with Large Datasets
- Use the `max_features` parameter in CountVectorizer
- Consider `TfidfVectorizer` for better memory efficiency
Empty Vocabulary
:information_source: Info: If fitting raises an empty-vocabulary error, check that your processed strings are not empty; aggressive filtering can strip every token from a short document.
:memo: Submission Instructions
Once you've completed all questions and challenges:
- Save your Colab notebook (File -> Save a copy in Drive)
- Generate a shareable link (Share -> Copy link)
- Submit your work through the form below
- Include any observations or challenges you faced
Submission Deadline: Check with your instructor