Practice and reinforce the concepts from Lesson 10
:computer: Activity Type: Coding Exercise
⏱️ Total Time: 90 minutes
Access the exercise notebook here: Text Processing Colab Notebook
:bulb: Make sure you're signed in to your Google account to access the Colab notebook. Create a copy of the notebook before starting.
Question One: Processing a List of Words
⏱️ Time: 15 minutes
Text Preprocessing
- Import the necessary NLTK components
  - Import `nltk` and `SnowballStemmer`
- Download the Punkt tokenizer
  - Use `nltk.download('punkt')`
- Initialize a Snowball stemmer for English
  - Create the stemmer instance: `stemmer = SnowballStemmer('english')`
- Stem each word in the provided list
  - Apply the stemmer to each word using a loop or a list comprehension
- Store and print the stemmed words
  - Save the results in a new list and display them (see the sketch below)
:bulb: Tip: The Snowball stemmer reduces words to their root form. For example, "running" becomes "run".
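A minimal sketch of these steps; the word list here is a stand-in for the one provided in the notebook:

```python
import nltk
from nltk.stem import SnowballStemmer

nltk.download('punkt')  # tokenizer models, also used in later questions

# Placeholder list -- substitute the word list from the notebook
word_list = ['running', 'jumps', 'easily', 'fairly']

stemmer = SnowballStemmer('english')
stemmed_words = [stemmer.stem(word) for word in word_list]
print(stemmed_words)
```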
Feature Extraction
- Import CountVectorizer from sklearn: `from sklearn.feature_extraction.text import CountVectorizer`
- Initialize the vectorizer: `vectorizer = CountVectorizer()`
- Fit the vectorizer with the word list: `vectorizer.fit(word_list)`
- Extract and print the vocabulary: `vectorizer.vocabulary_`
- Transform the words into a bag-of-words matrix: `vectorizer.transform(word_list)`
- Print the resulting matrix (see the sketch below)
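A sketch of the vectorization steps. It fits on `word_list` as the steps state; swap in the stemmed list if that is what the notebook intends:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Each word in the list is treated as its own one-word "document"
vectorizer = CountVectorizer()
vectorizer.fit(word_list)       # or stemmed_words, per the notebook
print(vectorizer.vocabulary_)   # mapping of term -> column index

bow = vectorizer.transform(word_list)
print(bow.toarray())            # dense bag-of-words matrix
```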
Question Two: Processing a Sentence
⏱️ Time: 15 minutes
Text Preprocessing
- Tokenize the provided sentence: `nltk.word_tokenize(sentence)`
- Stem each token using the Snowball stemmer
- Join the stemmed tokens back into a sentence: `' '.join(stemmed_tokens)`
- Print the processed sentence
Feature Extraction
- Initialize CountVectorizer
- Fit with the processed sentence, wrapped in a list: `[processed_sentence]`
- Extract and print the vocabulary
- Transform the sentence to bag-of-words
- Print the resulting matrix (see the sketch below)
:bulb: CountVectorizer expects a list of documents, so even for a single sentence, wrap it in a list!
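Putting Question Two together, a sketch with a placeholder sentence; `nltk`, `stemmer`, and `CountVectorizer` carry over from Question One:

```python
# Placeholder -- substitute the sentence provided in the notebook
sentence = "The children were running quickly through the fallen leaves"

tokens = nltk.word_tokenize(sentence)
stemmed_tokens = [stemmer.stem(t) for t in tokens]
processed_sentence = ' '.join(stemmed_tokens)
print(processed_sentence)

vectorizer = CountVectorizer()
vectorizer.fit([processed_sentence])   # wrapped in a list, as the tip notes
print(vectorizer.vocabulary_)
print(vectorizer.transform([processed_sentence]).toarray())
```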
Question Three: Processing a Sentence with Punctuation
⏱️ Time: 20 minutes
Text Processing
- Tokenize the sentence
  - Use NLTK's word tokenizer
- Stem each token
  - Apply the stemmer to all tokens
- Remove non-alphanumeric characters
  - Filter tokens using `token.isalnum()`
  - This removes punctuation and special characters
- Join the tokens back into a sentence
  - Use a space as the separator
- Print the cleaned sentence
  - Compare with the original to see what changed (see the sketch below)
:warning: Warning: Removing punctuation can change meaning! Consider whether this is appropriate for your specific use case.
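A sketch of the cleaning steps with a placeholder punctuated sentence; `stemmer` is the instance created earlier:

```python
# Placeholder -- substitute the sentence provided in the notebook
sentence = "Wait... is this really working?! Yes, it is."

tokens = nltk.word_tokenize(sentence)
stemmed = [stemmer.stem(t) for t in tokens]
cleaned_tokens = [t for t in stemmed if t.isalnum()]  # drops '...', '?', ',', etc.
cleaned_sentence = ' '.join(cleaned_tokens)
print(sentence)
print(cleaned_sentence)
```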
Feature Extraction
- Initialize CountVectorizer
  - Create a fresh vectorizer instance
- Fit with the cleaned sentence
  - Remember to use list format
- Extract the vocabulary
  - Print the vocabulary items
- Transform to bag-of-words
  - Apply the transformation
- Print the matrix
  - Display the result (see the sketch below)
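Vectorization mirrors Question Two; a short sketch using `cleaned_sentence` from the step above:

```python
vectorizer = CountVectorizer()        # fresh instance
vectorizer.fit([cleaned_sentence])    # list format again
print(vectorizer.vocabulary_)
print(vectorizer.transform([cleaned_sentence]).toarray())
```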
Advanced Challenges
Challenge One: Processing a List of Sentences
⏱️ Time: 20 minutes
- Create an empty list for the processed sentences: `processed_sentences = []`
- For each sentence:
  - Tokenize using `word_tokenize()`
  - Stem the tokens with your stemmer
  - Remove non-alphanumeric characters using `isalnum()`
  - Join back into a sentence with spaces
  - Append to the results list
- Print the processed sentences
  - Display original vs. processed for comparison
:bulb: Tip: Use a helper function to avoid repeating the same preprocessing steps (see the sketch below).
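One way to structure this with a helper function, as the tip suggests; the function name and sample sentences are placeholders:

```python
def preprocess(sentence):
    """Tokenize, stem, drop non-alphanumeric tokens, and rejoin."""
    tokens = nltk.word_tokenize(sentence)
    stemmed = [stemmer.stem(t) for t in tokens]
    return ' '.join(t for t in stemmed if t.isalnum())

# Placeholder -- substitute the sentence list from the notebook
sentences = ["The cats are sleeping.", "She runs faster than him!"]

processed_sentences = [preprocess(s) for s in sentences]
for original, processed in zip(sentences, processed_sentences):
    print(original, '->', processed)
```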
Feature Extraction
- Initialize CountVectorizer, optionally capping the vocabulary with `max_features`
- Fit with all processed sentences
- Extract the vocabulary
- Transform all sentences to bag-of-words
- Print the matrix; call `.toarray()` to see the dense representation (see the sketch below)
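A sketch of the vectorization; the `max_features` value here is illustrative, not prescribed:

```python
# max_features keeps only the N most frequent terms (value is illustrative)
vectorizer = CountVectorizer(max_features=50)
vectorizer.fit(processed_sentences)
print(vectorizer.vocabulary_)

bow = vectorizer.transform(processed_sentences)
print(bow.toarray())   # one row per sentence
```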
Challenge Two: Processing the Emotion Dataset
⏱️ Time: 20 minutes
- Download and load the emotion dataset
  - Preview the first rows with `.head()`
- Create a text preprocessing function that tokenizes, stems, removes non-alphanumeric tokens, and joins the text back together:
```python
def preprocess_text(text):
    # Tokenize
    tokens = nltk.word_tokenize(text)
    # Stem the tokens
    stemmed = [stemmer.stem(t) for t in tokens]
    # Remove non-alphanumeric tokens
    cleaned = [t for t in stemmed if t.isalnum()]
    # Join back into text
    return ' '.join(cleaned)
```
- Apply the function to the entire dataset using the `.apply()` method on the text column
- Print the first 5 original and processed sentences
:warning: Warning: Large datasets may take time to process. Consider using a subset for testing first!
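A sketch of the loading and apply steps. The filename and the `text` column name are assumptions; match them to the dataset actually used in the notebook:

```python
import pandas as pd

# Assumed filename and column name -- adjust to the notebook's dataset
df = pd.read_csv('emotion_dataset.csv')
print(df.head())

# Optional: test on a subset first, e.g. df = df.head(1000)
df['processed'] = df['text'].apply(preprocess_text)

# First 5 originals alongside their processed versions
print(df[['text', 'processed']].head())
```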
Feature Extraction
- Initialize CountVectorizer
- Fit with all the processed text
- Extract the vocabulary
- Transform the dataset to bag-of-words
- Print the first sample's sparse representation
- Convert and print the first sample as a dense array: use `.toarray()[0]` to see the full vector (see the sketch below)
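A sketch of the final vectorization, continuing from the (assumed) `df['processed']` column above:

```python
vectorizer = CountVectorizer()
vectorizer.fit(df['processed'])
print(len(vectorizer.vocabulary_))   # vocabulary size

bow = vectorizer.transform(df['processed'])
print(bow[0])                # sparse representation of the first sample
print(bow[0].toarray()[0])   # dense vector; avoids densifying the whole matrix
```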
Troubleshooting
NLTK Download Errors
If `nltk.download('punkt')` fails, try:
```python
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
nltk.download('punkt')
```
Memory Issues with Large Datasets
- Use the `max_features` parameter in CountVectorizer
- Consider `TfidfVectorizer` for better memory efficiency
Empty Vocabulary
:information_source: Info: If fitting raises an empty-vocabulary error, check that your processed strings are not empty; aggressive filtering can strip every token from a short document.
:memo: Submission Instructions
Once you've completed all questions and challenges:
- Save your Colab notebook (File -> Save a copy in Drive)
- Generate a shareable link (Share -> Copy link)
- Submit your work through the form below
- Include any observations or challenges you faced
Submission Deadline: Check with your instructor