Student starter code (30% baseline)
- index.html - Main HTML page
- script.js - JavaScript logic
- styles.css - Styling and layout
- package.json - Dependencies
- setup.sh - Setup script
- README.md - Instructions (below)

💡 Download the ZIP, extract it, and follow the instructions below to get started!
By the end of this activity, you will be able to build a complete text preprocessing pipeline and extract Bag-of-Words features from raw text.

Working files:
- Open v1-baseline-100percent.ipynb in Google Colab (the complete reference solution)
- Open activity-10-text-processing.ipynb in Google Colab (your working notebook)
- Consult v1-baseline-100percent.ipynb when stuck

✅ Core NLP Setup
✅ Working Examples (24 cells ready to run):
| TODO | Task | Difficulty | Key Concepts |
|---|---|---|---|
| 1 | Import and Initialize CountVectorizer | Easy | scikit-learn text tools |
| 2 | List Vocabulary Features | Easy | get_feature_names_out() |
| 3 | Transform to Bag-of-Words | Easy | .transform(), .toarray() |
Learning Focus: Master the Bag-of-Words representation for text data.
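For reference, here is a minimal sketch of TODOs 1-3; the toy corpus is illustrative, not from the activity data:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat", "the dog sat on the mat"]  # illustrative corpus

vectorizer = CountVectorizer()
vectorizer.fit(corpus)                     # TODO 1: build the vocabulary

print(vectorizer.get_feature_names_out())  # TODO 2: ['cat' 'dog' 'mat' 'on' 'sat' 'the']

bow = vectorizer.transform(corpus)         # TODO 3: transform to Bag-of-Words
print(bow.toarray())
# [[1 0 0 0 1 1]
#  [0 1 1 1 1 2]]
```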
| TODO | Task | Difficulty | Key Concepts |
|---|---|---|---|
| 4 | Setup Text Processing Tools | Medium | NLTK initialization |
| 5 | Tokenize and Stem Sentence | Medium | Multi-step pipeline |
| 6 | Display Sentence Vocabulary | Medium | Feature extraction |
Learning Focus: Build a complete preprocessing pipeline for sentences.
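A minimal sketch of this pipeline (the sentence and its stems are illustrative):

```python
import nltk
from nltk.stem import snowball

nltk.download('punkt')  # TODO 4: one-time download of the tokenizer models

sentence = "The cats are running quickly."  # illustrative input

tokens = nltk.word_tokenize(sentence)        # TODO 5: tokenize...
stemmer = snowball.SnowballStemmer("english")
stemmed = [stemmer.stem(t) for t in tokens]  # ...then stem each token
print(stemmed)               # e.g. ['the', 'cat', 'are', 'run', 'quick', '.']

print(sorted(set(stemmed)))  # TODO 6: the sentence's (stemmed) vocabulary
```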
| TODO | Task | Difficulty | Key Concepts |
|---|---|---|---|
| 7 | Remove Punctuation and Stem | Medium | isalnum() filtering |
| 8 | Extract Clean Vocabulary | Medium | Clean text features |
Learning Focus: Handle real-world text with special characters and noise.
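One way to approach TODOs 7-8, using str.isalnum() to drop punctuation; the token list is an illustrative stand-in for real pipeline output:

```python
# Tokens as they might look after tokenizing and stemming a noisy sentence
stemmed = ['the', 'cat', "n't", 'run', '!', 'fast', '.']  # illustrative

# TODO 7: keep only purely alphanumeric tokens; '!', '.', and "n't" all fail
clean = [t for t in stemmed if t.isalnum()]
print(clean)               # ['the', 'cat', 'run', 'fast']

print(sorted(set(clean)))  # TODO 8: the cleaned vocabulary
```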
| TODO | Task | Difficulty | Key Concepts |
|---|---|---|---|
| 9 | Display Preprocessed Sentences | Hard | Batch processing |
| 10 | Show Combined Vocabulary | Hard | Corpus-level features |
Learning Focus: Scale preprocessing to handle document collections.
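A sketch of scaling the pipeline to a small corpus; the preprocess helper and the corpus are illustrative, not from the starter code:

```python
import nltk
from nltk.stem import snowball
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('punkt')
stemmer = snowball.SnowballStemmer("english")

def preprocess(sentence):
    """Tokenize, stem, and clean one sentence; return a single string."""
    tokens = nltk.word_tokenize(sentence)
    stemmed = [stemmer.stem(t) for t in tokens]
    return " ".join(t for t in stemmed if t.isalnum())

corpus = ["Dogs are barking!", "The dog barked loudly."]  # illustrative corpus

processed = [preprocess(s) for s in corpus]  # TODO 9: preprocess every sentence
print(processed)                             # e.g. ['dog are bark', 'the dog bark loud']

vectorizer = CountVectorizer()
vectorizer.fit(processed)                    # TODO 10: vocabulary over the whole corpus
print(vectorizer.get_feature_names_out())    # e.g. ['are' 'bark' 'dog' 'loud' 'the']
```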
| TODO | Task | Difficulty | Key Concepts |
|---|---|---|---|
| 11 | Convert to Array Format | Hard | Sparse matrix handling |
Learning Focus: Work with production-scale text datasets (emotion classification).
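TODO 11 is mostly about the return type: fit_transform yields a SciPy sparse matrix, and .toarray() densifies it. A minimal sketch with an illustrative, already-preprocessed corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["dog are bark", "the dog bark loud"]  # illustrative, already preprocessed

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)  # a SciPy sparse matrix, not a NumPy array

print(bow.shape)       # (2, 5): 2 documents x 5 vocabulary words
dense = bow.toarray()  # convert to a dense NumPy array for printing and inspection
print(dense)
```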
You've successfully completed this activity when all 11 TODOs run without errors and produce the expected outputs.

Reference pipeline:
```python
import nltk
from nltk.stem import snowball
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('punkt')  # tokenizer models; newer NLTK releases may also need 'punkt_tab'

sentence = "Text processing, made simple!"  # example input

# 1. Tokenization
tokens = nltk.word_tokenize(sentence)

# 2. Stemming
stemmer = snowball.SnowballStemmer("english")
stemmed = [stemmer.stem(token) for token in tokens]

# 3. Cleaning (keep only alphanumeric tokens)
clean = [t for t in stemmed if t.isalnum()]

# 4. Feature Extraction
text = " ".join(clean)
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform([text])
```
BoW converts text to numerical vectors by building a vocabulary of unique words and counting how many times each vocabulary word appears in a document.

Example:
- Vocabulary: ['hello', 'world', 'python']
- "hello world" → [1, 1, 0] (1 hello, 1 world, 0 python)
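You can verify this mapping yourself; note that passing vocabulary= pins the column order shown above, whereas a fitted CountVectorizer sorts its vocabulary alphabetically:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fix the vocabulary so the columns match the example above
vectorizer = CountVectorizer(vocabulary=['hello', 'world', 'python'])

print(vectorizer.get_feature_names_out())               # ['hello' 'world' 'python']
print(vectorizer.transform(["hello world"]).toarray())  # [[1 1 0]]
```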
Ready for more? Try these advanced tasks:
- Visualize the corpus vocabulary with the wordcloud library

Troubleshooting:

| Issue | Solution |
|---|---|
| `LookupError: Resource punkt not found` | Run `nltk.download('punkt')` before tokenizing |
| Text still has punctuation | Check the `isalnum()` logic - remove tokens that fail the test |
| BoW array is all zeros | Ensure you fit the vectorizer before transforming |
| Vocabulary is empty | Verify the text was preprocessed before fitting |
| "Sparse matrix" error | Use `.toarray()` to convert to a dense NumPy array |
Sample:
Document: "i didnt feel humiliated"
Emotion: sadness
Before submitting, verify that all 11 TODOs are complete and your notebook runs top to bottom without errors.
Estimated Time: 75-105 minutes
Difficulty: Intermediate to Advanced
Prerequisites: Python basics, Pandas, NumPy, basic NLP concepts
Need Help? Review the baseline notebook or consult the instructor!