Student starter code (30% baseline)
- index.html - Main HTML page
- script.js - JavaScript logic
- styles.css - Styling and layout
- package.json - Dependencies
- setup.sh - Setup script
- README.md - Instructions (below)

💡 Download the ZIP, extract it, and follow the instructions below to get started!
By the end of this activity, you will be able to build a complete text preprocessing pipeline and extract Bag-of-Words features from raw text.

Working files:
- Open v1-baseline-100percent.ipynb in Google Colab (the complete reference solution)
- Open activity-10-text-processing.ipynb in Google Colab (your working notebook)
- Consult v1-baseline-100percent.ipynb when stuck

✅ Core NLP Setup
✅ Working Examples (24 cells ready to run):
| TODO | Task | Difficulty | Key Concepts |
|---|---|---|---|
| 1 | Import and Initialize CountVectorizer | Easy | scikit-learn text tools |
| 2 | List Vocabulary Features | Easy | get_feature_names_out() |
| 3 | Transform to Bag-of-Words | Easy | .transform(), .toarray() |
Learning Focus: Master the Bag-of-Words representation for text data.
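For reference, here is a minimal sketch of TODOs 1-3; the toy corpus is illustrative, not from the activity data:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat", "the dog sat on the mat"]  # illustrative corpus

vectorizer = CountVectorizer()
vectorizer.fit(corpus)                     # TODO 1: build the vocabulary

print(vectorizer.get_feature_names_out())  # TODO 2: ['cat' 'dog' 'mat' 'on' 'sat' 'the']

bow = vectorizer.transform(corpus)         # TODO 3: transform to Bag-of-Words
print(bow.toarray())
# [[1 0 0 0 1 1]
#  [0 1 1 1 1 2]]
```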
| TODO | Task | Difficulty | Key Concepts |
|---|---|---|---|
| 4 | Setup Text Processing Tools | Medium | NLTK initialization |
| 5 | Tokenize and Stem Sentence | Medium | Multi-step pipeline |
| 6 | Display Sentence Vocabulary | Medium | Feature extraction |
Learning Focus: Build a complete preprocessing pipeline for sentences.
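A minimal sketch of this pipeline (the sentence and its stems are illustrative):

```python
import nltk
from nltk.stem import snowball

nltk.download('punkt')  # TODO 4: one-time download of the tokenizer models

sentence = "The cats are running quickly."  # illustrative input

tokens = nltk.word_tokenize(sentence)        # TODO 5: tokenize...
stemmer = snowball.SnowballStemmer("english")
stemmed = [stemmer.stem(t) for t in tokens]  # ...then stem each token
print(stemmed)               # e.g. ['the', 'cat', 'are', 'run', 'quick', '.']

print(sorted(set(stemmed)))  # TODO 6: the sentence's (stemmed) vocabulary
```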
| TODO | Task | Difficulty | Key Concepts |
|---|---|---|---|
| 7 | Remove Punctuation and Stem | Medium | isalnum() filtering |
| 8 | Extract Clean Vocabulary | Medium | Clean text features |
Learning Focus: Handle real-world text with special characters and noise.
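One way to approach TODOs 7-8, using str.isalnum() to drop punctuation; the token list is an illustrative stand-in for real pipeline output:

```python
# Tokens as they might look after tokenizing and stemming a noisy sentence
stemmed = ['the', 'cat', "n't", 'run', '!', 'fast', '.']  # illustrative

# TODO 7: keep only purely alphanumeric tokens; '!', '.', and "n't" all fail
clean = [t for t in stemmed if t.isalnum()]
print(clean)               # ['the', 'cat', 'run', 'fast']

print(sorted(set(clean)))  # TODO 8: the cleaned vocabulary
```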
| TODO | Task | Difficulty | Key Concepts |
|---|---|---|---|
| 9 | Display Preprocessed Sentences | Hard | Batch processing |
| 10 | Show Combined Vocabulary | Hard | Corpus-level features |
Learning Focus: Scale preprocessing to handle document collections.
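A sketch of scaling the pipeline to a small corpus; the preprocess helper and the corpus are illustrative, not from the starter code:

```python
import nltk
from nltk.stem import snowball
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('punkt')
stemmer = snowball.SnowballStemmer("english")

def preprocess(sentence):
    """Tokenize, stem, and clean one sentence; return a single string."""
    tokens = nltk.word_tokenize(sentence)
    stemmed = [stemmer.stem(t) for t in tokens]
    return " ".join(t for t in stemmed if t.isalnum())

corpus = ["Dogs are barking!", "The dog barked loudly."]  # illustrative corpus

processed = [preprocess(s) for s in corpus]  # TODO 9: preprocess every sentence
print(processed)                             # e.g. ['dog are bark', 'the dog bark loud']

vectorizer = CountVectorizer()
vectorizer.fit(processed)                    # TODO 10: vocabulary over the whole corpus
print(vectorizer.get_feature_names_out())    # e.g. ['are' 'bark' 'dog' 'loud' 'the']
```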
| TODO | Task | Difficulty | Key Concepts |
|---|---|---|---|
| 11 | Convert to Array Format | Hard | Sparse matrix handling |
Learning Focus: Work with production-scale text datasets (emotion classification).
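TODO 11 is mostly about the return type: fit_transform yields a SciPy sparse matrix, and .toarray() densifies it. A minimal sketch with an illustrative, already-preprocessed corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["dog are bark", "the dog bark loud"]  # illustrative, already preprocessed

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)  # a SciPy sparse matrix, not a NumPy array

print(bow.shape)       # (2, 5): 2 documents x 5 vocabulary words
dense = bow.toarray()  # convert to a dense NumPy array for printing and inspection
print(dense)
```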
You've successfully completed this activity when all 11 TODOs run without errors and produce the expected outputs.

Reference pipeline:
```python
import nltk
from nltk.stem import snowball
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('punkt')  # tokenizer models; newer NLTK releases may also need 'punkt_tab'

sentence = "Text processing, made simple!"  # example input

# 1. Tokenization
tokens = nltk.word_tokenize(sentence)

# 2. Stemming
stemmer = snowball.SnowballStemmer("english")
stemmed = [stemmer.stem(token) for token in tokens]

# 3. Cleaning (keep only alphanumeric tokens)
clean = [t for t in stemmed if t.isalnum()]

# 4. Feature Extraction
text = " ".join(clean)
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform([text])
```
BoW converts text to numerical vectors by building a vocabulary of unique words and counting how many times each vocabulary word appears in a document.

Example:
- Vocabulary: ['hello', 'world', 'python']
- "hello world" → [1, 1, 0] (1 hello, 1 world, 0 python)
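You can verify this mapping yourself; note that passing vocabulary= pins the column order shown above, whereas a fitted CountVectorizer sorts its vocabulary alphabetically:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fix the vocabulary so the columns match the example above
vectorizer = CountVectorizer(vocabulary=['hello', 'world', 'python'])

print(vectorizer.get_feature_names_out())               # ['hello' 'world' 'python']
print(vectorizer.transform(["hello world"]).toarray())  # [[1 1 0]]
```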
Ready for more? Try these advanced tasks:
- Visualize the corpus vocabulary with the wordcloud library

Troubleshooting:

| Issue | Solution |
|---|---|
| `LookupError: Resource punkt not found` | Run `nltk.download('punkt')` before tokenizing |
| Text still has punctuation | Check the `isalnum()` logic - remove tokens that fail the test |
| BoW array is all zeros | Ensure you fit the vectorizer before transforming |
| Vocabulary is empty | Verify the text was preprocessed before fitting |
| "Sparse matrix" error | Use `.toarray()` to convert to a dense NumPy array |
Sample:
Document: "i didnt feel humiliated"
Emotion: sadness
Before submitting, verify that all 11 TODOs are complete and your notebook runs top to bottom without errors.
Estimated Time: 75-105 minutes
Difficulty: Intermediate to Advanced
Prerequisites: Python basics, Pandas, NumPy, basic NLP concepts
Need Help? Review the baseline notebook or consult the instructor!