By the end of this lesson, you will be able to:
:information_source: Text processing converts text into a form that machines can understand: it transforms human language into numbers that computers can work with.
Before we learn the meaning behind text, we need to help computers recognize and understand our words.
Computers can't understand text the way we do:
:memo: Think of it this way: If you only spoke English and someone wrote to you in Chinese, you'd need a translator. Text processing is like translating human language into computer language!
That's why we need to convert text into numbers before computers can work with it.
:memo: The Two-Step Process
Text processing happens in two main phases:
- Text Preprocessing - Cleaning and simplifying the text
- Feature Extraction - Converting text into numbers
:emoji: Text Preprocessing
Language is amazing but complex! Many words mean the same thing but look different:
- eat (present) and ate (past)
- learn (simple) and learning (continuous)
Before training our model, we need to simplify these words so computers can understand them better.
The Four Steps of Text Preprocessing:
- Word Tokenization - Breaking sentences into individual words
- Word Stemming - Removing word endings to find the root
- Removing special characters - Getting rid of punctuation like "?" and "!"
- Joining words back - Putting the cleaned words back together
Here's how the whole process flows:
:emoji: Word Tokenization
Word tokenization breaks sentences into individual words called tokens. It's like cutting a sentence with scissors at each space!
Sentence | Tokens (Individual Words)
---|---
I walked to school yesterday | "I", "walked", "to", "school", "yesterday"
He is sleeping | "He", "is", "sleeping"
That was amazing | "That", "was", "amazing"

:bulb: Think of tokenization like breaking a chocolate bar into individual pieces - each piece (token) can be examined separately!
These tokens become the building blocks for the next step.
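As a rough illustration, a sentence can be split on spaces in plain Python. This is only a sketch of the idea - a real tokenizer (like NLTK's, used later in this lesson) also separates punctuation into its own tokens:

```python
# Naive whitespace tokenization - just a sketch of the idea.
# Real tokenizers also split off punctuation marks like "!" and "?".
sentence = "I walked to school yesterday"
tokens = sentence.split()
print(tokens)  # ['I', 'walked', 'to', 'school', 'yesterday']
```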
:emoji: Word Stemming
Word stemming removes word endings (suffixes) or beginnings (prefixes) to find the root word. It's like peeling an onion to get to the core!
Original Word | Suffix Removed | Root Word
---|---|---
walked | -ed | walk
sleeping | -ing | sleep
amazing | -ing | amaz
:memo: Sometimes stemming creates words that don't exist (like "amaz"), but that's okay! The computer just needs to recognize that "amazing" and "amaz" mean similar things.
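To make the idea concrete, here is a toy suffix-stripping stemmer. It is a deliberate simplification - real stemmers like NLTK's SnowballStemmer (used later in this lesson) apply many more rules and exceptions:

```python
def toy_stem(word):
    # Toy stemmer: strip a couple of common suffixes.
    # The length check avoids mangling very short words.
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(toy_stem("walked"))    # walk
print(toy_stem("sleeping"))  # sleep
print(toy_stem("amazing"))   # amaz
```

Notice that it produces "amaz" for "amazing" - exactly the kind of made-up root word described above.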
We apply stemming to each token separately:
Original Tokens | Stemmed Tokens
---|---
"I", "walked", "to", "school", "yesterday" | "I", "walk", "to", "school", "yesterday"
"He", "is", "sleeping" | "He", "is", "sleep"
"That", "was", "amazing" | "That", "was", "amaz"

:emoji: Feature Extraction
Now comes the magic part - turning words into numbers!
Each word in our text becomes a feature (like a characteristic) that helps the computer understand the text.
:emoji: Bag of Words Method
The Bag of Words is one of the simplest ways to convert text into numbers. Here's how it works:
Step 1: Create a Vocabulary

First, we list all unique words in our text: "this", "beautiful", "day", "night", "evening"
Step 2: Count Word Appearances

Next, we count how many times each word appears in each sentence:

Step 3: Create Number Vectors

Finally, we represent each sentence as a list of numbers:
Sentence | Bag of Words Vector | What it means
---|---|---
This beautiful day. | [1, 1, 1, 0, 0] | Has "this", "beautiful", "day"
This beautiful night. | [1, 1, 0, 1, 0] | Has "this", "beautiful", "night"
This beautiful evening. | [1, 1, 0, 0, 1] | Has "this", "beautiful", "evening"

:bulb: The numbers show how many times each word from our vocabulary appears. A 0 means the word isn't in that sentence, while 1 means it appears once!
Success! We've converted text into numbers that computers can understand. This numerical format is called a bag of words.
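The three steps above can be sketched in plain Python. This is a hand-rolled version for illustration only - scikit-learn's CountVectorizer, introduced later, does all of this for us:

```python
import string

sentences = ["This beautiful day.", "This beautiful night.", "This beautiful evening."]

def words_of(sentence):
    # Lowercase and strip punctuation, then split on spaces
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

# Step 1: build the vocabulary (unique words, in order of first appearance)
vocab = []
for sentence in sentences:
    for word in words_of(sentence):
        if word not in vocab:
            vocab.append(word)

# Steps 2-3: count each vocabulary word in each sentence to form a vector
vectors = [[words_of(s).count(word) for word in vocab] for s in sentences]

print(vocab)    # ['this', 'beautiful', 'day', 'night', 'evening']
print(vectors)  # [[1, 1, 1, 0, 0], [1, 1, 0, 1, 0], [1, 1, 0, 0, 1]]
```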
To learn more about bag of words, you can refer to the website: https://machinelearningmastery.com/gentle-introduction-bag-words-model/
:information_source: Natural Language Toolkit (NLTK) is a Python package that helps us process text. It's like a toolbox full of text-processing tools!
Let's learn how to use NLTK to perform all our text preprocessing steps.
For Google Colab users:

```shell
pip install nltk
```

For VSCode users:

```shell
py -m pip install nltk
```

For MacOS users:

```shell
python3 -m pip install nltk
```
```python
import nltk
```
First, download the Punkt Tokenizer (a special tool for breaking sentences):
```python
nltk.download('punkt')
```
Now you can tokenize any sentence:
```python
tokens = nltk.word_tokenize('He is sleeping')
```
Expected output:
```python
['He', 'is', 'sleeping']
```
:bulb: The sentence is now split into individual words - perfect for the next step!
:emoji: Word Stemming with NLTK
Import the SnowballStemmer (our word-trimming tool):
```python
from nltk.stem import snowball
```
Create a stemmer for English words:
```python
snowballStemmer = snowball.SnowballStemmer("english")
```
Now stem your tokens (make them lowercase first for consistency):
```python
stem_tokens = []
for token in tokens:
    stem_tokens.append(snowballStemmer.stem(token.lower()))
```
:memo: We use .lower() to make all words lowercase. This ensures "Walking" and "walking" are treated as the same word!
We need to clean up our tokens by removing punctuation marks and special characters:
One simple step:

```python
# Keep only tokens made of letters and numbers (drops "!", "?", etc.)
stem_tokens = [token for token in stem_tokens if token.isalnum()]
```
:bulb: The .isalnum() method checks if a token contains only letters and numbers. If it contains punctuation like "!" or "?", we drop it!
:link: Joining Words Back Together
Finally, put all the cleaned tokens back into a sentence:
```python
original_sentence = " ".join(stem_tokens)
```
This creates a clean, processed sentence that's ready for the computer to understand!
:dart: Implementing Bag of Words with CountVectorizer
Scikit-Learn gives us a powerful tool called CountVectorizer that automatically converts sentences into bag of words!
:information_source: CountVectorizer does all the hard work of creating vocabularies and converting text to numbers for us!
Getting Started with CountVectorizer
Import and create the vectorizer:
```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
```
Building a Vocabulary
The vectorizer needs to learn what words exist in your data:
```python
vectorizer.fit(train_data)
```
For single sentences or word lists, use array format:
```python
vectorizer.fit(['He is sleeping'])
```

```python
vectorizer.fit(['He', 'is', 'sleeping'])
```
Viewing the Vocabulary
See all the words the vectorizer learned:
```python
vectorizer.get_feature_names_out()
```
Converting Text to Numbers
Transform your text into bag of words vectors:
```python
# Transform training and testing data
train_data_bow = vectorizer.transform(train_data)
test_data_bow = vectorizer.transform(test_data)
```
For single sentences:
```python
sentence_bow = vectorizer.transform(['He is sleeping'])
```
Viewing the Results
See the actual numbers by converting to an array:
```python
print(train_data_bow.toarray())
```
:memo: Remember: "bow" stands for "bag of words" - it's our text converted into numbers!
Let's recap what we've learned about text processing:
Now you're ready to help computers understand human language!
Code with AI: Try using AI to explore text processing methods.
Prompts:
Test your understanding with these activities:
Quick Check: Why do we need to convert text into numbers for computers?
Hands-On Practice:
Coding Challenge:
Think About It: How might text processing help with:
Remember: Practice makes perfect! Try processing different types of text to see how the techniques work.