By the end of this lesson, you will be able to:
:information_source: Text processing converts text into a form that machines can understand: it transforms human language into numbers that computers can work with.
Before we learn the meaning behind text, we need to help computers recognize and understand our words.
Computers can't understand text the way we do:
:memo: Think of it this way: If you only spoke English and someone wrote to you in Chinese, you'd need a translator. Text processing is like translating human language into computer language!
That's why we need to convert text into numbers before computers can work with it.
:memo: The Two-Step Process
Text processing happens in two main phases:
- Text Preprocessing - Cleaning and simplifying the text
- Feature Extraction - Converting text into numbers
:emoji: Text Preprocessing
Language is amazing but complex! Many words mean the same thing but look different:
- eat (present) and ate (past)
- learn (simple) and learning (continuous)
Before training our model, we need to simplify these words so computers can understand them better.
The Four Steps of Text Preprocessing:
- Word Tokenization - Breaking sentences into individual words
- Word Stemming - Removing word endings to find the root
- Removing special characters - Getting rid of punctuation like "?" and "!"
- Joining words back - Putting the cleaned words back together
Here's how the whole process flows:
:emoji: Word Tokenization
Word tokenization breaks sentences into individual words called tokens. It's like cutting a sentence with scissors at each space!
Sentence | Tokens (Individual Words)
---|---
I walked to school yesterday | "I", "walked", "to", "school", "yesterday"
He is sleeping | "He", "is", "sleeping"
That was amazing | "That", "was", "amazing"

:bulb: Think of tokenization like breaking a chocolate bar into individual pieces - each piece (token) can be examined separately!
These tokens become the building blocks for the next step.
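As a rough illustration, a sentence can be split on spaces in plain Python. This is only a sketch of the idea - a real tokenizer (like NLTK's, used later in this lesson) also separates punctuation into its own tokens:

```python
# Naive whitespace tokenization - just a sketch of the idea.
# Real tokenizers also split off punctuation marks like "!" and "?".
sentence = "I walked to school yesterday"
tokens = sentence.split()
print(tokens)  # ['I', 'walked', 'to', 'school', 'yesterday']
```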
:emoji: Word Stemming
Word stemming removes word endings (suffixes) or beginnings (prefixes) to find the root word. It's like peeling an onion to get to the core!
Original Word | Suffix Removed | Root Word
---|---|---
walked | -ed | walk
sleeping | -ing | sleep
amazing | -ing | amaz
:memo: Sometimes stemming creates words that don't exist (like "amaz"), but that's okay! The computer just needs to recognize that "amazing" and "amaz" mean similar things.
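To make the idea concrete, here is a toy suffix-stripping stemmer. It is a deliberate simplification - real stemmers like NLTK's SnowballStemmer (used later in this lesson) apply many more rules and exceptions:

```python
def toy_stem(word):
    # Toy stemmer: strip a couple of common suffixes.
    # The length check avoids mangling very short words.
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(toy_stem("walked"))    # walk
print(toy_stem("sleeping"))  # sleep
print(toy_stem("amazing"))   # amaz
```

Notice that it produces "amaz" for "amazing" - exactly the kind of made-up root word described above.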
We apply stemming to each token separately:
Original Tokens | Stemmed Tokens
---|---
"I", "walked", "to", "school", "yesterday" | "I", "walk", "to", "school", "yesterday"
"He", "is", "sleeping" | "He", "is", "sleep"
"That", "was", "amazing" | "That", "was", "amaz"

:emoji: Feature Extraction
Now comes the magic part - turning words into numbers!
Each word in our text becomes a feature (like a characteristic) that helps the computer understand the text.
:emoji: Bag of Words Method
The Bag of Words is one of the simplest ways to convert text into numbers. Here's how it works:
Step 1: Create a Vocabulary

First, we list all unique words in our text: "this", "beautiful", "day", "night", "evening"
Step 2: Count Word Appearances

Next, we count how many times each word appears in each sentence:

Step 3: Create Number Vectors

Finally, we represent each sentence as a list of numbers:
Sentence | Bag of Words Vector | What it means
---|---|---
This beautiful day. | [1, 1, 1, 0, 0] | Has "this", "beautiful", "day"
This beautiful night. | [1, 1, 0, 1, 0] | Has "this", "beautiful", "night"
This beautiful evening. | [1, 1, 0, 0, 1] | Has "this", "beautiful", "evening"

:bulb: The numbers show how many times each word from our vocabulary appears. A 0 means the word isn't in that sentence, while 1 means it appears once!
Success! We've converted text into numbers that computers can understand. This numerical format is called a bag of words.
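The three steps above can be sketched in plain Python. This is a hand-rolled version for illustration only - scikit-learn's CountVectorizer, introduced later, does all of this for us:

```python
import string

sentences = ["This beautiful day.", "This beautiful night.", "This beautiful evening."]

def words_of(sentence):
    # Lowercase and strip punctuation, then split on spaces
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

# Step 1: build the vocabulary (unique words, in order of first appearance)
vocab = []
for sentence in sentences:
    for word in words_of(sentence):
        if word not in vocab:
            vocab.append(word)

# Steps 2-3: count each vocabulary word in each sentence to form a vector
vectors = [[words_of(s).count(word) for word in vocab] for s in sentences]

print(vocab)    # ['this', 'beautiful', 'day', 'night', 'evening']
print(vectors)  # [[1, 1, 1, 0, 0], [1, 1, 0, 1, 0], [1, 1, 0, 0, 1]]
```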
To learn more about bag of words, you can refer to the website: https://machinelearningmastery.com/gentle-introduction-bag-words-model/
:information_source: Natural Language Toolkit (NLTK) is a Python package that helps us process text. It's like a toolbox full of text-processing tools!
Let's learn how to use NLTK to perform all our text preprocessing steps.
For Google Colab users:

```shell
pip install nltk
```

For VSCode users:

```shell
py -m pip install nltk
```

For MacOS users:

```shell
python3 -m pip install nltk
```
```python
import nltk
```
First, download the Punkt Tokenizer (a special tool for breaking sentences):
```python
nltk.download('punkt')
```
Now you can tokenize any sentence:
```python
tokens = nltk.word_tokenize('He is sleeping')
```
Expected output:
```python
['He', 'is', 'sleeping']
```
:bulb: The sentence is now split into individual words - perfect for the next step!
:emoji: Word Stemming with NLTK
Import the SnowballStemmer (our word-trimming tool):
```python
from nltk.stem import snowball
```
Create a stemmer for English words:
```python
snowballStemmer = snowball.SnowballStemmer("english")
```
Now stem your tokens (make them lowercase first for consistency):
```python
stem_tokens = []
for token in tokens:
    stem_tokens.append(snowballStemmer.stem(token.lower()))
```
:memo: We use .lower() to make all words lowercase. This ensures "Walking" and "walking" are treated as the same word!
We need to clean up our tokens by removing punctuation marks and special characters:
One simple step:

```python
# Keep only tokens made of letters and numbers (drops "!", "?", etc.)
stem_tokens = [token for token in stem_tokens if token.isalnum()]
```
:bulb: The .isalnum() method checks if a token contains only letters and numbers. If it contains punctuation like "!" or "?", we drop it!
:link: Joining Words Back Together
Finally, put all the cleaned tokens back into a sentence:
```python
original_sentence = " ".join(stem_tokens)
```
This creates a clean, processed sentence that's ready for the computer to understand!
:dart: Implementing Bag of Words with CountVectorizer
Scikit-Learn gives us a powerful tool called CountVectorizer that automatically converts sentences into bag of words!
:information_source: CountVectorizer does all the hard work of creating vocabularies and converting text to numbers for us!
Getting Started with CountVectorizer
Import and create the vectorizer:
```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
```
Building a Vocabulary
The vectorizer needs to learn what words exist in your data:
```python
vectorizer.fit(train_data)
```
For single sentences or word lists, use array format:
```python
vectorizer.fit(['He is sleeping'])
```

```python
vectorizer.fit(['He', 'is', 'sleeping'])
```
Viewing the Vocabulary
See all the words the vectorizer learned:
```python
vectorizer.get_feature_names_out()
```
Converting Text to Numbers
Transform your text into bag of words vectors:
```python
# Transform training and testing data
train_data_bow = vectorizer.transform(train_data)
test_data_bow = vectorizer.transform(test_data)
```
For single sentences:
```python
sentence_bow = vectorizer.transform(['He is sleeping'])
```
Viewing the Results
See the actual numbers by converting to an array:
```python
print(train_data_bow.toarray())
```
:memo: Remember: "bow" stands for "bag of words" - it's our text converted into numbers!
Let's recap what we've learned about text processing:
Now you're ready to help computers understand human language!
Code with AI: Try using AI to explore text processing methods.
Prompts:
Test your understanding with these activities:
Quick Check: Why do we need to convert text into numbers for computers?
Hands-On Practice:
Coding Challenge:
Think About It: How might text processing help with:
Remember: Practice makes perfect! Try processing different types of text to see how the techniques work.