Lab: Tokenization and Pre-processing#

In today’s lab lecture, we will explore the concepts of text tokenization and pre-processing in the field of Natural Language Processing (NLP). We will look at how these steps work and why they matter, and walk through practical examples in different languages.

The lecture is divided into four main sections:

  1. Understanding Tokenization

  2. Different Types of Tokenization with Examples

  3. Pre-processing Steps in NLP

  4. Tokenization in Different Languages with Examples

Part 1: Understanding Tokenization#

Tokenization is a crucial initial step in NLP, where text data is divided into smaller pieces, or “tokens”. These tokens typically represent words, but can also be sentences, phrases, or even individual characters.

Why is tokenization essential?

Computers process information in a structured manner, so raw, unstructured text is difficult for them to work with directly. Tokenization allows us to convert unstructured text into a format more suitable for machine processing.
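
As a minimal illustration (plain Python, no NLP library involved), here is the same short string tokenized at the word level and at the character level:

text = "NLP is fun"

# Word-level tokens: a deliberately naive whitespace split
word_tokens = text.split()
print(word_tokens)  # ['NLP', 'is', 'fun']

# Character-level tokens: every character, including spaces, becomes a token
char_tokens = list(text)
print(char_tokens)  # ['N', 'L', 'P', ' ', 'i', 's', ' ', 'f', 'u', 'n']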

Part 2: Different Types of Tokenization with Examples#

Depending on the specific requirements of an NLP task, tokenization can occur at different levels:

Word Tokenization#

Word tokenization involves breaking the text into individual words. Let’s look at an example using Python’s NLTK library:

import nltk

nltk.download("punkt")

text = "Welcome to the world of Natural Language Processing!"
tokens = nltk.word_tokenize(text)
print(tokens)
['Welcome', 'to', 'the', 'world', 'of', 'Natural', 'Language', 'Processing', '!']
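
For comparison, a naive whitespace split would leave the exclamation mark attached to the last word, which is one reason to prefer a proper word tokenizer (a quick sketch, not part of the NLTK example above):

text = "Welcome to the world of Natural Language Processing!"

# A plain split keeps punctuation glued to the preceding word
print(text.split())
# ['Welcome', 'to', 'the', 'world', 'of', 'Natural', 'Language', 'Processing!']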

Sentence Tokenization#

To implement sentence segmentation, we’ll use both PySBD (Python Sentence Boundary Disambiguation) and NLTK’s sent_tokenize function. Afterward, we’ll compare the results from these two approaches using the provided text.

Let’s first install the necessary libraries if they haven’t been installed yet:

%pip install pysbd nltk

Now, let’s start with the sentence segmentation:

import pysbd
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")

text = """
For strains harboring the pYV plasmid and Yop-encoding plasmids, bacteria were grown with
aeration at 26 °C overnight in broth supplemented with 2.5 mm CaCl2 and 100 μg/ml ampicillin and
then subcultured and grown at 26 °C until A600 of 0.2. At this point, the cultures were shifted to
37 °C and aerated for 1 h. A multiplicity of infection of 50:1 was used for YPIII(p-) incubations,
and a multiplicity of infection of 25:1 was used for other derivatives. For the pYopE-expressing
plasmid, 0.1 mm isopropyl-β-d-thiogalactopyranoside was supplemented during infection to induce
YopE expression.
"""
text = text.strip().replace("\n", " ")

# Sentence segmentation using NLTK
sentences_nltk = sent_tokenize(text)
print("Sentences according to NLTK:")
for i, sent in enumerate(sentences_nltk):
    print(f"{i+1}. {sent}\n")

# Sentence segmentation using PySBD
seg = pysbd.Segmenter(language="en", clean=True)
sentences_pysbd = seg.segment(text)
print("Sentences according to PySBD:")
for i, sent in enumerate(sentences_pysbd):
    print(f"{i+1}. {sent}\n")
Sentences according to NLTK:
1. For strains harboring the pYV plasmid and Yop-encoding plasmids, bacteria were grown with aeration at 26 °C overnight in broth supplemented with 2.5 mm CaCl2 and 100 μg/ml ampicillin and then subcultured and grown at 26 °C until A600 of 0.2.

2. At this point, the cultures were shifted to 37 °C and aerated for 1 h. A multiplicity of infection of 50:1 was used for YPIII(p-) incubations, and a multiplicity of infection of 25:1 was used for other derivatives.

3. For the pYopE-expressing plasmid, 0.1 mm isopropyl-β-d-thiogalactopyranoside was supplemented during infection to induce YopE expression.

Sentences according to PySBD:
1. For strains harboring the pYV plasmid and Yop-encoding plasmids, bacteria were grown with aeration at 26 °C overnight in broth supplemented with 2.5 mm CaCl2 and 100 μg/ml ampicillin and then subcultured and grown at 26 °C until A600 of 0.2.

2. At this point, the cultures were shifted to 37 °C and aerated for 1 h.

3. A multiplicity of infection of 50:1 was used for YPIII(p-) incubations, and a multiplicity of infection of 25:1 was used for other derivatives.

4. For the pYopE-expressing plasmid, 0.1 mm isopropyl-β-d-thiogalactopyranoside was supplemented during infection to induce YopE expression.

In this snippet, we first segment the text into sentences with NLTK’s sent_tokenize function and print each one, then do the same with PySBD’s segment method. The enumerate function is used to number the sentences. Examine the output from each method and compare them to evaluate how they handle this text.
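
For a quick numerical comparison in addition to reading the output, you can count how many sentences each segmenter produced; this short snippet simply reuses the sentences_nltk and sentences_pysbd variables from the cell above:

# Compare how many sentences each approach detected
print(f"NLTK found {len(sentences_nltk)} sentences.")
print(f"PySBD found {len(sentences_pysbd)} sentences.")
# On this text, NLTK merges the sentence ending in "1 h." with the following one,
# so it reports one sentence fewer than PySBD.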

Part 3: Pre-processing Steps in NLP#

Before or alongside tokenization, we often need to clean and standardize our text data. Common pre-processing steps include:

  1. Case Normalization: converting all text to the same case (upper or lower) to ensure consistency. For example, ‘text’, ‘Text’, and ‘TEXT’ all become ‘text’.

  2. Removing Punctuation: punctuation can provide meaningful context in certain NLP tasks, but for many others it is unnecessary and can be removed.

  3. Removing Stop Words: stop words are very common words in a language (e.g., ‘the’, ‘is’, ‘in’). In many NLP tasks they are filtered out because they carry little meaningful information.

  4. Stemming and Lemmatization: methods that reduce inflected (or sometimes derived) words to their root form; a lemmatization sketch follows the stemming example below.

Let’s see an example of these steps in action using NLTK:

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import string

nltk.download("punkt")
nltk.download("stopwords")

text = "The quick brown Fox jumped over the lazy Dog!"

# Case Normalization
text = text.lower()

# Remove punctuation
text = text.translate(str.maketrans("", "", string.punctuation))

# Word Tokenization
tokens = word_tokenize(text)

# Remove stopwords
stop_words = set(stopwords.words("english"))
tokens = [token for token in tokens if token not in stop_words]

# Stemming
ps = PorterStemmer()
tokens = [ps.stem(word) for word in tokens]

print(tokens)
['quick', 'brown', 'fox', 'jump', 'lazi', 'dog']

In this example, we lowercase the text, remove punctuation, tokenize the words, remove the stop words, and finally apply stemming.
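
The example above uses stemming, which can produce non-words such as ‘lazi’. Lemmatization instead maps words to their dictionary form. Here is a small sketch using NLTK’s WordNetLemmatizer; it requires the ‘wordnet’ resource, and the pos argument tells the lemmatizer which part of speech to assume:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()

# Without a POS hint, WordNet treats the word as a noun and leaves it unchanged
print(lemmatizer.lemmatize("jumped"))           # jumped
# With pos="v" the inflected verb is reduced to its dictionary form
print(lemmatizer.lemmatize("jumped", pos="v"))  # jump
# Unlike the stemmer, the lemmatizer keeps "lazy" intact
print(lemmatizer.lemmatize("lazy", pos="a"))    # lazy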

Part 4: Tokenization in Different Languages with Examples#

The tokenization process can differ considerably, and become more complex, in languages other than English.

Tokenization in French#

Tokenization in French can be a little more complex due to the frequent use of elisions (such as c’est or l’ami) and accented characters. Here is an example of tokenizing a French sentence using the NLTK library:

from nltk.tokenize import word_tokenize

text = "C'est une belle journée!"
tokens = word_tokenize(text)
print(tokens)
["C'est", 'une', 'belle', 'journée', '!']

Tokenization in Korean#

Tokenization in Korean can be quite complex because each space-delimited word typically bundles a stem together with particles and endings, so tokenizers usually work at the morpheme level. The Korean NLP library eKoNLPy provides a MeCab-based morphological analyzer. Here is an example:

%pip install eKoNLPy
from ekonlpy.mecab import MeCab

mecab = MeCab()
text = "안녕하세요! 자연어 처리를 배우고 있습니다."
tokens = mecab.pos(text)
print(tokens)
[('안녕', 'NNG'), ('하', 'XSV'), ('세요', 'EP+EF'), ('!', 'SF'), ('자연어', 'NNG'), ('처리', 'NNG'), ('를', 'JKO'), ('배우', 'VV'), ('고', 'EC'), ('있', 'VX'), ('습니다', 'EF'), ('.', 'SF')]
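
The pos output pairs each morpheme with a part-of-speech tag. If you only need content words, you can filter those pairs in plain Python, for example by keeping common nouns (tags starting with NN) and verb stems (VV), based on the tags shown above:

# Keep only nouns (NN*) and verb stems (VV) from the (morpheme, tag) pairs
content_words = [word for word, tag in tokens if tag.startswith(("NN", "VV"))]
print(content_words)  # ['안녕', '자연어', '처리', '배우']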

This concludes our lab lecture on tokenization and pre-processing in NLP. Remember, the specifics of these techniques may vary based on the language and the task at hand. Practice with different texts and languages to become more comfortable with these concepts.