Understanding the Basics#

What is Tokenization?#

Tokenization is the process of breaking down a stream of text into words, phrases, symbols, or other meaningful elements called tokens. These tokens are the fundamental units of a text, and they play a pivotal role in natural language processing (NLP).

Here’s what tokenization involves:

Input:#

The input for tokenization is a set of documents (text files), represented as \(D\).

Output:#

The output of tokenization is twofold:

  1. Tokens: A sequence, \(W\), containing the list of tokens (words or word pieces) for each document.

  2. N-grams: A matrix, \(X\), containing word/phrase (n-gram) frequency counts across the documents. A small illustration of \(D\), \(W\), and \(X\) follows this list.
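
A minimal plain-Python illustration of these objects, using a made-up two-document corpus:

```python
from collections import Counter

# D: a toy corpus of two documents (made-up examples)
D = ["hello world", "hello there world"]

# W: one token sequence per document (simple whitespace tokenization)
W = [doc.split() for doc in D]

# X: a document-by-term count matrix over the shared vocabulary
vocab = sorted({token for tokens in W for token in tokens})
X = [[Counter(tokens)[term] for term in vocab] for tokens in W]

print(W)      # [['hello', 'world'], ['hello', 'there', 'world']]
print(vocab)  # ['hello', 'there', 'world']
print(X)      # [[1, 0, 1], [1, 1, 1]]
```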

Token Examples:#

Tokens can be of various types (a short sketch of each follows this list):

  • Characters: Documents viewed as a sequence of individual letters. For example, “hello world” can be broken down into {h, e, l, l, o, ␣, w, o, r, l, d}, where ␣ marks the space character.

  • Words: Documents split based on whitespace. For example, “hello world” becomes {hello, world}.

  • N-grams: Documents tokenized into phrases. For example, “hello world” becomes {hello_world}.
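
A short sketch of the three tokenization levels on the same toy string, using only the Python standard library:

```python
text = "hello world"

# Characters: every symbol, including the space, is a token
chars = list(text)     # ['h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']

# Words: split on whitespace
words = text.split()   # ['hello', 'world']

# N-grams: join adjacent words into phrases (here, bigrams)
bigrams = ["_".join(pair) for pair in zip(words, words[1:])]  # ['hello_world']

print(chars, words, bigrams, sep="\n")
```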

Goals of Tokenization#

The main goals of tokenization in NLP are to create features that are:

  • Predictive: They should contribute meaningfully to the learning task.

  • Interpretable: Humans should be able to understand what the features represent.

  • Tractable: They should be easy to work with computationally.

There are two primary approaches to achieving these goals:

  1. Convert documents to vectors, typically frequency distributions over pre-processed n-grams.

  2. Convert documents to sequences of tokens, suitable as inputs to sequential models like RNNs or Transformer models.

A Traditional Tokenization Pipeline#

The typical tokenization process involves several steps (a minimal sketch follows the list):

  1. Extract text from documents (e.g., PDF, HTML, XML).

  2. Tokenize the extracted text into words.

  3. Normalize the words (e.g., lowercasing, stemming, lemmatization).

  4. Remove stop words (common words like “the”, “a”, “an”, “in”).

  5. Build a vocabulary of words.

  6. Convert the documents into vectors of word counts or TF-IDF scores.

  7. Train a machine learning model on these vectors.
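
A minimal sketch of steps 2–7 with scikit-learn, assuming the text has already been extracted; the documents and labels are invented for illustration, and stemming/lemmatization is omitted. TfidfVectorizer handles tokenization, lowercasing, stop-word removal, vocabulary building, and vectorization, while LogisticRegression stands in for the downstream model:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy documents and labels, invented for illustration
docs = ["the cat sat on the mat",
        "dogs are playing in the park",
        "stock prices fell sharply",
        "the market rallied today"]
labels = [0, 0, 1, 1]  # 0 = pets, 1 = finance

# Steps 2-6: tokenize, lowercase, drop stop words, build a vocabulary,
# and convert each document into a TF-IDF vector
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english", ngram_range=(1, 2))

# Step 7: train a simple classifier on those vectors
model = make_pipeline(vectorizer, LogisticRegression())
model.fit(docs, labels)

print(model.predict(["the dog sat quietly"]))  # likely [0]: 'sat' occurs only in the pet documents
```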

Subword Tokenization for Sequence Models#

Modern transformer models (like BERT and GPT) use a method known as subword tokenization:

  • They construct character-level n-grams and treat whitespace the same as other characters.

  • In some tokenizers, all letters are converted to lowercase and a special marker is added to indicate that the original word was capitalized.

For example, BERT’s WordPiece tokenizer breaks a word like “playing” into “play” and “##ing” when the full word is not in its vocabulary. WordPiece is built with a byte-pair-encoding-style algorithm over characters and is limited to a fixed vocabulary size (roughly 30K tokens for BERT).
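
A minimal sketch with the Huggingface transformers library; the exact splits depend on the learned vocabulary, so the outputs are not guaranteed to match any particular example:

```python
from transformers import AutoTokenizer

# Load BERT's WordPiece tokenizer (downloads the vocabulary on first use)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Rare or complex words are split into pieces marked with '##';
# the exact splits depend on the learned ~30K vocabulary
print(tokenizer.tokenize("The choreographer was unaffected by the criticism"))

# Each piece maps to an integer id in that fixed vocabulary
print(tokenizer.convert_tokens_to_ids(tokenizer.tokenize("playing")))
```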

Segmenting Text into Sentences and Paragraphs#

Segmentation of text into sentences and paragraphs is also an important part of tokenization:

  • Libraries like spaCy and pySBD provide effective sentence-splitting capabilities (see the sketch after this list).

  • In most corpora, paragraph segmentation relies on explicit paragraph annotations or on line breaks.
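
A minimal sentence-splitting sketch using spaCy's rule-based sentencizer (pySBD exposes a similar segmenter interface):

```python
import spacy

# A blank English pipeline with a rule-based sentence splitter
# (no statistical model download required)
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "Tokenization matters. Good sentence boundaries make every downstream step easier."
doc = nlp(text)

for sent in doc.sents:
    print(sent.text)
```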

Pre-processing: The Art of Text Analysis#

Pre-processing involves deciding which data to keep and which to discard:

  • Uninformative data can add noise, reduce statistical precision, and increase computational cost.

  • Pre-processing choices can significantly impact the results, especially in unsupervised learning tasks.

Some Common Pre-processing Techniques#

Capitalization#

Standard corpus normalization often removes capitalization, since it is usually not informative. However, capitalization is crucial for some tasks (e.g., sentence splitting, part-of-speech tagging, syntactic parsing, and semantic role labeling). For sequence data, such as language modeling, some tokenizer packages (Huggingface’s tokenizers library, for example) lowercase the text but add a special “capitalized” marker before the word.
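
A hypothetical sketch of this lowercase-plus-marker idea; the “<cap>” token and the helper function are invented for illustration and do not mirror any particular library's behavior:

```python
CAP_TOKEN = "<cap>"  # hypothetical marker; real tokenizers use their own conventions

def lowercase_with_marker(text: str) -> list[str]:
    """Lowercase every word, prefixing originally capitalized words with a marker token."""
    tokens = []
    for word in text.split():
        if word[:1].isupper():
            tokens.append(CAP_TOKEN)
        tokens.append(word.lower())
    return tokens

print(lowercase_with_marker("Paris is the capital of France"))
# ['<cap>', 'paris', 'is', 'the', 'capital', 'of', '<cap>', 'france']
```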

Punctuation#

Whether to include punctuation depends on your task. When vectorizing documents as bags of words or n-grams, punctuation is usually unnecessary. As with capitalization, however, punctuation is needed for tasks like sentence splitting, part-of-speech tagging, syntactic parsing, and semantic role labeling.
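
A quick sketch of stripping ASCII punctuation before building bag-of-words features, using only the Python standard library:

```python
import string

text = "Hello, world! Isn't tokenization fun?"

# Remove every ASCII punctuation character before splitting into words
no_punct = text.translate(str.maketrans("", "", string.punctuation))
print(no_punct.split())  # ['Hello', 'world', 'Isnt', 'tokenization', 'fun']
```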

Numbers#

For tasks like document classification using a bag of words, numbers can be dropped or replaced with a special placeholder token. For language models, however, numbers are treated just like letters; GPT-3, for instance, can even solve basic math problems.
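
A small sketch of replacing numbers with a placeholder before vectorizing; the “<num>” placeholder is an arbitrary choice:

```python
import re

text = "Revenue grew 12.5% to 3,400,000 dollars in 2023"

# Replace digit sequences (with optional separators) by a placeholder token
normalized = re.sub(r"\d[\d.,]*", "<num>", text)
print(normalized)  # "Revenue grew <num>% to <num> dollars in <num>"
```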

Stopwords#

Stopwords are words so common that they don’t carry much information. Depending on your task, you can drop stopwords or keep them as part of phrases. You can also filter out words and phrases using part-of-speech tags.
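
A small sketch using scikit-learn's built-in English stop-word list (NLTK and spaCy ship similar lists):

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

tokens = "the cat sat on the mat in the sun".split()

# Keep only tokens that are not in the stop-word list
content_tokens = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
print(content_tokens)  # roughly ['cat', 'sat', 'mat', 'sun']
```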

Stemming/Lemmatizing#

Stemming (reducing words to their root form) and lemmatization (reducing words to their base or dictionary form) can reduce the dimensionality of the data with minimal loss of information. For example, “running” is reduced to “run”, so the phrase “I am running” becomes “I am run”.
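
A minimal sketch with NLTK's Porter stemmer and WordNet lemmatizer; the lemmatizer needs a one-time download of the WordNet corpus:

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
# import nltk; nltk.download("wordnet")  # one-time download for the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["running", "studies", "better"]

# Stemming chops suffixes heuristically; lemmatization maps to dictionary forms
print([stemmer.stem(w) for w in words])                   # e.g. ['run', 'studi', 'better']
print([lemmatizer.lemmatize(w, pos="v") for w in words])  # e.g. ['run', 'study', 'better']
```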

These techniques are part of the “art” of text analysis and require careful consideration as they can significantly impact the performance of the downstream tasks.

Conclusion#

Tokenization is a crucial step in NLP and can heavily influence the success of your text analytics project. By understanding these principles and approaches, you can ensure that your text data is processed effectively, leading to more accurate and insightful outcomes. Remember that while there are standard practices, the best approach often depends on the specifics of your task and the characteristics of your data.