Bag of Words Model#

In Natural Language Processing, there are many methods and models for representing and understanding text data. One of the simplest yet surprisingly powerful is the Bag of Words (BoW) model. It is intuitive, easy to implement, and has been used effectively in NLP tasks such as information retrieval, sentiment analysis, and document classification.

What is the Bag of Words Model?#

The Bag of Words (BoW) model is a way of representing text data so that it can be used with machine learning algorithms. It describes the occurrence of words within a document or a collection of documents (a corpus). It’s called a “bag” of words because it discards most of the structure of the input text, such as paragraph and sentence boundaries, punctuation, and word order, keeping only the frequency of each word in the text.

How does it work?#

Here is a basic step-by-step breakdown of how the BoW model works, with a short code sketch after the list:

  1. Tokenization: The first step in creating a BoW model is to break down the text into individual words or tokens. This process is known as tokenization.

  2. Vocabulary Creation: After tokenization, a vocabulary of known words (unique words) for a given corpus is created.

  3. Encoding: Finally, each document is encoded as a vector in an n-dimensional space, where n is the size of the created vocabulary. This vector contains the frequency of each word in the document.
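As a concrete illustration of these three steps, here is a minimal sketch in plain Python. It assumes a very simple regex tokenizer that lowercases the text and strips punctuation, and it sorts the vocabulary alphabetically; the corpus and function names are illustrative only, not part of any particular library.

```python
import re
from collections import Counter

# A tiny illustrative corpus (any list of strings would work here).
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The dog barks at the fox.",
]

# 1. Tokenization: lowercase the text and keep only alphabetic tokens.
def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

tokenized_docs = [tokenize(doc) for doc in corpus]

# 2. Vocabulary creation: collect the unique words across the corpus.
vocabulary = sorted({word for doc in tokenized_docs for word in doc})

# 3. Encoding: represent each document as a vector of word counts,
#    one position per vocabulary word.
def encode(tokens, vocabulary):
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vectors = [encode(doc, vocabulary) for doc in tokenized_docs]

print(vocabulary)
for vector in vectors:
    print(vector)
```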

Here’s a simple example to illustrate:

Consider the following sentences:

  1. The cat sat on the mat.

  2. The dog sat on the log.

  3. Cats and dogs are mortal enemies.

First, we tokenize each sentence, breaking it down into individual words (lowercasing them and stripping punctuation along the way).

Next, we create a vocabulary of unique words from all sentences:

{the, cat, sat, on, mat, dog, log, cats, and, dogs, are, mortal, enemies}

Then, we create a vector for each sentence, encoding the frequency of each word from our vocabulary:

  1. The cat sat on the mat: [2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

  2. The dog sat on the log: [2, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0]

  3. Cats and dogs are mortal enemies: [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

Each vector’s position corresponds to a word in our vocabulary and the value at each index is the count of that word in the sentence.
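The same vectors can be produced with a library instead of by hand. Below is a sketch using scikit-learn’s CountVectorizer (assuming scikit-learn is installed); note that it lowercases the text and orders the vocabulary alphabetically, so the columns come out in a different order than the hand-built vocabulary above, but each row still holds the word counts for one sentence.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Cats and dogs are mortal enemies.",
]

# Learn the vocabulary and encode each sentence as a row of word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)      # sparse document-term matrix

print(vectorizer.get_feature_names_out())    # vocabulary, alphabetically ordered
print(X.toarray())                           # one row of counts per sentence
```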

Limitations of the Bag of Words Model#

Despite its simplicity, the BoW model has several limitations:

  1. Semantic meaning: BoW does not capture the order or semantics of the words in the text, so it cannot understand the context in which words are used.

  2. Sparse representations: BoW representations are typically high-dimensional (due to the size of the vocabulary) and sparse (since each document only contains a small subset of words). This can lead to computational challenges and inefficiencies.

  3. Stop words: Common words that carry little meaning on their own, known as stop words (e.g., “is”, “the”, “and”), tend to have high counts in the BoW model, which can drown out more informative words and lead to less meaningful representations (see the sketch after this list).

  4. Word sense disambiguation: BoW cannot handle words with multiple meanings; for example, “bank” as the side of a river and “bank” as a financial institution are counted as the same token.
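As a rough illustration of the stop-word issue, one common mitigation is to filter such words out before counting. The sketch below uses scikit-learn’s built-in English stop-word list (the exact list is implementation-specific, and this is only one possible remedy):

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Cats and dogs are mortal enemies.",
]

# Drop common English stop words (e.g., "the", "on", "and") before counting.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # stop words no longer appear
print(X.toarray())
```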

Despite these limitations, the Bag of Words model is a fundamental step in understanding text data in machine learning. It forms the basis for more sophisticated methods, like TF-IDF and Word2Vec, which we will explore in more depth in subsequent lectures.