N-gram Language Models#

N-gram language models are a simple yet powerful approach to language modeling. They predict the next word in a sequence based on the previous n-1 words, where n is the order of the model. The main types of n-grams are:

  • Unigram: Considers only a single word, ignoring context (n=1).

  • Bigram: Considers a sequence of two words (n=2).

  • Trigram: Considers a sequence of three words (n=3).

  • Higher-order n-grams: Considers longer sequences of words (n>3).

In n-gram models, the probability of a word depends only on the (n-1) previous words. This simplification is known as the Markov assumption. To estimate the probabilities, n-gram models rely on counting the occurrences of n-grams in a large corpus and normalizing the counts.

Advantages of n-gram models:

  • Relatively simple and easy to implement.

  • Efficient in terms of computation and memory usage.

Limitations of n-gram models:

  • Unable to capture long-range dependencies between words.

  • Sensitive to data sparsity issues, as some n-grams may not appear in the training corpus.

Despite their limitations, n-gram models serve as a foundational tool for understanding language modeling concepts and are still used in various NLP applications.

N-Grams and Probability Estimation#

N-grams can be used to estimate the probability of a word given a history (P(w|h)). For instance, if the history h is “I like to eat”, we might want to estimate the probability of the next word being “pizza” (P(pizza|I like to eat)).

Estimating Probabilities from Counts#

We can estimate this probability by counting the occurrences of the history followed by the target word in a large corpus and dividing by the total count of the history:

\[ P(\text{pizza}|\text{I like to eat}) = \frac{C(\text{I like to eat pizza})}{C(\text{I like to eat})} \]
\[ P(\text{먹습니다}|\text{저는 김치를}) = \frac{C(\text{저는 김치를 먹습니다})}{C(\text{저는 김치를})} \]

(The second example is Korean: “저는 김치를 먹습니다” means “I eat kimchi”, with the verb 먹습니다, “eat”, as the word being predicted from the history 저는 김치를.)
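To make the counting concrete, here is a minimal sketch on a tiny, made-up corpus (a real estimate would of course need a far larger one); the corpus string and the helper function are purely illustrative:

# A tiny made-up corpus; real estimates require a much larger one
corpus = (
    "i like to eat pizza . i like to eat pasta . "
    "i like to eat pizza with friends . i like to sleep ."
)
tokens = corpus.split()

def count_sequence(tokens, seq):
    """Count how many times the word sequence seq occurs in tokens."""
    n = len(seq)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i : i + n] == seq)

history = ["i", "like", "to", "eat"]
continuation = history + ["pizza"]

# P(pizza | i like to eat) = C(i like to eat pizza) / C(i like to eat)
print(count_sequence(tokens, continuation) / count_sequence(tokens, history))  # 2/3 ≈ 0.667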

However, even a very large corpus is often not sufficient to provide good estimates, because language is creative and many perfectly plausible word sequences never occur in the training data.

Bigram Model#

To tackle this issue, we can use the bigram model, which approximates the probability of a word given its entire history by considering only the preceding word:

\[ P(\text{pizza}|\text{I like to eat}) \approx P(\text{pizza}|\text{eat}) \]
\[ P(\text{먹습니다}|\text{저는 김치를}) \approx P(\text{먹습니다}|\text{김치를}) \]

This simplification lets us estimate probabilities much more reliably from the data we have, at the cost of ignoring longer-range context dependencies.

Estimating Joint Probabilities of Word Sequences#

To estimate the joint probability of an entire sequence of words, such as “The cat sat on the mat”, we can decompose this probability using the chain rule of probability:

\[ P(w_{1:n}) = P(w_1)P(w_2|w_1)P(w_3|w_{1:2}) \cdots P(w_n|w_{1:n-1}) = \prod_{k=1}^n P(w_k|w_{1:k-1}) \]

This decomposition shows the link between computing the joint probability of a sequence and computing the conditional probability of a word given previous words. However, it doesn’t really seem to help us! Computing the exact probability of a word given a long sequence of preceding words (e.g., \(P(w_n|w_{1:n-1})\)) is challenging because language is creative, and any particular context might have never occurred before.

Example#

For the sequence “The cat sat on the mat”, we can decompose the joint probability as follows:

\[ P(\text{The, cat, sat, on, the, mat}) = P(\text{The})P(\text{cat}|\text{The})P(\text{sat}|\text{The, cat})...P(\text{mat}|\text{The, cat, sat, on, the}) \]

Estimating each conditional probability using counts from a large corpus is not feasible since many long sequences might never have occurred before.

N-gram Models and Markov Assumption#

N-gram models are used to predict the probability of a word given a fixed number of preceding words. The assumption that the probability of a word depends only on a limited number of previous words is called the Markov assumption.

For a bigram model (N=2), the approximation is:

\[ P(w_n|w_{1:n-1}) \approx P(w_n|w_{n-1}) \]

For an n-gram model with size N, the approximation is:

\[ P(w_n|w_{1:n-1}) \approx P(w_n|w_{n-N+1:n-1}) \]

Given the n-gram assumption for the probability of an individual word, we can compute the probability of a complete word sequence as:

\[ P(w_{1:n}) \approx \prod_{k=1}^n P(w_k|w_{k-N+1:k-1}) \]

Example

For the trigram model (N=3), the approximation for the sequence “The cat sat on the mat” is:

\[ P(\text{The, cat, sat, on, the, mat}) \approx P(\text{The})P(\text{cat}|\text{The})P(\text{sat}|\text{The, cat})P(\text{on}|\text{cat, sat})P(\text{the}|\text{sat, on})P(\text{mat}|\text{on, the}) \]
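As a small sketch of how such a factorization is evaluated in practice, the code below multiplies conditional probabilities under a bigram Markov assumption; the probability values are hypothetical and purely for illustration:

import math

# Hypothetical bigram probabilities P(w_k | w_{k-1}); not estimated from any real corpus
bigram_probs = {
    ("<s>", "The"): 0.25,
    ("The", "cat"): 0.10,
    ("cat", "sat"): 0.30,
    ("sat", "on"): 0.40,
    ("on", "the"): 0.35,
    ("the", "mat"): 0.05,
    ("mat", "</s>"): 0.50,
}

sentence = ["<s>", "The", "cat", "sat", "on", "the", "mat", "</s>"]

# P(w_{1:n}) is approximated by the product of P(w_k | w_{k-1})
prob = math.prod(bigram_probs[(prev, cur)] for prev, cur in zip(sentence, sentence[1:]))
print(prob)  # 2.625e-05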

Stock Prices and Markov Assumption#

The Markov assumption can be applied to model stock prices as well. In finance, the Markov assumption is often used to represent the idea that future stock prices depend only on the current price and a limited number of past prices, rather than the entire price history.

The relationship between stock prices and the Markov assumption can be understood as follows:

  1. Memoryless Property: The Markov assumption implies that the next price movement depends only on the current price (or a small, fixed window of recent prices), not on the entire price history. This is known as the memoryless property of Markov models.

  2. Simplifying Complex Systems: Stock prices are affected by a vast number of factors, including market trends, company performance, global events, and investor sentiment. By applying the Markov assumption, we can simplify the modeling of stock prices by focusing only on the most recent price changes, which are assumed to capture relevant information.

  3. Prediction and Analysis: Using the Markov assumption in financial models allows us to predict and analyze stock price movements. For example, we can create Markov models to estimate the probabilities of stock price changes, which can be useful for trading strategies, risk management, and portfolio optimization.

It’s important to note that the Markov assumption is a simplification and may not always accurately represent the complexities of the stock market. However, it serves as a useful tool in modeling and analyzing stock prices.
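As a toy illustration of point 3 (not a realistic financial model), a first-order Markov chain over daily up/down moves can be estimated simply by counting transitions in a made-up sequence of price changes:

from collections import Counter, defaultdict

# Hypothetical sequence of daily price moves: "U" = up, "D" = down
moves = ["U", "U", "D", "U", "D", "D", "U", "U", "U", "D", "U", "D", "D", "D", "U"]

# Count transitions from each move to the next
transitions = defaultdict(Counter)
for prev, cur in zip(moves, moves[1:]):
    transitions[prev][cur] += 1

# Normalize the counts into conditional probabilities P(next move | previous move)
for prev, counts in transitions.items():
    total = sum(counts.values())
    probs = {cur: round(c / total, 2) for cur, c in counts.items()}
    print(f"P(next | {prev}) = {probs}")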

Estimating Bigram or N-gram Probabilities using Maximum Likelihood Estimation (MLE)#

To estimate the probabilities for bigrams or n-grams, we use maximum likelihood estimation (MLE) by counting the occurrences in a corpus and normalizing the counts.

MLE is a general statistical method for estimating the parameters of a probability distribution or model. It works by finding the parameter values that maximize the likelihood of the observed data under the given model; in other words, MLE selects the parameters that make the observed data most probable.

For example, if we have a dataset of coin tosses (heads and tails), and we want to estimate the probability of getting heads, we can use MLE to find the parameter value that makes the observed sequence of coin tosses most likely. This is usually done by calculating the ratio of the number of heads to the total number of tosses.

Bigram Probability Calculation#

Compute the bigram probability of a word \(w_n\) given the previous word \(w_{n-1}\):

\[ P(w_n|w_{n-1}) = \frac{C(w_{n-1}w_n)}{C(w_{n-1})} \]

where \(C(w_{n−1} w_n)\) is the count of the bigram \(w_{n−1}w_n\) and \(C(w_{n−1})\) is the count of the unigram \(w_{n−1}\).

Example

Consider a mini-corpus with three sentences:

  1. <s> I am Sam </s>

  2. <s> Sam I am </s>

  3. <s> I do not like green eggs and ham </s>

Here are the bigram probabilities for some pairs in this corpus:

  Bigram        Probability
  P(I|<s>)      2/3
  P(Sam|<s>)    1/3
  P(am|I)       2/3
  P(</s>|Sam)   1/2
  P(Sam|am)     1/2
  P(do|I)       1/3
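These values can be reproduced with a few lines of Python; the sketch below simply counts unigrams and bigrams in the three sentences and applies the MLE ratio:

from collections import Counter

sentences = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in sentences:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    # P(word | prev) = C(prev word) / C(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("<s>", "I"))     # 0.666... (2/3)
print(bigram_prob("I", "am"))      # 0.666... (2/3)
print(bigram_prob("Sam", "</s>"))  # 0.5 (1/2)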

For general MLE n-gram parameter estimation:

\[ P(w_n|w_{n-N+1:n-1}) = \frac{C(w_{n-N+1:n-1}\,w_n)}{C(w_{n-N+1:n-1})} \]

Bigram probabilities capture various linguistic phenomena, such as syntax, task-specific patterns, and cultural preferences.

Practical Issues in N-gram Models#

  1. Higher-order n-grams: In practice, trigram (conditioning on the previous two words), 4-gram, or even 5-gram models are more common when there is sufficient training data. For these larger n-grams, extra context is needed at the sentence boundaries; for example, a trigram model pads the start of each sentence with pseudo-words so that the first probability is P(I|<s><s>).

  2. Log probabilities: Since multiplying probabilities can lead to numerical underflow, it’s better to represent and compute language model probabilities in log format. Adding log probabilities is equivalent to multiplying probabilities in linear space. Convert back to probabilities when needed by taking the exponential of the log probability:

\[ p_1 \times p_2 \times p_3 \times p_4 = \exp(\log{p_1} + \log{p_2} + \log{p_3} + \log{p_4}) \]
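A quick sketch of why log space matters: the direct product of many small probabilities underflows to zero, while the sum of their logs stays in a safe numeric range (the probability values here are arbitrary):

import math

# Arbitrary small per-word probabilities, as a long sentence might produce
probs = [1e-4] * 200

# Direct multiplication underflows to 0.0
print(math.prod(probs))  # 0.0

# Summing log probabilities avoids the underflow entirely
log_prob = sum(math.log(p) for p in probs)
print(log_prob)  # about -1842.07; exponentiate only when a probability is actually needed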

Examples of MLE#

Here’s a practical example of MLE using Python. We’ll estimate the parameter p of a Bernoulli distribution (coin toss) based on some observed data.

import numpy as np

# Observed data (1 for heads, 0 for tails)
data = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1])

# Calculate MLE for p (probability of heads)
mle_p = np.mean(data)

print("MLE for p (probability of heads):", mle_p)
MLE for p (probability of heads): 0.65

In this example, we have a sequence of 20 coin tosses, where 1 represents heads, and 0 represents tails. To estimate the probability of getting heads (p) using MLE, we simply calculate the mean of the observed data, which is the ratio of the number of heads to the total number of tosses.

Here’s another example, this time estimating the mean (mu) and standard deviation (sigma) of a Gaussian distribution given some observed data.

import numpy as np

# Observed data
data = np.array([2.5, 3.2, 4.1, 3.7, 2.8, 3.9, 4.5, 3.1, 2.9, 4.2])

# Calculate MLE for mu (mean) and sigma (standard deviation)
mle_mu = np.mean(data)
mle_sigma = np.std(data)  # default ddof=0 gives the maximum likelihood estimate of sigma

print("MLE for mu (mean):", mle_mu)
print("MLE for sigma (standard deviation):", mle_sigma)
MLE for mu (mean): 3.4899999999999998
MLE for sigma (standard deviation): 0.6441273166075167

How to generate n-grams using NLTK#

Here’s an example of how to generate n-grams using the Natural Language Toolkit (NLTK) library in Python.

First, make sure to install NLTK:

%pip install nltk

Now, let’s generate bigrams, trigrams, and 4-grams using NLTK:

import nltk
from nltk.util import ngrams

# Download the punkt tokenizer
nltk.download("punkt")

# Sample text
text = "I love natural language processing and machine learning."

# Tokenize the text
tokens = nltk.word_tokenize(text)

# Generate bigrams
bigrams = list(ngrams(tokens, 2))
print("Bigrams:")
print(bigrams)

# Generate trigrams
trigrams = list(ngrams(tokens, 3))
print("\nTrigrams:")
print(trigrams)

# Generate 4-grams (quadgrams)
quadgrams = list(ngrams(tokens, 4))
print("\n4-grams:")
print(quadgrams)
[nltk_data] Downloading package punkt to /home/yjlee/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Bigrams:
[('I', 'love'), ('love', 'natural'), ('natural', 'language'), ('language', 'processing'), ('processing', 'and'), ('and', 'machine'), ('machine', 'learning'), ('learning', '.')]

Trigrams:
[('I', 'love', 'natural'), ('love', 'natural', 'language'), ('natural', 'language', 'processing'), ('language', 'processing', 'and'), ('processing', 'and', 'machine'), ('and', 'machine', 'learning'), ('machine', 'learning', '.')]

4-grams:
[('I', 'love', 'natural', 'language'), ('love', 'natural', 'language', 'processing'), ('natural', 'language', 'processing', 'and'), ('language', 'processing', 'and', 'machine'), ('processing', 'and', 'machine', 'learning'), ('and', 'machine', 'learning', '.')]

Evaluating Language Models#

Evaluating language models can be done in two main ways: extrinsic evaluation and intrinsic evaluation.

Extrinsic evaluation is the process of embedding a language model in an application and measuring its performance improvement in that application. For instance, in speech recognition, we can compare two language models by running a speech recognizer with each model and determining which one produces more accurate transcriptions.

However, running large NLP systems end-to-end can be resource-intensive. As an alternative, we can use intrinsic evaluation metrics to quickly evaluate potential improvements in a language model. Intrinsic evaluation measures the quality of a model independent of any application.

To perform an intrinsic evaluation, we need a test set and a training set. The training set is used to train the parameters of the n-gram model, while the test set measures the model’s performance. The better model is the one that assigns a higher probability to the test set, meaning it more accurately predicts the test set.

It’s crucial not to include test sentences in the training set, as this would introduce bias and inflate probability-based metrics. If we use a particular test set so often that we implicitly tune to its characteristics, that set should be treated as a development test set (devset), and a truly unseen, fresh test set should be reserved for the final evaluation.

To divide data into training, development, and test sets, a common practice is to allocate 80% for training, 10% for development, and 10% for testing. Test data can either be taken from a continuous sequence within the corpus or be selected as smaller “stripes” from random parts of the corpus combined into a test set.
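A minimal sketch of the contiguous 80/10/10 split described above; sents stands for any list of (tokenized) sentences and is just placeholder data here:

def split_corpus(sents, train_frac=0.8, dev_frac=0.1):
    """Split a list of sentences into contiguous train/dev/test portions."""
    n_train = int(len(sents) * train_frac)
    n_dev = int(len(sents) * dev_frac)
    return sents[:n_train], sents[n_train : n_train + n_dev], sents[n_train + n_dev :]

# Placeholder data: 100 dummy "sentences"
sents = [["sentence", str(i)] for i in range(100)]
train, dev, test = split_corpus(sents)
print(len(train), len(dev), len(test))  # 80 10 10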

Perplexity#

Perplexity is a measure used to evaluate the performance of a language model. It quantifies how well a model predicts the probability distribution of a test set. A lower perplexity value indicates that the model is more accurate in predicting the test set.

Perplexity can be understood as the weighted average branching factor of a language model. In other words, it represents the average number of possible next words the model considers at each step while generating a sentence. Lower perplexity means the model has fewer choices to consider, resulting in better prediction accuracy.

The perplexity (PP) of a language model on a test set can be calculated using the following equation:

\[ PP(W) = P(w_1, w_2, ..., w_N)^{-1/N} \]

Here, W is the sequence of words in the test set, and N is the total number of words in the test set. \(P(w_1, w_2, ..., w_N)\) is the joint probability of the entire word sequence in the test set.

Since we usually work with log probabilities to avoid numerical underflow, we can rewrite the perplexity equation using log probabilities:

\[ PP(W) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{1:i-1})\right) \]

In the context of n-gram models, we can compute the perplexity by replacing the joint probability with the product of conditional probabilities of each word given its (n-1) previous words:

\[ PP(W) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{i-n+1:i-1})\right) \]

Example

Suppose we have a bigram model and the following test sentence: “I like pizza”. The perplexity of the model on this test sentence can be calculated using the bigram probabilities as follows:

  1. Compute the log probabilities: \(\log(P(\text{I}|\text{<s>}))\), \(\log(P(\text{like}|\text{I}))\), \(\log(P(\text{pizza}|\text{like}))\), and \(\log(P(\text{</s>}|\text{pizza}))\).

  2. Add up the log probabilities.

  3. Average the sum over the number of predicted tokens and negate it. Here there are 4 predictions (“I”, “like”, “pizza”, and the end symbol </s>; the start symbol <s> is not itself predicted): \(-\frac{1}{4} \times \text{sum of log probabilities}\).

  4. Take the exponential of the result: \(\exp(-\frac{1}{4} \times \text{sum of log probabilities})\).

The perplexity value obtained will be a measure of how well the bigram model predicts the test sentence “I like pizza”.
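Here is a toy numeric version of these four steps, using made-up bigram probabilities (we have no corpus to estimate them from in this example):

import math

# Hypothetical bigram probabilities for "<s> I like pizza </s>"
bigram_probs = {
    ("<s>", "I"): 0.20,
    ("I", "like"): 0.30,
    ("like", "pizza"): 0.10,
    ("pizza", "</s>"): 0.40,
}

# Steps 1-2: compute and sum the log probabilities
log_prob_sum = sum(math.log(p) for p in bigram_probs.values())

# Steps 3-4: average over the 4 predicted tokens, negate, and exponentiate
n_tokens = len(bigram_probs)  # I, like, pizza, </s>
perplexity = math.exp(-log_prob_sum / n_tokens)
print(perplexity)  # about 4.52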

Here’s how to implement perplexity manually using Python and NLTK. We’ll use the Brown Corpus and train a trigram model.

import nltk
import math
from nltk.util import ngrams
from collections import Counter, defaultdict
from nltk.corpus import brown

nltk.download("brown")

# Set n-gram order to 3 (trigram)
n = 3

# Divide the data into 80% training and 20% test
train_size = int(len(brown.sents()) * 0.8)
train_data = brown.sents()[:train_size]
test_data = brown.sents()[train_size:]

# Create n-grams and vocabulary from training data
ngrams_counts = defaultdict(Counter)
vocabulary = set()

for sentence in train_data:
    for gram in range(1, n + 1):
        for ng in ngrams(sentence, gram, pad_left=True, pad_right=True):
            ngrams_counts[gram][ng] += 1
            vocabulary.update(ng)

V = len(vocabulary)  # Vocabulary size
[nltk_data] Downloading package brown to /home/yjlee/nltk_data...
[nltk_data]   Package brown is already up-to-date!

Now, let’s calculate the perplexity:

# Calculate the probability of a trigram using Laplace smoothing
def trigram_prob(trigram):
    bigram = trigram[:-1]
    return (ngrams_counts[3][trigram] + 1) / (ngrams_counts[2][bigram] + V)


# Calculate perplexity on the test data
log_prob_sum = 0
test_word_count = 0

for sentence in test_data:
    for trigram in ngrams(sentence, n, pad_left=True, pad_right=True):
        log_prob_sum += math.log2(trigram_prob(trigram))
        if trigram[-1] is not None:
            test_word_count += 1

perplexity = 2 ** (-log_prob_sum / test_word_count)
print(f"Perplexity: {perplexity}")
Perplexity: 39764.020490463474

Perplexity’s Relation to Entropy#

Perplexity is closely related to entropy, a concept from information theory. Entropy is a measure of the average amount of information (in bits) needed to represent the outcome of a random process, such as the choice of a word in a sentence. Lower entropy indicates that the outcomes are more predictable, while higher entropy implies more uncertainty.

For a language model, the (cross-)entropy on a test set is the average negative log probability that the model assigns to each word. Mathematically, the entropy \(H(W)\) of an n-gram model can be written as:

\[ H(W) = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid w_{i-n+1:i-1}) \]

Perplexity (PP) is 2 raised to this entropy (with logarithms taken in base 2) and can be seen as the weighted average branching factor of a language model. It gives an indication of how well the model predicts the test data: lower perplexity values indicate better predictive performance. The relationship between perplexity and entropy is:

\[ PP(W) = 2^{H(W)} \]

In summary, perplexity is closely related to entropy. Entropy measures the average amount of information needed to represent the outcomes of a random process, and perplexity is the exponentiation of entropy. Lower perplexity and entropy values indicate better predictive performance and more certainty in the language model’s predictions.
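To make the relationship concrete, here is a small sketch using the same made-up per-word probabilities as the toy perplexity example above; it computes the entropy in bits and confirms that 2 raised to that entropy equals the perplexity:

import math

# Hypothetical per-word model probabilities on a toy test sequence
word_probs = [0.20, 0.30, 0.10, 0.40]
N = len(word_probs)

# Entropy: average negative log2 probability per word (in bits)
entropy = -sum(math.log2(p) for p in word_probs) / N
print(entropy)  # about 2.18 bits per word

# Perplexity is 2 raised to the entropy
print(2 ** entropy)  # about 4.52, matching the exp-based computation above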