Usage of Language Models#

Sampling Sentences from a Language Model#

Sampling sentences from a language model means generating sentences according to the probabilities the model assigns. Doing so gives us a concrete picture of what the model has learned: sentences to which the model assigns higher probability are more likely to be generated than those with lower probability.

For example, consider a unigram language model. Imagine the interval from 0 to 1 divided into segments, one for each word in the model’s vocabulary, with each segment’s width proportional to that word’s probability. To generate a word, we pick a random value between 0 and 1 and select the word whose segment contains that value. We repeat this process until we generate the sentence-final token </s>.
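
As a minimal sketch of this interval-based sampling, the snippet below uses a tiny, assumed unigram distribution (the words and probabilities are made up for illustration) and draws words by walking the cumulative intervals until a random value falls inside one.

import random

# Toy unigram probabilities (assumed for illustration). Each word owns an
# interval of the 0-1 line whose width equals its probability.
unigram_probs = {"the": 0.4, "cat": 0.2, "sat": 0.2, "</s>": 0.2}


def sample_word(probs):
    r = random.random()  # random value between 0 and 1
    cumulative = 0.0
    for word, p in probs.items():
        cumulative += p  # right edge of this word's interval
        if r < cumulative:
            return word
    return word  # guard against floating-point rounding


# Keep sampling words until the sentence-final token </s> is drawn.
words = []
word = sample_word(unigram_probs)
while word != "</s>":
    words.append(word)
    word = sample_word(unigram_probs)
print(" ".join(words))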

For bigram models, the process is similar. We first generate a random bigram that starts with <s>, according to its bigram probability. Say the second word of that bigram is w. We then choose a random bigram starting with w, again according to its bigram probability, and continue chaining bigrams in this way until we generate the sentence-final token </s>.

In essence, sampling sentences from a language model helps us understand the kind of sentences the model considers to be more likely and thus provides insights into its knowledge representation.

import random

# Sample bigram model
bigram_model = {
    "<s>": {"I": 0.5, "You": 0.5},
    "I": {"like": 0.5, "love": 0.5},
    "You": {"like": 0.5, "love": 0.5},
    "like": {"apples": 0.5, "bananas": 0.5},
    "love": {"apples": 0.5, "bananas": 0.5},
    "apples": {"</s>": 1.0},
    "bananas": {"</s>": 1.0},
}


def generate_sentence(model):
    sentence = []
    current_word = "<s>"

    # Keep sampling the next word until the sentence-final token </s> is drawn.
    while current_word != "</s>":
        next_word_candidates = list(model[current_word].keys())
        next_word_probabilities = list(model[current_word].values())
        # random.choices draws one word, weighted by the bigram probabilities.
        current_word = random.choices(next_word_candidates, next_word_probabilities)[0]

        if current_word != "</s>":
            sentence.append(current_word)

    return " ".join(sentence)


# Generate 5 sentences
for _ in range(5):
    print(generate_sentence(bigram_model))
You love apples
You love bananas
You like apples
You like bananas
I love apples

This code defines a simple bigram model and a function generate_sentence to generate sentences using this model. The function starts with the <s> token and continues generating words based on the bigram probabilities until it encounters the </s> token. Running this code will generate and print 5 sentences using the given bigram model. Note that this example uses a very simple and limited bigram model, but the same principle can be applied to more complex models.

An example using the Brown Corpus#

import nltk
import random
from collections import defaultdict, Counter
from nltk.corpus import brown

nltk.download("brown")

# Load Brown news corpus
news_sents = brown.sents(categories="news")


# Function to create n-gram models
def create_ngram_model(sentences, n):
    model = defaultdict(Counter)
    for sent in sentences:
        sent = ["<s>"] * (n - 1) + sent + ["</s>"]
        for i in range(len(sent) - n + 1):
            ngram = tuple(sent[i : i + n])
            prefix, word = ngram[:-1], ngram[-1]
            model[prefix][word] += 1

    # Convert counts to probabilities
    for prefix, word_counts in model.items():
        total = sum(word_counts.values())
        for word, count in word_counts.items():
            model[prefix][word] = count / total

    return model


# Function to generate sentences from n-gram models
def generate_sentence(model):
    sentence = []
    # Any key of the model is an (n-1)-word prefix, so start with that many <s> tokens.
    current_ngram = ("<s>",) * len(next(iter(model)))

    while "</s>" not in current_ngram:
        next_word_candidates = list(model[current_ngram].keys())
        next_word_probabilities = list(model[current_ngram].values())
        # Draw the next word, weighted by the n-gram probabilities.
        next_word = random.choices(next_word_candidates, next_word_probabilities)[0]

        if next_word != "</s>":
            sentence.append(next_word)

        # Slide the context window: drop the oldest word, append the new one.
        current_ngram = current_ngram[1:] + (next_word,)

    return " ".join(sentence)


# Create bigram, trigram, and 4-gram models
bigram_model = create_ngram_model(news_sents, 2)
trigram_model = create_ngram_model(news_sents, 3)
fourgram_model = create_ngram_model(news_sents, 4)
[nltk_data] Downloading package brown to /home/yjlee/nltk_data...
[nltk_data]   Package brown is already up-to-date!

# Generate sentences using the models
print("Bigram model sentence:", generate_sentence(bigram_model))
print("Trigram model sentence:", generate_sentence(trigram_model))
print("4-gram model sentence:", generate_sentence(fourgram_model))
Bigram model sentence: The first year hailed the queen , Brooks Robinson slammed a local director at Rice Stadium Oct. 14 , and must have never have never is to solicit funds for violating the government and won it .
Trigram model sentence: When Mickey went to see them .
4-gram model sentence: He managed to maneuver the missile to a landing speed of 200 m.p.h. -- fast even for a high school where there were lots of cars `` might not be realistic and would not work '' .

This code loads the Brown news corpus, defines functions to create n-gram models and generate sentences using these models, and then creates bigram, trigram, and 4-gram models. It generates and prints one sentence for each model. Note that the generated sentences may vary each time the code is run due to the random sampling process.

Unknown Words#

In language modeling, we may encounter words that we’ve never seen before in the training data, known as unknown words or out-of-vocabulary (OOV) words. Handling these unknown words is an important aspect of language modeling.

There are two common strategies for dealing with unknown words:

  1. Closed Vocabulary System: In this approach, we assume a fixed vocabulary in advance. The test set can only contain words from this known vocabulary, and there will be no unknown words. However, this is not a practical approach in most real-world situations.

  2. Open Vocabulary System: We create an open vocabulary system by adding a pseudo-word <UNK> to represent all potential unknown words in the test set. There are two ways to train the probabilities of the unknown word model <UNK>:

  • Choose a fixed vocabulary (word list) in advance, convert any word not in this set to the unknown word token <UNK> in the training set, and estimate the probabilities for <UNK> based on its counts just like any other regular word in the training set.

  • Alternatively, create a vocabulary implicitly by replacing words in the training data with <UNK> based on their frequency: for example, replace all words that occur fewer than n times in the training set with <UNK>, or choose a vocabulary size V in advance, keep the top V words by frequency, and replace the rest with <UNK>. Then train the language model as usual, treating <UNK> as a regular word.

The way the vocabulary (and hence <UNK>) is chosen affects metrics like perplexity: a language model can achieve artificially low perplexity by choosing a small vocabulary and assigning the unknown word a high probability. Perplexities should therefore only be compared across language models with the same vocabulary.
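
As a rough sketch of the frequency-based approach, the snippet below replaces every word that occurs fewer than min_count times with <UNK>; the tiny corpus and the min_count threshold are assumed purely for illustration.

from collections import Counter

# Tiny toy training corpus (assumed for illustration).
train_sents = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# Keep only words seen at least `min_count` times; everything else becomes <UNK>.
min_count = 2
word_counts = Counter(word for sent in train_sents for word in sent)
vocab = {word for word, count in word_counts.items() if count >= min_count}

train_sents_unk = [
    [word if word in vocab else "<UNK>" for word in sent] for sent in train_sents
]
print(train_sents_unk)
[['the', '<UNK>', 'sat', 'on', 'the', '<UNK>'], ['the', '<UNK>', 'sat', 'on', 'the', '<UNK>']]

An n-gram model trained on train_sents_unk then has counts and probabilities for <UNK> just like any other word, so unseen test words can be mapped to <UNK> and scored.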

Smoothing#

Smoothing is an essential technique in language modeling to address the problem of unseen n-grams in the training data. The main idea behind smoothing is to assign non-zero probabilities to these unseen n-grams, which helps avoid issues like zero probabilities when evaluating a language model on a test set.

For example, consider a bigram language model trained on a corpus. If a specific word pair (bigram) never appeared in the training data but appears in the test set, the model would assign it a probability of zero. This zero probability would then affect the entire sentence’s probability, making it zero as well. Smoothing techniques help prevent this problem by redistributing some probability mass from seen n-grams to unseen n-grams.

There are various smoothing techniques, such as Additive (Laplace) smoothing, Good-Turing smoothing, and Kneser-Ney smoothing. One common method is Additive (Laplace) smoothing:

Let’s assume we have a bigram language model with vocabulary V and bigram counts \(C(w_{i-1}, w_i)\). Without smoothing, the probability of a bigram is calculated as:

\[ P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i)}{\sum_{w} C(w_{i-1}, w)} \]

With Additive (Laplace) smoothing, we add a constant k (usually k=1) to the count of each bigram, which in turn adds a non-zero probability to unseen bigrams:

\[ P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + k}{\sum_{w} C(w_{i-1}, w) + k\,|V|} \]

Here, \(|V|\) represents the size of the vocabulary. By adding k to each bigram count, we ensure that unseen bigrams get a non-zero probability, thus preventing zero probabilities in the language model evaluation.
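
The short sketch below applies this formula to a handful of toy bigram counts (assumed for illustration) and shows that an unseen bigram now receives a small but non-zero probability.

from collections import Counter, defaultdict

# Toy bigram counts (assumed for illustration); in practice they come from a corpus.
bigram_counts = defaultdict(Counter)
for w1, w2 in [("I", "like"), ("I", "like"), ("I", "love"), ("like", "apples")]:
    bigram_counts[w1][w2] += 1

vocab = {"I", "like", "love", "apples", "bananas"}


def laplace_prob(w_prev, w, counts, vocab, k=1):
    # (C(w_prev, w) + k) / (sum over w' of C(w_prev, w') + k * |V|)
    count = counts[w_prev][w]
    total = sum(counts[w_prev].values())
    return (count + k) / (total + k * len(vocab))


print(laplace_prob("I", "like", bigram_counts, vocab))     # seen:   (2 + 1) / (3 + 5) = 0.375
print(laplace_prob("I", "bananas", bigram_counts, vocab))  # unseen: (0 + 1) / (3 + 5) = 0.125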

In summary, smoothing techniques in language modeling help to handle unseen n-grams by redistributing probability mass and ensuring that no n-gram has a zero probability.

Huge Language Models and Stupid Backoff#

In some cases, we might want to use larger n-gram models (e.g., 4-gram, 5-gram) to capture more context and improve the performance of the language model. However, as the size of n-grams increases, the amount of training data required and the model’s computational complexity also increase. One simple and efficient approach to handle larger n-gram models is the “Stupid Backoff” method.

The Stupid Backoff technique is an approximation strategy that does not guarantee that its scores sum to one, but it is computationally efficient and works well in practice. The main idea is to use higher-order n-grams when available and “back off” to lower-order n-grams when a higher-order n-gram was not observed in the training data.

For example, given a 4-gram language model, the Stupid Backoff strategy can be defined as follows:

  1. If the 4-gram \((w_1, w_2, w_3, w_4)\) has been seen in the training data, use its probability.

  2. If the 4-gram is not observed, back off to the trigram \((w_2, w_3, w_4)\) and multiply its probability by a constant factor α.

  3. If the trigram is also not observed, back off to the bigram \((w_3, w_4)\) and multiply its probability by α².

  4. If the bigram is not observed, use the unigram \((w_4)\) probability and multiply it by \(\alpha^3\).

The constant factor α is usually a value between 0 and 1, such as 0.4. The formula for Stupid Backoff can be represented as:

\[ P(w_4 \mid w_1, w_2, w_3) \approx \alpha^{k} \, P(w_4 \mid w_{1+k}, \ldots, w_3) \]

Here, k is the number of backoff steps: k = 0 uses the full 4-gram, k = 1 the trigram \((w_2, w_3, w_4)\), k = 2 the bigram \((w_3, w_4)\), and k = 3 the unigram \((w_4)\) with no context at all.
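
Below is a minimal sketch of Stupid Backoff over precomputed n-gram counts. The toy counts and the ngram_counts layout (a dictionary mapping n to a Counter of n-gram tuples) are assumptions for illustration, not a standard API, and the scores it returns are not normalized probabilities.

from collections import Counter

# Toy n-gram counts keyed by n (assumed layout for illustration).
ngram_counts = {
    1: Counter({("the",): 5, ("cat",): 2, ("sat",): 2}),
    2: Counter({("the", "cat"): 2, ("cat", "sat"): 2}),
    3: Counter({("the", "cat", "sat"): 2}),
}


def stupid_backoff(word, context, counts, alpha=0.4):
    # Score word given a context tuple, backing off to shorter contexts as needed.
    scale = 1.0
    while context:
        n = len(context) + 1
        count = counts.get(n, Counter())[context + (word,)]
        if count > 0:
            # Relative frequency of the n-gram given its (n-1)-gram prefix.
            return scale * count / counts[n - 1][context]
        scale *= alpha          # one more backoff step: multiply by alpha
        context = context[1:]   # drop the leftmost context word
    # Fall back to the unigram relative frequency.
    return scale * counts[1][(word,)] / sum(counts[1].values())


print(stupid_backoff("sat", ("the", "cat"), ngram_counts))  # trigram observed: 2 / 2 = 1.0
print(stupid_backoff("sat", ("the", "dog"), ngram_counts))  # backs off twice: 0.4**2 * (2 / 9)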

In summary, Stupid Backoff is a computationally efficient strategy to handle large n-gram models by backing off to lower-order n-grams when higher-order n-grams are not observed in the training data. While it doesn’t guarantee proper probability distributions, it works well in practice for tasks like language modeling and machine translation.

Summary#

N-gram language models are a fundamental technique in natural language processing (NLP) for predicting and understanding sequences of words in a text. They model the probability of a word given a fixed number of previous words (n-1), where n denotes the size of the n-gram. N-grams are useful in various NLP tasks, such as speech recognition, machine translation, and text generation.

There are different types of n-gram models, such as unigram, bigram, trigram, and higher-order models. The choice of n depends on the available data and the desired balance between model complexity and generalization.

To estimate the probabilities in an n-gram model, we usually employ maximum likelihood estimation (MLE). MLE estimates the probability of a word by counting how often it follows a given (n-1)-word context in the training corpus and dividing by the total count of that context.

However, MLE suffers from the data sparsity problem, where many n-grams may not appear in the training data, resulting in zero probabilities. To address this issue, we use smoothing techniques, such as Laplace smoothing or Kneser-Ney smoothing, to assign non-zero probabilities to unseen n-grams.

Perplexity is a widely used metric for evaluating language models. It can be interpreted as the weighted average branching factor of the model, that is, the number of words that can plausibly follow at each point, with lower perplexity indicating better predictive performance.
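
As a quick numeric illustration (the bigram probabilities below are assumed, not estimated from data), perplexity can be computed in log space as follows.

import math

# Toy bigram probabilities for one test sentence (assumed for illustration).
bigram_probs = {
    ("<s>", "I"): 0.5,
    ("I", "like"): 0.5,
    ("like", "apples"): 0.5,
    ("apples", "</s>"): 1.0,
}

test_sentence = ["<s>", "I", "like", "apples", "</s>"]

# PP(W) = P(w_1 ... w_N) ** (-1 / N), computed in log space for numerical stability.
log_prob = sum(
    math.log(bigram_probs[(w1, w2)])
    for w1, w2 in zip(test_sentence, test_sentence[1:])
)
N = len(test_sentence) - 1  # number of predicted tokens
perplexity = math.exp(-log_prob / N)
print(perplexity)  # about 1.68; lower is better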

In real-world applications, we often encounter unknown or out-of-vocabulary (OOV) words. To handle OOV words, we can use a special <UNK> token and incorporate it into the language model during training.

Finally, n-gram models have limitations, such as the inability to capture long-range dependencies and the curse of dimensionality with higher-order models. However, they still form a crucial foundation for modern NLP techniques and serve as a building block for more advanced models.

In short,

  • N-gram language models predict word sequences in text using fixed-size context.

  • Unigram, bigram, trigram, and higher-order models represent different n-gram sizes.

  • Maximum likelihood estimation (MLE) calculates probabilities based on word occurrences.

  • Data sparsity problem addressed with smoothing techniques (e.g., Laplace, Kneser-Ney).

  • Perplexity measures language model performance; lower values indicate better prediction.

  • Unknown or out-of-vocabulary (OOV) words handled using a special <UNK> token.

  • N-gram models have limitations, but they remain foundational in natural language processing.

References#