Decoding and Search Strategies#

In recent years, there has been a surge of interest in open-ended language generation due to the development of large language models (LLMs) like GPT-2, XLNet, OpenAI-GPT, CTRL, TransfoXL, XLM, BART, T5, GPT-3, and BLOOM. These models have shown promising results in various generation tasks such as open-ended dialogue, summarization, and story generation. Improved decoding methods have played a significant role in the success of these models.

Auto-regressive language generation assumes that the text being generated can be broken down into a sequence of subparts. Each part depends on the previous parts, allowing an auto-regressive decoder to generate text one token at a time based on its predecessors.

The probability of generating a word sequence \(w_{1:T}\) given an initial context word sequence \(W_0\) can be expressed as:

\[ P(w_{1:T} | W_0 ) = \prod_{t=1}^T P(w_{t} | w_{1:t-1}, W_0) \text{, with } w_{1:0} = \emptyset. \]

The length \(T\) of the word sequence is determined on the fly: generation typically stops when an End-Of-Sequence (EOS) token is generated from the probability distribution \(P(w_t | w_{1:t-1}, W_0)\).

This auto-regressive approach allows LLMs to generate coherent and contextually relevant text based on the initial context \(W_0\). Different decoding strategies, such as greedy search, beam search, and sampling methods like top-k sampling, have been used to improve the generation quality and diversity.

For example, consider a language model generating a story based on the context “Once upon a time, in a small village”:

  1. Greedy search would select the word with the highest probability at each step, potentially leading to repetitive and less diverse text.

  2. Beam search would maintain a fixed number of partial sequences (the beam width) and extend them, selecting the most probable overall sequence. This reduces the risk of missing a high-probability sequence but may still suffer from repetitiveness.

  3. Top-k sampling would sample the next word from the top-k most probable words, increasing diversity in the generated text.

These strategies help LLMs generate meaningful and diverse text for various language generation tasks.
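To make the auto-regressive loop concrete, here is a minimal, self-contained sketch of greedy decoding. The toy vocabulary and the random stand-in for \(P(w_t | w_{1:t-1}, W_0)\) are purely illustrative; a real model would compute the distribution from the prefix:

import numpy as np

rng = np.random.default_rng(0)
vocab = ["<eos>", "there", "lived", "a", "kind", "baker"]

def next_token_probs(prefix):
    # stand-in for P(w_t | w_{1:t-1}, W_0); a real LM computes
    # these probabilities from the prefix with a neural network
    logits = rng.normal(size=len(vocab))
    exp = np.exp(logits - logits.max())  # softmax
    return exp / exp.sum()

def greedy_decode(context, max_len=10):
    tokens = list(context)
    for _ in range(max_len):
        probs = next_token_probs(tokens)
        word = vocab[int(np.argmax(probs))]  # greedy: pick the most likely word
        if word == "<eos>":                  # stop once EOS is selected
            break
        tokens.append(word)
    return " ".join(tokens)

print(greedy_decode(["Once", "upon", "a", "time,", "in", "a", "small", "village"]))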

# If you run this notebook in Colab, set Hardware accelerator to GPU.
# Then, install transformers
%pip install -U transformers tensorflow
import tensorflow as tf
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
model = TFGPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)
All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.
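All of the generation examples below condition on the same context sentence, so we encode it once up front (the context string is taken from the generated outputs shown below):

# encode the conditioning context; every generate() call below reuses these input_ids
input_ids = tokenizer.encode(
    "I enjoy studying deep learning for natural language processing",
    return_tensors="tf",
)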

Sampling#

Sampling means randomly picking the next word \(w_t\) according to its conditional probability distribution:

\[ w_t \sim P(w|w_{1:t-1}) \]


In contrast to deterministic methods like greedy search, language generation using sampling introduces an element of randomness.

For example, given the starting word “The”, the next word “car” is sampled from the conditional probability distribution \(P(w | \text{"The"})\). Following this, the word “drives” is sampled from the distribution \(P(w | \text{"The"}, \text{"car"})\).

This stochastic approach results in a more diverse range of generated text compared to deterministic methods, as it allows for the exploration of various word combinations based on their conditional probabilities.
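The sampling step itself is a single draw from the conditional distribution. As a minimal sketch (the vocabulary and probabilities here are hypothetical):

import numpy as np

vocab = ["car", "dog", "drives", "is", "house"]
probs = np.array([0.4, 0.2, 0.2, 0.15, 0.05])  # hypothetical P(w | "The")

# randomly pick the next word according to its conditional probability
next_word = np.random.choice(vocab, p=probs)
print(next_word)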

In the following code snippet, we set do_sample=True and deactivate Top-K sampling via top_k=0. By doing this, we enable sampling for generating text:

# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# activate sampling and deactivate top_k by setting top_k sampling to 0
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=100,
    top_k=0,
)

print("Output:\n" + 100 * "-")
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
Output:
----------------------------------------------------------------------------------------------------
I enjoy studying deep learning for natural language processing, especially to see how it does useful things over evolutionary time once you shift the ideology of theoretical language. I also really like developing algorithmically tailored introductory programming books to teach my students, but I simply don't have the spare time to do all that I did before. HALAX fight3r also released their original issue warning the public that Gaming with Game Software may be detrimental to your solidified web skills and appearances, likely to undermine your beliefs and help

While the generated text seems fine at first glance, a closer look reveals that it lacks coherence and contains words that do not sound natural. This incoherent gibberish is a common problem with sampling word sequences.

To address this issue, we can adjust the temperature of the softmax function. This modification sharpens the distribution \(P(w|w_{1:t-1})\), increasing the likelihood of high probability words and decreasing the likelihood of low probability words. The following code snippet demonstrates how to apply a temperature of 0.7:

../../../_images/deepnlp_2_sampling_search_with_temp.png

Fig. 94 Sampling Search with Temperature#

Note

Temperature is a hyperparameter used in language generation models to control the randomness and diversity of the generated text. It is applied to the softmax function, which converts logits (the raw model outputs) into probabilities. The temperature essentially adjusts the sharpness of the probability distribution over the possible next words in a sequence.

When the temperature is high (e.g., greater than 1), the model’s output probabilities become more uniform, meaning that the generated text will be more diverse and random, sometimes even incoherent. High temperature values encourage the model to explore less likely words and make it more creative.

On the other hand, when the temperature is low (e.g., less than 1), the probability distribution becomes more focused, and the model becomes more conservative. The generated text will be more coherent and predictable, with a higher probability of choosing the most likely words. However, when the temperature is too low (approaching 0), the generated text may become too repetitive and less interesting, resembling the output of greedy decoding.
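The effect is easy to see numerically. Below is a minimal sketch of temperature scaling; the logits are made up for illustration:

import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    scaled = logits / temperature        # temperature rescales the logits
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

logits = np.array([3.0, 2.0, 1.0])
print(softmax_with_temperature(logits, 1.0))  # [0.665 0.245 0.090]
print(softmax_with_temperature(logits, 0.7))  # sharper: ~[0.771 0.185 0.044]
print(softmax_with_temperature(logits, 2.0))  # flatter: ~[0.506 0.307 0.186]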

# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# use temperature to decrease the sensitivity to low probability candidates
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=100,
    top_k=0,
    temperature=0.7,
)

print("Output:\n" + 100 * "-")
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
Output:
----------------------------------------------------------------------------------------------------
I enjoy studying deep learning for natural language processing, but this is a simplistic example of how the language can be applied to a complex problem.

Using more complex architectures

The most common approach is to look at a whole set of ML designs that are quite complex and require a lot of work to deploy. For example, the first ML project I'm learning is the human language learning project. Let's take a look at the human-language learning project.

The current human-language

By applying a temperature of 0.7, the generated text becomes less random and more coherent, with fewer unnatural n-grams. However, it’s important to note that as the temperature approaches 0, the sampling process becomes similar to greedy decoding, which comes with its own set of issues, such as a lack of diversity in the generated text.

Top-K Sampling#

../../../_images/deepnlp_2_top_k_sampling.png

Fig. 95 Top-K Sampling#

Top-K sampling is a technique used in language generation models to make the generated text more coherent and meaningful. In Top-K sampling, only the \(K\) most likely next words are considered, and the probability mass is redistributed among these \(K\) next words.

Setting \(K=6\), for example, limits our sampling pool to the top 6 words in each step. This can eliminate unlikely candidates, making the generated text more human-like. However, one concern with Top-K sampling is that it doesn’t adapt dynamically to the number of words filtered from the next word probability distribution \(P(w|w_{1:t-1})\). This could lead to problems, as some words might be sampled from a very sharp distribution, while others from a much flatter distribution.
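The filtering step itself can be sketched in a few lines (illustrative, not the library's internal implementation):

import numpy as np

def top_k_filter(probs, k):
    # keep only the k most likely words and redistribute
    # the probability mass among them
    top_idx = np.argsort(probs)[-k:]
    filtered = np.zeros_like(probs)
    filtered[top_idx] = probs[top_idx]
    return filtered / filtered.sum()

probs = np.array([0.30, 0.25, 0.20, 0.10, 0.08, 0.05, 0.02])
print(top_k_filter(probs, k=3))  # only the top 3 words keep nonzero mass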

Let’s see how Top-K can be used in the library by setting top_k=50:

# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# set top_k to 50
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=100,
    top_k=50,
)

print("Output:\n" + 100 * "-")
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
Output:
----------------------------------------------------------------------------------------------------
I enjoy studying deep learning for natural language processing, as well as in terms of learning from experience. As well as having my students who are learning an intermediate language and learn how to use it in a meaningful way or with a minimal effort, I can also focus on learning programming languages like C++ (though I wouldn't go so far as to call that "advanced"), Python (for those that aren't familiar with Python) or Java (I think I actually want to see what I can learn

The generated text appears to be more human-sounding than before. However, using a fixed K size can have drawbacks:

  1. It could limit the model’s creativity for flatter probability distributions, as it might filter out reasonable candidates that could have led to more diverse and interesting text.

  2. For sharper distributions, a fixed \(K\) could let ill-fitting words into the sample pool, increasing the risk that the model produces gibberish.

To address these concerns, Ari Holtzman et al. (2019) introduced Top-p or nucleus sampling, which dynamically selects the \(K\) value based on a threshold probability \(p\), leading to a more adaptive approach.

Top-p (nucleus) sampling#

../../../_images/deepnlp_2_top_p_sampling.png

Fig. 96 Top-p Sampling#

Top-p sampling, also known as nucleus sampling, is a technique that chooses words from the smallest possible set whose cumulative probability exceeds a predefined threshold probability p. The probability mass is then redistributed among this set of words. This approach allows the size of the word set to dynamically increase and decrease according to the next word’s probability distribution.

In Top-p sampling, the model selects the minimum number of words needed to exceed the probability mass p. For example, with p=0.92, the model selects the most likely words whose cumulative probability is at least 92%. This method keeps a wide range of words when the next word is less predictable and narrows down the selection when the next word seems more predictable.
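The nucleus selection step can be sketched similarly (again illustrative):

import numpy as np

def top_p_filter(probs, p=0.92):
    # sort descending and keep the smallest prefix whose
    # cumulative probability reaches p, then renormalize
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    filtered = np.zeros_like(probs)
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    return filtered / filtered.sum()

probs = np.array([0.5, 0.3, 0.1, 0.06, 0.04])
print(top_p_filter(probs, p=0.92))  # keeps 4 words: 0.5 + 0.3 + 0.1 + 0.06 = 0.96 >= 0.92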

To activate Top-p sampling, set 0 < top_p < 1:

# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# deactivate top_k sampling and sample only from 92% most likely words
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=100,
    top_p=0.92,
    top_k=0,
)

print("Output:\n" + 100 * "-")
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
Output:
----------------------------------------------------------------------------------------------------
I enjoy studying deep learning for natural language processing. Learn more at lecture.lightlink.com.

Ciao — Social conscious "synchronization" between conscious and nonconscious participants at the Cognition Experience. Learn more at lecture.lightlink.com.

Everett's Processio

A video discussion

The Evolution of Mindfulness in Today's Economy

The Recent Advances in the Study of Mindfulness and Success in Society. The results from St Louis University

The generated text appears more human-like. In practice, both Top-p and Top-K sampling work well. You can also combine Top-p with Top-K sampling, which can prevent the selection of very low-ranked words while allowing for some dynamic word selection.
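Conceptually, combining the two just applies both filters in sequence. Reusing the top_k_filter and top_p_filter sketches from above:

# Top-K first, then Top-p on the renormalized distribution
probs = np.array([0.5, 0.3, 0.1, 0.06, 0.04])
combined = top_p_filter(top_k_filter(probs, k=4), p=0.92)
print(combined)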

To get multiple independently sampled outputs, set the parameter num_return_sequences > 1:

# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# set top_k = 50 and set top_p = 0.95 and num_return_sequences = 3
sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=100,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3,
)

print("Output:\n" + 100 * "-")
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
Output:
----------------------------------------------------------------------------------------------------
0: I enjoy studying deep learning for natural language processing, as I've learned it quickly enough for me to understand it. The idea of modeling learning comes from both of these worlds. The first is that of natural language processing and the second is a mathematical theory that allows you to draw meaningful conclusions. To understand what is in the world of Natural Language Processing, go to the book by Mike Biederman of the University of Toronto or check out his website at www.layers.com. You are also
1: I enjoy studying deep learning for natural language processing. My favourite part, even though it is just about my main job, is learning to play a game. That is exactly what I am doing on my website. All this is part of my job as a developer and I love playing games.

My job with this company was to create a website where my students could explore the world as they would enjoy learning about computer science, physics, or any other science.

In doing so, I
2: I enjoy studying deep learning for natural language processing. In my experience it doesn't help my training to teach my students how to do it, but it's an awesome source for training and has led to many exciting discoveries about Deep Learning and neural nets in general.

Learning to code

Another way to learn Deep Learning is by coding. For example, I often code to solve problem A and then use it for the solving of problem B. This is where the language really comes in. I

The output consists of multiple independently sampled sequences, which can be further analyzed or used according to specific requirements.

Summary#

Various decoding or search strategies are used in open-ended language generation tasks. Here’s an overview of the most common strategies and their characteristics:

  • Greedy Search:

    • At each time step, Greedy Search selects the word with the highest predicted probability to follow the previous word.

    • One of the main issues with Greedy Search is that it may miss high-probability words at a certain time step if they are preceded by a low-probability word in the previous step.

    • This method often generates repetitive and predictable text, which may not resemble natural human language.

  • Beam Search:

    • Beam Search keeps track of the num_beams most likely word sequences at each step and ultimately outputs the most likely sequence.

    • While this method may seem promising, it struggles when the output length can be highly variable, as is the case with open-ended text generation.

    • Like Greedy Search, Beam Search can also produce repetitive and uninteresting text that doesn’t align with the way humans perform the same task.

  • Sampling With Top-k + Top-p:

    • This strategy combines random sampling with Top-k and Top-p filtering to address the issues faced by Greedy and Beam Search.

    • Sampling involves choosing the next word randomly based on its conditional probability distribution.

    • In Top-k sampling, the model selects the k most likely words and redistributes the probability mass among them before choosing the next word.

    • Top-p sampling adds an additional constraint to Top-k by choosing words from the smallest set whose cumulative probability exceeds p.

    • While these methods produce more fluent and human-like text, they may still generate repetitive word sequences.

Recent research indicates that the flaws of Greedy and Beam Search, mainly generating repetitive word sequences, are caused by the model itself rather than the decoding method. Top-k and Top-p sampling also suffer from generating repetitive word sequences, but they often provide better results in open-ended language generation tasks compared to Greedy and Beam Search.

cf. Welleck et al. (2019), Welleck et al. (2020)

Prompt Engineering#

The Career of the Future

../../../_images/deepnlp_2_prompt.png

Fig. 97 Prompt Engineering (source)#

With the No-Code revolution on the horizon and the advent of new-age technologies like GPT-3, we may see a significant difference between the careers of today and those of tomorrow…

When designing training prompts, the goal is to obtain a zero-shot response from the model. If that isn’t possible, consider providing a few examples rather than an entire corpus. The standard workflow for training prompt design should follow this order: Zero-Shot → Few-Shot → Corpus-based Priming.

  • Step 1: Clearly define the problem you want to solve and categorize it into one of the possible natural language tasks, such as classification, Q&A, text generation, creative writing, etc.

  • Step 2: Determine if it’s possible to obtain a solution using a zero-shot approach (i.e., without priming the GPT-3 model with any external training examples).

  • Step 3: If you believe that external examples are necessary to prime the model for your use case, revisit Step 2 and carefully reconsider the need for additional examples.

  • Step 4: Consider how the problem might be encountered in a textual format, given GPT-3’s “text-in, text-out” interface. Contemplate various ways to represent your problem using text.

  • Step 5: If you decide to use external examples, use as few as possible while ensuring diversity in your examples. This helps avoid overfitting the model or skewing its predictions.

By following these steps, you can effectively design prompts that address specific problems while minimizing the need for additional training examples, resulting in more accurate and efficient outcomes.
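For example, a zero-shot prompt and its few-shot counterpart for a sentiment classification task might look like this (the prompt texts are illustrative, not from any benchmark):

zero_shot_prompt = """Classify the sentiment of the following review as Positive or Negative.

Review: The battery lasts all day and the screen is gorgeous.
Sentiment:"""

few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: Terrible support, and the device died within a week.
Sentiment: Negative

Review: Exceeded my expectations in every way.
Sentiment: Positive

Review: The battery lasts all day and the screen is gorgeous.
Sentiment:"""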