Lab: Finetuning an MLM#

Introduction#

In this lab session, we are going to take the BERT model we previously pretrained with the Masked Language Modeling (MLM) objective and finetune it for a downstream task: sentiment analysis. Sentiment analysis is a common Natural Language Processing (NLP) task that involves determining the sentiment expressed in a piece of text, typically classifying it as positive, negative, or neutral.

During pretraining, the model learned the syntax and semantics of the language. Now, during finetuning, we will teach it how to apply that understanding to a specific task.

Here’s a brief overview of the procedure we will follow:

  1. Preparing the Environment: We’ll start by setting up the necessary libraries and tools in our Google Colab environment.

  2. Loading the Pretrained Model and Tokenizer: Next, we will load our pretrained BERT model and the tokenizer we used during pretraining.

  3. Preparing the Dataset: We will prepare our sentiment analysis dataset by loading it, cleaning the text, and splitting it into training and validation sets.

  4. Tokenizing and Formatting the Data: We’ll then tokenize our data using the previously trained tokenizer and format it for the sentiment analysis task.

  5. Modifying the Model for Sentiment Analysis: Before starting the finetuning process, we need to modify our BERT model to make it suitable for sentiment analysis. We’ll do this by adding a new classification layer on top of BERT.

  6. Training the Model: Once our model is ready, we’ll train it on our sentiment analysis dataset using Hugging Face’s Trainer class.

  7. Validating the Model: After training, we will validate our model’s performance on the validation dataset.

  8. Saving and Loading the Model: Finally, we’ll save our finetuned model for future use and learn how to load it from disk.

Remember, this whole process is happening within the Google Colab environment, which provides a convenient and powerful platform for running our model training tasks.

Now, let’s dive in and start finetuning our BERT model for sentiment analysis.

Preparing the Environment#

Before we start, we need to set up our working environment in Google Colab. This involves installing the necessary libraries and enabling a GPU for our computations.

Here are the steps:

  1. Checking the Python Version

Google Colab should come with Python 3.7 or later pre-installed. You can verify this by running:

!python --version
  2. Installing the Hugging Face Transformers Library

We will use the Hugging Face Transformers library for our BERT model and tokenizer. To install it, run:

!pip install transformers
  3. Installing the Hugging Face Datasets Library

We also need the Hugging Face Datasets library for handling our dataset. You can install it by running:

!pip install datasets
  4. Setting up the GPU

Google Colab provides access to a free GPU that we can use to train our models faster. To set up the GPU, follow these steps:

  • Click on the Runtime tab in the Google Colab menu.

  • Select Change runtime type.

  • In the pop-up window, under Hardware accelerator, choose GPU and click Save.

After these steps, your Google Colab environment is ready for training and finetuning models.
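To confirm that PyTorch can actually see the GPU, you can run a quick check (a minimal sketch; torch comes pre-installed on Colab):

import torch

# Confirm that a CUDA GPU is visible to PyTorch
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))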

Remember to import necessary libraries before starting the next steps:

# %pip install transformers datasets torch
from transformers import (
    BertForMaskedLM,
    BertForSequenceClassification,
    PreTrainedTokenizerFast,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset, load_metric
import numpy as np
/usr/local/lib/python3.8/dist-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

These libraries will help us load the model and tokenizer, prepare the dataset, train the model, and evaluate its performance.

Loading the Pretrained Model and Tokenizer#

Now that our environment is prepared, we can load our pretrained BERT model and the tokenizer we trained along with it. These components have been saved from the previous steps we completed.

Here’s how we can do it:

# Load the pretrained BERT model
model = BertForMaskedLM.from_pretrained("../tmp/bert_base_uncased")

# Load the tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("../tmp/bert_base_uncased")

In the above code, the BertForMaskedLM.from_pretrained and PreTrainedTokenizerFast.from_pretrained methods are used to load the model and tokenizer, respectively. We specify the path to the directory where we saved the pretrained model and tokenizer, and these methods handle the rest.
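As a quick optional sanity check, you can encode a short sentence and decode it back to make sure the tokenizer behaves as expected:

# Quick sanity check: encode and decode a short sentence
sample = "this movie was surprisingly good"
encoded = tokenizer(sample)
print(encoded["input_ids"])
print(tokenizer.decode(encoded["input_ids"]))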

Now that we have loaded the pretrained BERT model and tokenizer, we can move forward to prepare our sentiment analysis dataset.

Preparing the Dataset#

For the task of sentiment analysis, we will be using the IMDb dataset which contains movie reviews along with their sentiment polarity (positive/negative). This dataset is widely used for sentiment analysis tasks and is available through the Hugging Face Datasets library.

Here’s how to load and prepare the IMDb dataset:

# Load the IMDb dataset
dataset = load_dataset("imdb")

# Print out the number of items in the train and test sets
print(f"Number of training examples: {len(dataset['train'])}")
print(f"Number of testing examples: {len(dataset['test'])}")
Found cached dataset imdb (/home/yj.lee/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)
100%|██████████| 3/3 [00:00<00:00, 854.88it/s]
Number of training examples: 25000
Number of testing examples: 25000

This will load the IMDb dataset and split it into a training set and a test set.

Next, we split our training data further into training and validation sets:

# Split the training set once into train (90%) and validation (10%) sets
split = dataset["train"].train_test_split(test_size=0.1, seed=42)
train_dataset = split["train"]
valid_dataset = split["test"]
test_dataset = dataset["test"]

Here, we call the train_test_split method once and keep both pieces of the resulting split: a training set (90%) and a validation set (10%). Doing the split in a single call (with a fixed seed) ensures the two sets don’t overlap. The test set remains unchanged.
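You can quickly verify the split sizes; with the 25,000 training examples above, a 90/10 split should give 22,500 training and 2,500 validation examples:

# Verify the sizes of the three splits
print(len(train_dataset), len(valid_dataset), len(test_dataset))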

In the next section, we’ll preprocess and tokenize this dataset, preparing it for training our sentiment analysis model.

Tokenizing and Formatting the Data#

Now that we have loaded our data, the next step is to preprocess it. This includes tokenizing the text and formatting it so that it can be used as input to our BERT model. As we are performing a sentiment analysis task, we’ll also need to format the labels.

Here’s how we can tokenize and format our data:

# Define a function to tokenize the data
def tokenize_function(element):
    outputs = tokenizer(
        element["text"],
        padding="longest",
        truncation=True,
        max_length=512,
    )
    return outputs


# Tokenize the data
train_dataset = train_dataset.map(tokenize_function, batched=True)
valid_dataset = valid_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)
Loading cached processed dataset at /home/yj.lee/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-82520e7875fc69ed.arrow
# Format the dataset to output PyTorch tensors
train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])
valid_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])
test_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])
test_dataset.features
{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['neg', 'pos'], id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}

In this code, we first define a tokenize_function that tokenizes the text, pads each batch to its longest sequence, and truncates anything longer than 512 tokens. We then map this function over our train, validation, and test datasets with batched=True to tokenize them.

Finally, we set the format of our datasets to output PyTorch tensors. We specify the columns that we want in the output: input_ids, attention_mask, and label.
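If you want to spot-check the result, you can pull out one training example and decode part of it back into text (a quick optional sanity check):

# Inspect one tokenized training example
example = train_dataset[0]
print(example["label"])
print(tokenizer.decode(example["input_ids"][:20]))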

Now our data is ready to be used for training our sentiment analysis model. In the next section, we’ll modify our BERT model to make it suitable for this task.

Modifying the Model for Sentiment Analysis#

We’ve pretrained a BERT model using a Masked Language Model (MLM) objective and loaded it into memory. However, this model isn’t suitable for sentiment analysis as it is. We need to add a classifier on top of the base BERT model to classify the sentiments.

First, let’s make sure the model’s token embedding matrix matches the tokenizer’s vocabulary size. Since we are reusing the tokenizer from pretraining, this call doesn’t change anything here, but it is a useful safety check whenever the tokenizer’s vocabulary has been extended:

model.resize_token_embeddings(len(tokenizer))
Embedding(30000, 768, padding_idx=0)

Next, we’ll modify the model for sentiment analysis. The Hugging Face Transformers library makes this easy with the BertForSequenceClassification model. This model is the same as the standard BERT model, but with an added classification layer on top. Here’s how to modify the model:

# Load the model with a sequence classification head
model = BertForSequenceClassification.from_pretrained(
    "../tmp/bert_base_uncased", num_labels=2
)
Some weights of the model checkpoint at ../tmp/bert_base_uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ../tmp/bert_base_uncased and are newly initialized: ['classifier.weight', 'bert.pooler.dense.weight', 'bert.pooler.dense.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

In the above code, we’re using the from_pretrained method to load our pretrained checkpoint with a freshly initialized classification head on top. We specify num_labels=2 because our sentiment analysis task has two classes: negative and positive. Note that this reloads the weights from disk, so if you had actually extended the tokenizer and resized the embeddings above, you would need to call resize_token_embeddings again on this newly loaded model.
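Optionally, you can also attach readable label names to the model configuration, following the IMDb label order shown earlier (0 = neg, 1 = pos), so that downstream tools report names instead of raw indices:

# Optional: map class indices to readable names (IMDb: 0 = neg, 1 = pos)
model.config.id2label = {0: "neg", 1: "pos"}
model.config.label2id = {"neg": 0, "pos": 1}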

Our model is now ready to be trained for sentiment analysis. In the next section, we’ll define the training arguments and start the training process.

Training the Model#

Now that our model is set up for sentiment analysis and our data is prepared, we can start training. For this, we’ll use the Trainer class from the Hugging Face Transformers library.

Before we can start training, we need to define some training arguments using the TrainingArguments class:

# Define the training arguments
training_args = TrainingArguments(
    output_dir="../tmp/results",
    num_train_epochs=2,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="../tmp/logs",
)

Here, output_dir is the directory where training outputs (such as checkpoints) will be saved, num_train_epochs is the number of training epochs, per_device_train_batch_size and per_device_eval_batch_size are the batch sizes for training and evaluation respectively, warmup_steps is the number of steps over which the learning rate ramps up from zero, and weight_decay is the weight decay coefficient used for regularization. We also set logging_dir so that training logs are written to disk.
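As a rough sanity check on the training length: with the 22,500-example training split and a per-device batch size of 64 on a single GPU, we expect about 352 optimizer steps per epoch, or 704 steps over 2 epochs, which matches the Trainer output below:

# Expected number of optimizer steps (single GPU, no gradient accumulation)
num_examples = 22_500
steps_per_epoch = -(-num_examples // 64)  # ceil(22500 / 64) = 352
print(steps_per_epoch * 2)  # 704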

With these training arguments defined, we can now create our Trainer and start training:

# Load the accuracy metric once
accuracy_metric = load_metric("accuracy")


# Define a function for computing the metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy_metric.compute(predictions=predictions, references=labels)


# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    compute_metrics=compute_metrics,
)

# Start training
trainer.train()
/home/yj.lee/.local/lib/python3.8/site-packages/transformers/optimization.py:407: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(

TrainOutput(global_step=704, training_loss=0.45673076672987506, metrics={'train_runtime': 932.9201, 'train_samples_per_second': 48.236, 'train_steps_per_second': 0.755, 'total_flos': 1.18399974912e+16, 'train_loss': 0.45673076672987506, 'epoch': 2.0})

In the above code, we first define a compute_metrics function that will be used to compute the accuracy of our model. We then initialize our Trainer with our model, the training arguments, the training and validation datasets, and the compute_metrics function. Finally, we start training with the train method.

This completes the training step. Our model is now finetuned for sentiment analysis. In the next section, we’ll validate our model’s performance on the validation dataset.

Validating the Model#

Now that our model is trained, we need to evaluate its performance on unseen data. We will use our validation dataset for this purpose. We’ll leverage the Trainer’s evaluate method, which takes care of the entire evaluation process.

Here’s how you can do it:

# Evaluate the model
evaluation_results = trainer.evaluate()

# Print the evaluation results
for key, value in evaluation_results.items():
    print(f"{key}: {value:.4f}")
eval_loss: 0.2014
eval_accuracy: 0.9272
eval_runtime: 19.4237
eval_samples_per_second: 128.7090
eval_steps_per_second: 2.0590
epoch: 2.0000

In the above code, the evaluate method returns a dictionary with the evaluation results. We then print out these results. The dictionary usually contains the loss and any metrics that we specified in the compute_metrics function during training.

It’s always a good idea to inspect these results to understand how well our model is doing. Remember that a model that performs well on the training data but poorly on the validation data is probably overfitting. Conversely, a model that performs poorly on both is likely underfitting.
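Since the original IMDb test set was kept aside, you can optionally run the same evaluation on it for a final, fully held-out check; the Trainer’s evaluate method accepts an alternative dataset:

# Optionally evaluate on the held-out IMDb test set as well
test_results = trainer.evaluate(eval_dataset=test_dataset)
print(test_results)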

This completes the validation step. In the final section, we’ll save our finetuned model for future use and learn how to load it from disk.

Saving and Loading the Model#

After finetuning our model and validating its performance, we’ll want to save it for future use. We might also want to load it back into memory at a later time. In this section, we’ll see how to do both of these tasks.

Here’s how you can save your model:

# Save the model
model.save_pretrained("../tmp/sentiment_analysis_model")

In the above code, save_pretrained is a method that saves both the model’s weights and its configuration. We specify the directory where we want to save the model.
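It is also a good idea to save the tokenizer to the same directory, so that everything needed for inference lives in one place:

# Save the tokenizer alongside the model
tokenizer.save_pretrained("../tmp/sentiment_analysis_model")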

Now, suppose you want to load this model at a later time. You can do this as follows:

# Load the model
model = BertForSequenceClassification.from_pretrained("../tmp/sentiment_analysis_model")

In this code, we use the from_pretrained method, as we did before, to load our model. We specify the path to the directory where we saved our model, and this method takes care of the rest.

Remember that when you load a model, you also need to load the corresponding tokenizer. You can do this the same way you loaded the tokenizer before.
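As a quick end-to-end check (assuming you also saved the tokenizer to this directory as sketched above), you can run the reloaded model on a sample review:

import torch

# Reload the tokenizer and classify a sample review with the finetuned model
tokenizer = PreTrainedTokenizerFast.from_pretrained("../tmp/sentiment_analysis_model")
inputs = tokenizer("A wonderful, heartfelt film.", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())  # 0 = negative, 1 = positive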

This completes our guide on finetuning a BERT model for sentiment analysis using Google Colab. You now know how to prepare your environment, load a pretrained model and tokenizer, prepare your dataset, tokenize and format your data, modify your model for sentiment analysis, train your model, validate it, and finally, save and load it.