Lab: Training Tokenizers#

In this lab lecture, we will train several tokenizers on a corpus of Korean text, using the Korean Wikipedia dataset from the Hugging Face Hub. We will train Byte Pair Encoding (BPE), WordPiece, and Unigram tokenizers and compare how they segment the same text. We will also train a tokenizer with the SentencePiece library.

Step 1: Install necessary libraries#

First, we need to install the necessary libraries: Hugging Face’s tokenizers library for training, the datasets library for loading our corpus, and the sentencepiece library. You can install them with:

%pip install tokenizers datasets sentencepiece

Step 2: Load the dataset#

We’ll be using the wiki dataset in Korean, which we can load using the datasets library.

from datasets import load_dataset

wiki = load_dataset("lcw99/wikipedia-korean-20221001")
Downloading metadata: 100%|██████████| 1.34k/1.34k [00:00<00:00, 5.92MB/s]
Downloading readme: 100%|██████████| 22.0/22.0 [00:00<00:00, 110kB/s]
Downloading and preparing dataset wikipedia/20221001.ko (download: 690.36 MiB, generated: 1.22 GiB, post-processed: Unknown size, total: 1.89 GiB) to /home/yj.lee/.cache/huggingface/datasets/lcw99___parquet/lcw99--wikipedia-korean-20221001-91be346ce97972f5/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...
Downloading data: 100%|██████████| 375M/375M [00:24<00:00, 15.1MB/s]
Downloading data: 100%|██████████| 188M/188M [00:13<00:00, 14.3MB/s]
Downloading data: 100%|██████████| 162M/162M [00:22<00:00, 7.24MB/s]
Downloading data files: 100%|██████████| 1/1 [01:10<00:00, 70.74s/it]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 799.22it/s]
                                                                                          
Dataset parquet downloaded and prepared to /home/yj.lee/.cache/huggingface/datasets/lcw99___parquet/lcw99--wikipedia-korean-20221001-91be346ce97972f5/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.
100%|██████████| 1/1 [00:00<00:00, 117.21it/s]

Step 3: Prepare the text#

We need to extract the raw text from the dataset and write it to a plain-text file, since the tokenizer trainers below read their input from files.

# change this to your own path
wiki_filepath = "../tmp/wiki.txt"

text = "\n".join(article["text"] for article in wiki["train"])
with open(wiki_filepath, "w", encoding="utf-8") as f:
    f.write(text)
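The join above builds the entire corpus (over 1 GiB of text) as a single string in memory before writing it out. If memory is tight on your machine, a streaming variant such as the following sketch (our addition, reusing the same wiki dataset and wiki_filepath as above) writes one article at a time:

# Memory-friendly alternative: write each article as it is read instead of
# concatenating the whole corpus into one string first.
with open(wiki_filepath, "w", encoding="utf-8") as f:
    for article in wiki["train"]:
        f.write(article["text"])
        f.write("\n")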

Step 4: Train the tokenizers#

Now, we’ll train each of our tokenizers on our text.

Byte Pair Encoding (BPE)#

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace


# change this to your own path
bpe_tokenizer_path = "../tmp/bpe_tokenizer.json"

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train([wiki_filepath])
tokenizer.save(bpe_tokenizer_path)

WordPiece#

from tokenizers.models import WordPiece

# change this to your own path
wordpiece_tokenizer_path = "../tmp/wordpiece_tokenizer.json"

tokenizer = Tokenizer(WordPiece())
tokenizer.train([wiki_filepath])
tokenizer.save(wordpiece_tokenizer_path)

Unigram#

from tokenizers.models import Unigram

# change this to your own path
unigram_tokenizer_path = "../tmp/unigram_tokenizer.json"

tokenizer = Tokenizer(Unigram())
tokenizer.train([wiki_filepath])
tokenizer.save(unigram_tokenizer_path)

SentencePiece#

For SentencePiece, we’ll use the SentencePiece library directly.

import sentencepiece as spm

# change this to your own path; the trainer writes <prefix>.model and <prefix>.vocab
sentencepiece_model_prefix = "../tmp/sentencepiece_tokenizer"
sentencepiece_tokenizer_path = sentencepiece_model_prefix + ".model"
num_threads = 1

spm.SentencePieceTrainer.train(
    "--input={} --model_prefix={} --vocab_size=32000 --num_threads={}".format(
        wiki_filepath, sentencepiece_model_prefix, num_threads
    )
)
sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=./tmp/wiki.txt --model_prefix=sentencepiece --vocab_size=32000 --num_threads=40
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : 
trainer_spec {
  input: ./tmp/wiki.txt
  input_format: 
  model_prefix: sentencepiece
  model_type: UNIGRAM
  vocab_size: 32000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 40
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
  enable_differential_privacy: 0
  differential_privacy_noise_level: 0
  differential_privacy_clipping_threshold: 0
}
normalizer_spec {
  name: nmt_nfkc
  add_dummy_prefix: 1
  remove_extra_whitespaces: 1
  escape_whitespaces: 1
  normalization_rule_tsv: 
}
denormalizer_spec {}
trainer_interface.cc(351) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.
trainer_interface.cc(183) LOG(INFO) Loading corpus: ./tmp/wiki.txt
trainer_interface.cc(378) LOG(WARNING) Found too long line (4536 > 4192).
trainer_interface.cc(380) LOG(WARNING) Too long lines are skipped in the training.
trainer_interface.cc(381) LOG(WARNING) The maximum length can be changed with --max_sentence_length=<size> flag.
trainer_interface.cc(145) LOG(INFO) Loaded 1000000 lines
trainer_interface.cc(145) LOG(INFO) Loaded 2000000 lines
trainer_interface.cc(145) LOG(INFO) Loaded 3000000 lines
trainer_interface.cc(145) LOG(INFO) Loaded 4000000 lines
trainer_interface.cc(145) LOG(INFO) Loaded 5000000 lines
trainer_interface.cc(145) LOG(INFO) Loaded 6000000 lines
trainer_interface.cc(145) LOG(INFO) Loaded 7000000 lines
trainer_interface.cc(145) LOG(INFO) Loaded 8000000 lines
trainer_interface.cc(145) LOG(INFO) Loaded 9000000 lines
trainer_interface.cc(145) LOG(INFO) Loaded 10000000 lines
trainer_interface.cc(145) LOG(INFO) Loaded 11000000 lines
trainer_interface.cc(145) LOG(INFO) Loaded 12000000 lines
trainer_interface.cc(145) LOG(INFO) Loaded 13000000 lines
trainer_interface.cc(122) LOG(WARNING) Too many sentences are loaded! (13749557), which may slow down training.
trainer_interface.cc(124) LOG(WARNING) Consider using --input_sentence_size=<size> and --shuffle_input_sentence=true.
trainer_interface.cc(127) LOG(WARNING) They allow to randomly sample <size> sentences from the entire corpus.
trainer_interface.cc(407) LOG(INFO) Loaded all 13749557 sentences
trainer_interface.cc(414) LOG(INFO) Skipped 369 too long sentences.
trainer_interface.cc(423) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(423) LOG(INFO) Adding meta_piece: <s>
trainer_interface.cc(423) LOG(INFO) Adding meta_piece: </s>
trainer_interface.cc(428) LOG(INFO) Normalizing sentences...
trainer_interface.cc(537) LOG(INFO) all chars count=551013136
trainer_interface.cc(548) LOG(INFO) Done: 99.95% characters are covered.
trainer_interface.cc(558) LOG(INFO) Alphabet size=4699
trainer_interface.cc(559) LOG(INFO) Final character coverage=0.9995
trainer_interface.cc(591) LOG(INFO) Done! preprocessed 13441046 sentences.
unigram_model_trainer.cc(247) LOG(INFO) Making suffix array...
unigram_model_trainer.cc(251) LOG(INFO) Extracting frequent sub strings... node_num=224525048
unigram_model_trainer.cc(301) LOG(INFO) Initialized 1004699 seed sentencepieces
trainer_interface.cc(597) LOG(INFO) Tokenizing input sentences with whitespace: 13441046
trainer_interface.cc(608) LOG(INFO) Done! 11688710
unigram_model_trainer.cc(607) LOG(INFO) Using 11688710 sentences for EM training
unigram_model_trainer.cc(623) LOG(INFO) EM sub_iter=0 size=773874 obj=20.4511 num_tokens=38686257 num_tokens/piece=49.9904
unigram_model_trainer.cc(623) LOG(INFO) EM sub_iter=1 size=723238 obj=15.2004 num_tokens=38170992 num_tokens/piece=52.7779
unigram_model_trainer.cc(623) LOG(INFO) EM sub_iter=0 size=460480 obj=15.2601 num_tokens=41257696 num_tokens/piece=89.5972
unigram_model_trainer.cc(623) LOG(INFO) EM sub_iter=1 size=459793 obj=15.1461 num_tokens=41280580 num_tokens/piece=89.7808
unigram_model_trainer.cc(623) LOG(INFO) EM sub_iter=0 size=344844 obj=15.2885 num_tokens=43137869 num_tokens/piece=125.094
unigram_model_trainer.cc(623) LOG(INFO) EM sub_iter=1 size=344843 obj=15.2281 num_tokens=43142897 num_tokens/piece=125.109
unigram_model_trainer.cc(623) LOG(INFO) EM sub_iter=0 size=258632 obj=15.4073 num_tokens=45086624 num_tokens/piece=174.327
unigram_model_trainer.cc(623) LOG(INFO) EM sub_iter=1 size=258632 obj=15.3484 num_tokens=45087203 num_tokens/piece=174.33
unigram_model_trainer.cc(623) LOG(INFO) EM sub_iter=0 size=193974 obj=15.5576 num_tokens=46726298 num_tokens/piece=240.889
unigram_model_trainer.cc(623) LOG(INFO) EM sub_iter=1 size=193974 obj=15.5026 num_tokens=46726287 num_tokens/piece=240.889
unigram_model_trainer.cc(623) LOG(INFO) EM sub_iter=0 size=145480 obj=15.7268 num_tokens=48196596 num_tokens/piece=331.294
unigram_model_trainer.cc(623) LOG(INFO) EM sub_iter=1 size=145480 obj=15.6766 num_tokens=48196650 num_tokens/piece=331.294
unigram_model_trainer.cc(623) LOG(INFO) EM sub_iter=0 size=109110 obj=15.9223 num_tokens=49621655 num_tokens/piece=454.786
unigram_model_trainer.cc(623) LOG(INFO) EM sub_iter=1 size=109110 obj=15.872 num_tokens=49621543 num_tokens/piece=454.785
unigram_model_trainer.cc(623) LOG(INFO) EM sub_iter=0 size=81832 obj=16.1363 num_tokens=51032940 num_tokens/piece=623.631
unigram_model_trainer.cc(623) LOG(INFO) EM sub_iter=1 size=81832 obj=16.084 num_tokens=51033150 num_tokens/piece=623.633
unigram_model_trainer.cc(623) LOG(INFO) EM sub_iter=0 size=61374 obj=16.373 num_tokens=52469102 num_tokens/piece=854.908
unigram_model_trainer.cc(623) LOG(INFO) EM sub_iter=1 size=61374 obj=16.314 num_tokens=52470264 num_tokens/piece=854.927
unigram_model_trainer.cc(623) LOG(INFO) EM sub_iter=0 size=46030 obj=16.6282 num_tokens=53971634 num_tokens/piece=1172.53
unigram_model_trainer.cc(623) LOG(INFO) EM sub_iter=1 size=46030 obj=16.5661 num_tokens=53972222 num_tokens/piece=1172.54
unigram_model_trainer.cc(623) LOG(INFO) EM sub_iter=0 size=35200 obj=16.8836 num_tokens=55349142 num_tokens/piece=1572.42
unigram_model_trainer.cc(623) LOG(INFO) EM sub_iter=1 size=35200 obj=16.8155 num_tokens=55349925 num_tokens/piece=1572.44
trainer_interface.cc(686) LOG(INFO) Saving model: sentencepiece.model
trainer_interface.cc(698) LOG(INFO) Saving vocabs: sentencepiece.vocab

Step 5: Compare the tokenizers#

Now, let’s load each tokenizer and see how they tokenize a sample sentence.

from tokenizers import Tokenizer
import sentencepiece as spm

sample_sentence = "안녕하세요. 이 문장은 토크나이저를 테스트하기 위한 샘플 문장입니다."

bpe_tokenizer_path = "../tmp/bpe_tokenizer.json"
wordpiece_tokenizer_path = "../tmp/wordpiece_tokenizer.json"
unigram_tokenizer_path = "../tmp/unigram_tokenizer.json"
sentencepiece_tokenizer_path = "../tmp/sentencepiece_tokenizer.model"

# BPE
bpe = Tokenizer.from_file(bpe_tokenizer_path)
print("BPE:", bpe.encode(sample_sentence).tokens)

# WordPiece
wordpiece = Tokenizer.from_file(wordpiece_tokenizer_path)
print("WordPiece:", wordpiece.encode(sample_sentence).tokens)

# Unigram
unigram = Tokenizer.from_file(unigram_tokenizer_path)
print("Unigram:", unigram.encode(sample_sentence).tokens)

# SentencePiece (use a separate name so we don't shadow the spm module)
sp = spm.SentencePieceProcessor()
sp.load(sentencepiece_tokenizer_path)
print("SentencePiece:", sp.encode_as_pieces(sample_sentence))
BPE: ['안', '녕', '하', '세', '요', '.', '이', '문', '장은', '토', '크', '나이', '저', '를', '테', '스트', '하기', '위한', '샘', '플', '문', '장', '입', '니다', '.']
WordPiece: ['안', '##녕', '##하', '##세', '##요', '##.', '## ', '##이', '## ', '##문', '##장', '##은', '## ', '##토', '##크', '##나', '##이', '##저', '##를', '## ', '##테', '##스', '##트', '##하', '##기', '## ', '##위', '##한', '## ', '##샘', '##플', '## ', '##문', '##장', '##입', '##니', '##다', '##.']
SentencePiece: ['▁', '안녕하세요', '.', '▁이', '▁문장', '은', '▁토크', '나', '이', '저', '를', '▁테스트', '하기', '▁위한', '▁샘플', '▁문장', '입니다', '.']

You should see that each tokenizer breaks down the sentence differently. Some might keep “안녕하세요” as one token, while others might break it down further. This is the core difference between these tokenization strategies.

Remember, there is no universally “best” tokenizer; the right choice depends on the specific task and language. Try these tokenizers in your own models and see which one works best for your task!
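Beyond eyeballing the token lists, a simple quantitative check is the average number of tokens each tokenizer produces per sentence: for a fixed vocabulary size, fewer tokens generally means more compact encodings. The sketch below is our addition; it reuses the bpe, wordpiece, unigram, and sp objects loaded above, and the second sample sentence is made up purely for illustration.

# Average tokens per sentence for each tokenizer (lower = more compact).
sentences = [
    "안녕하세요. 이 문장은 토크나이저를 테스트하기 위한 샘플 문장입니다.",
    "위키백과는 누구나 자유롭게 편집할 수 있는 온라인 백과사전입니다.",
]


def avg_tokens(encode_fn):
    return sum(len(encode_fn(s)) for s in sentences) / len(sentences)


print("BPE:", avg_tokens(lambda s: bpe.encode(s).tokens))
print("WordPiece:", avg_tokens(lambda s: wordpiece.encode(s).tokens))
print("Unigram:", avg_tokens(lambda s: unigram.encode(s).tokens))
print("SentencePiece:", avg_tokens(sp.encode_as_pieces))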

Training Tokenizers for GPT, BERT, and T5#

In this section, we will train tokenizers with specifications similar to those used by the GPT, BERT, and T5 models: Byte Pair Encoding (BPE) for GPT, WordPiece for BERT, and Unigram for T5.

Byte Pair Encoding (BPE) - GPT#

GPT-2 uses a byte-level BPE tokenizer with a vocabulary of 50,257 tokens; here we train a whitespace-pre-tokenized BPE tokenizer with a comparable vocabulary size of 50,000. BPE was originally designed for data compression, but it has proven to work well for tokenizing text in neural language models.

To train a BPE tokenizer with the same specifications as GPT:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

wiki_filepath = "../tmp/wiki.txt"

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=50000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.train([wiki_filepath], trainer)
tokenizer.save("../tmp/bpe_gpt_tokenizer.json")

WordPiece - BERT#

BERT models use the WordPiece tokenizer with a vocabulary of roughly 30,000 tokens (30,522 for bert-base-uncased). WordPiece is a data-driven tokenization strategy that builds its vocabulary from frequent subwords, which allows better handling of out-of-vocabulary (OOV) words.

To train a WordPiece tokenizer with the same specifications as BERT:

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.normalizers import BertNormalizer
from tokenizers.pre_tokenizers import BertPreTokenizer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = BertNormalizer(lowercase=False)
# like BERT, split on whitespace and punctuation before applying WordPiece
tokenizer.pre_tokenizer = BertPreTokenizer()

trainer = WordPieceTrainer(
    vocab_size=30000, special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
)
tokenizer.train([wiki_filepath], trainer)
tokenizer.save("../tmp/wordpiece_bert_tokenizer.json")

Unigram - T5#

T5 models use a SentencePiece Unigram tokenizer with a vocabulary of 32,000 tokens. Unlike BPE, which builds its vocabulary by merging frequent pairs, the Unigram algorithm starts from a large candidate vocabulary and iteratively prunes pieces to maximize the likelihood of the training text under a unigram language model; it also enables subword regularization by sampling alternative segmentations.

To train a Unigram tokenizer with the same specifications as T5:

from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer
from tokenizers.pre_tokenizers import Metaspace

tokenizer = Tokenizer(Unigram())
# mark word boundaries with "▁", as SentencePiece (and hence T5) does
tokenizer.pre_tokenizer = Metaspace()

trainer = UnigramTrainer(
    vocab_size=32000,
    special_tokens=["<pad>", "</s>", "<unk>", "<s>"],
    unk_token="<unk>",
)
tokenizer.train([wiki_filepath], trainer)
tokenizer.save("../tmp/unigram_t5_tokenizer.json")

In summary, we have trained three tokenizers with specifications similar to those of the GPT, BERT, and T5 models: BPE for GPT with a vocabulary of 50,000, WordPiece for BERT with a vocabulary of 30,000, and Unigram for T5 with a vocabulary of 32,000. All three are data-driven subword tokenizers; they differ in how they build their vocabularies, which affects how they handle out-of-vocabulary words and, ultimately, how well models trained on top of them perform.