I want to train a custom tokenizer from scratch. The online tutorials I have been following suggest passing a list of special tokens to the train_from_iterator() function:
from tokenizers import SentencePieceBPETokenizer

special_tokens = ["<unk>", "<pad>", "<cls>", "<sep>", "<mask>", "<s>", "</s>"]

tokenizer = SentencePieceBPETokenizer()
tokenizer.train_from_iterator(
    get_training_corpus(),
    vocab_size=100000,
    min_frequency=1000,  # minimum number of occurrences before a token is added to the vocab
    special_tokens=special_tokens,
)
My question: is there a specific order in which the tokens should be listed? And how does SentencePieceBPETokenizer map each special token in the special_tokens list to its corresponding attribute, for example unk_token to <unk> and cls_token to <cls>?
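My current guess (an assumption on my part, not something the tutorials spell out) is that the role-to-token mapping has to be declared explicitly when wrapping the trained tokenizer for use with transformers, along these lines; the output directory name is just a placeholder:

from transformers import PreTrainedTokenizerFast

# "tokenizer" is the SentencePieceBPETokenizer trained above.
# Save the underlying tokenizers object to JSON, then wrap it and declare
# explicitly which string plays which role (this binding step is my assumption).
tokenizer.save("tokenizer.json")

wrapped = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",
    unk_token="<unk>",
    pad_token="<pad>",
    cls_token="<cls>",
    sep_token="<sep>",
    mask_token="<mask>",
    bos_token="<s>",
    eos_token="</s>",
)
wrapped.save_pretrained("my_custom_tokenizer")  # placeholder directory; writes special_tokens_map.json among other files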
Working with the HuggingFace example run_translation.py, I compared the special_tokens_map.json produced for a BART base tokenizer with the one produced for my custom tokenizer, and found that the structure is very different. See below:
Left: BART base tokenizer (from HuggingFace fnlp/bart-base-chinese); right: my custom tokenizer generated by SentencePieceBPETokenizer.train_from_iterator()
How can my custom tokenizer generate the same special_tokens_map.json as the BART base tokenizer? The config.json files generated with the two tokenizers are almost identical, except for the vocab_size.
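For reference, this is roughly how I am comparing the two maps (the local path is a placeholder for wherever my custom tokenizer was saved):

from transformers import AutoTokenizer

bart_tokenizer = AutoTokenizer.from_pretrained("fnlp/bart-base-chinese")
print(bart_tokenizer.special_tokens_map)    # role name -> token string

custom_tokenizer = AutoTokenizer.from_pretrained("my_custom_tokenizer")  # placeholder path
print(custom_tokenizer.special_tokens_map)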
Tracing through the source code, the BpeTrainer class (implemented in Rust) does not appear to specify how the training process maps the special tokens to their roles.
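As far as I can tell, the special tokens do end up in the trained vocabulary (they can be looked up by string), but nothing on the trained object seems to record which role each of them plays:

# Quick sanity check after training: the tokens exist in the vocab,
# but no role information (unk/pad/cls/...) appears to be attached to them.
for token in special_tokens:
    print(token, tokenizer.token_to_id(token))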
Or should I not be using SentencePieceBPETokenizer at all?