I am trying to create a simple SentencePieceBPETokenizer without training it.
from tokenizers import SentencePieceBPETokenizer
special_tokens = ["<unk>", "<pad>", "<cls>", "<sep>", "<mask>"]
test_tokenizer = SentencePieceBPETokenizer(unk_token="<unk>", replacement="▁")
test_tokenizer.add_special_tokens(special_tokens)
print(test_tokenizer.token_to_id("<unk>"))  # prints 0
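For reference, dumping the full vocabulary at this point (a quick check using get_vocab() from the tokenizers library, which as far as I know includes added tokens by default) shows just those five special tokens:
print(test_tokenizer.get_vocab())
# expected: something like {'<unk>': 0, '<pad>': 1, '<cls>': 2, '<sep>': 3, '<mask>': 4}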
This makes sense so far. Then I wrap the tokenizer above with PreTrainedTokenizerFast:
from transformers import PreTrainedTokenizerFast
wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=test_tokenizer,
    unk_token="<unk>",
    pad_token="<pad>",
    cls_token="<cls>",
    sep_token="<sep>",
    mask_token="<mask>",
    padding_side="left",
)
wrapped_tokenizer.push_to_hub('tokenizer-test', private=True)
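As a side note, in case it is easier to reproduce without a Hub account, I assume the same tokenizer can also be saved and reloaded locally (just a sketch; the directory name is a placeholder):
wrapped_tokenizer.save_pretrained("./tokenizer-test")  # placeholder local path
# reload later with AutoTokenizer.from_pretrained("./tokenizer-test")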
I then load and use it like this:
from transformers import AutoTokenizer
tokenizer_test = AutoTokenizer.from_pretrained("myHFusername/tokenizer-test")
print(tokenizer_test.unk_token_id)  # prints 0
So far this is correct too. But when I try to encode some text, since the tokenizer contains no tokens other than those special tokens, every word should be tokenized to the unknown token.
print(tokenizer_test.encode_plus("I am a boy.", add_special_tokens=True))
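What I expected was output along these lines (only a rough illustration; the exact number of pieces depends on the pre-tokenizer, but every piece should map to the unk id 0):
# something like: {'input_ids': [0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}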
However, it raises an exception:
Exception: Unk token `<unk>` not found in the vocabulary
Then I double-check whether the unknown token exists by running:
print(tokenizer_test.vocab)
which prints: {'<unk>': 0, '<cls>': 2, '<pad>': 1, '<mask>': 4, '<sep>': 3}
The unknown token does exist!
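In case it helps, I believe the underlying tokenizers object can also be inspected directly via backend_tokenizer and to_str() (a sketch; I have not verified the exact JSON layout), to see whether <unk> is registered inside the BPE model itself or only as an added token:
import json
state = json.loads(tokenizer_test.backend_tokenizer.to_str())
print(state["model"]["unk_token"])                    # the unk token the BPE model expects
print(state["model"]["vocab"])                        # presumably empty, since nothing was trained
print([t["content"] for t in state["added_tokens"]])  # the five special tokens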
My question is: how do I resolve this issue?