I am using BertWordPieceTokenizer
(source) and trying to train the tokenizer with a predefined vocabulary file, where each line contains 1–5 Chinese characters and the first 1000 lines are special tokens like [PAD]
and [CLS]
.
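To illustrate, the vocabulary file is structured roughly like this (a shortened, made-up excerpt; the real file has about 1000 special-token lines before the Chinese entries, which include the tokens I expect below):
[PAD]
[UNK]
[CLS]
[SEP]
[MASK]
...
我哋
今日
陳奕迅
演唱會
...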
Here is the code I used:
from tokenizers import BertWordPieceTokenizer
# https://github.com/huggingface/tokenizers/blob/main/bindings/python/py_src/tokenizers/implementations/bert_wordpiece.py
bert_tokenizer = BertWordPieceTokenizer(
    "data/bertwordpiece_vocab.txt",  # predefined vocabulary file
    clean_text=True,
    handle_chinese_chars=True,
    strip_accents=True,
    lowercase=True,
)
# train on a large raw text corpus
bert_tokenizer.train(
    "data/large_text_file_for_training.txt",
    vocab_size=100000,
    min_frequency=100,
)
print(bert_tokenizer)
The output is:
Tokenizer(vocabulary_size=3342, model=BertWordPiece, unk_token=[UNK], sep_token=[SEP], cls_token=[CLS], pad_token=[PAD], mask_token=[MASK], clean_text=True, handle_chinese_chars=True, strip_accents=True, lowercase=True, wordpieces_prefix=##)
Looks normal to me.
Finally, to test the tokenizer, I call the .encode()
function:
bert_tokenizer.encode("我哋今日去睇陳奕迅演唱會").tokens
The output is:
['[CLS]',
'我',
'哋',
'今',
'日',
'去',
'睇',
'陳',
'[UNK]',
'[UNK]',
'演',
'唱',
'會',
'[SEP]']
but the expected result is:
['[CLS]',
'我哋',
'今日',
'去',
'睇',
'陳奕迅',
'演唱會',
'[SEP]']
Every token in the expected result is included in the original vocab.txt
. The training process simply ignores the predefined vocabulary and produces unknown tokens [UNK]
in the encode()
result.
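To show that the mismatch is not a typo in the vocabulary file, here is a quick sanity check one could run after the training code above (a sketch, assuming bertwordpiece_vocab.txt is the same file passed to the constructor; get_vocab() returns the trained tokenizer's token-to-id mapping):
# Compare the predefined vocabulary file with the vocabulary of the trained tokenizer.
with open("data/bertwordpiece_vocab.txt", encoding="utf-8") as f:
    predefined_vocab = {line.strip() for line in f if line.strip()}

trained_vocab = bert_tokenizer.get_vocab()  # token -> id after training

for token in ["我哋", "今日", "陳奕迅", "演唱會"]:
    print(token, token in predefined_vocab, token in trained_vocab)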
How can I obtain the expected result?