I have a dataset of approximately 30 million Cantonese (a variety of Chinese) sentences. In my implementation of SentencePieceBPETokenizer, I have added a custom pre-tokenizer that segments each sentence into words. Unlike English, Cantonese and Chinese do not use spaces to separate words; instead, a word is formed by combining one or more characters.
The pre-tokenizer works fine and correctly segments a sentence into words, for example:
我哋今日去睇陳奕迅演唱會 --> ['我哋', '今日', '去', '睇', '陳奕迅', '演唱會']
(The sentence translates to "We are going to watch Eason Chan's concert today." Eason Chan (陳奕迅) is a Hong Kong-based singer.)
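For context, my CantonesePreTokenizer follows the usual pattern for custom pre-tokenizers in the tokenizers library: a class with a pre_tokenize method that delegates to a split callback. The sketch below is a simplified stand-in, not my exact code; in particular, pycantonese.segment is used here only as a placeholder for my actual word-segmentation logic.

from typing import List

import pycantonese  # placeholder segmenter for illustration only
from tokenizers import NormalizedString, PreTokenizedString


class CantonesePreTokenizer:
    """Split each sentence into Cantonese words using an external word segmenter."""

    def cantonese_split(self, i: int, normalized: NormalizedString) -> List[NormalizedString]:
        text = str(normalized)
        splits = []
        cursor = 0
        for word in pycantonese.segment(text):
            start = text.index(word, cursor)  # locate the word so offsets stay aligned
            stop = start + len(word)
            splits.append(normalized[start:stop])
            cursor = stop
        return splits

    def pre_tokenize(self, pretok: PreTokenizedString):
        pretok.split(self.cantonese_split)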
However, after training with the code below, the same sentence is tokenized incorrectly as:
['我哋', '今日', '去', '睇', '陳', '<unk>', '<unk>', '演唱會']
i.e., the singer's name is not in the token list.
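For completeness, I obtain these tokens roughly like this (simplified):

encoding = tokenizer.encode("我哋今日去睇陳奕迅演唱會")
print(encoding.tokens)
# ['我哋', '今日', '去', '睇', '陳', '<unk>', '<unk>', '演唱會']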
After noticing this issue, I counted how many times the term "陳奕迅" and its individual characters occur in the dataset (a sketch of how I counted follows the two lists below):
- 陳: 135,557 occurrences (in the token list)
- 奕: 5,492 occurrences (not in the token list)
- 迅: 13,520 occurrences (not in the token list)
- 陳奕迅: 2,861 occurrences (not in the token list)
I also checked the term "演唱會" (meaning "concert"):
- 演: 136,425 occurrences (in the token list)
- 唱: 147,882 occurrences (in the token list)
- 會: 3,763,791 occurrences (in the token list)
- 演唱: 17,244 occurrences (in the token list)
- 演唱會: 16,206 occurrences (in the token list)
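The counts above come from a simple check along these lines (simplified; my actual script processes the corpus in batches):

vocab = tokenizer.get_vocab()  # token -> id mapping of the trained tokenizer

for term in ["陳", "奕", "迅", "陳奕迅", "演", "唱", "會", "演唱", "演唱會"]:
    occurrences = sum(sentence.count(term) for sentence in mega_list)
    print(f"{term}: {occurrences} occurrences, in token list: {term in vocab}")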
My question is: how does SentencePieceBPETokenizer choose tokens from the word-segmented dataset during training? Since I set min_frequency = 100, I expected every word with at least 100 occurrences to be added to the token list. The trained vocabulary contains only 37,976 tokens, while vocab_size is configured as 400,000, which should leave more than enough room for all candidate tokens. Why does the term "陳奕迅" not appear in the list?
Here is my minimal training code:

from tokenizers import SentencePieceBPETokenizer
from tokenizers.pre_tokenizers import PreTokenizer


def get_training_corpus(batch_size=1000):
    # Yield the 30M sentences in batches so the trainer can stream them
    for i in range(0, len(mega_list), batch_size):
        yield mega_list[i : i + batch_size]


special_tokens = ["<unk>", "<pad>", "<cls>", "<sep>", "<mask>"]

tokenizer = SentencePieceBPETokenizer()
tokenizer.pre_tokenizer = PreTokenizer.custom(CantonesePreTokenizer())
tokenizer.train_from_iterator(
    get_training_corpus(),
    vocab_size=400000,
    min_frequency=100,
    special_tokens=special_tokens,
)
where mega_list is a list of the 30 million Cantonese sentences. The dataset is available here.
The logic is similar to this official example.
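For reference, my understanding (which may well be wrong, and is part of what I am asking) is that train_from_iterator sets up a BPE trainer internally and passes min_frequency and vocab_size through to it, roughly as in the sketch below. This is my assumption about the library's behavior, not code I have run.

from tokenizers import trainers

# Sketch of what I believe SentencePieceBPETokenizer.train_from_iterator configures internally
trainer = trainers.BpeTrainer(
    vocab_size=400000,   # upper bound on the vocabulary size
    min_frequency=100,   # minimum frequency for a merge to be kept (my understanding)
    special_tokens=["<unk>", "<pad>", "<cls>", "<sep>", "<mask>"],
)
tokenizer._tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)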