For SentencePieceBPETokenizer, Exception: Unk token `<unk>` not found in the vocabulary
I am trying to create a simple SentencePieceBPETokenizer without training it.
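For context, here is roughly what I am doing (a minimal sketch; I am assuming the HuggingFace `tokenizers` library, and the special-token handling is just an example):

```python
from tokenizers import SentencePieceBPETokenizer

# Create the tokenizer without training it; the underlying BPE model
# therefore starts with an empty vocabulary.
tokenizer = SentencePieceBPETokenizer(unk_token="<unk>")

# "<unk>" is not in the (still empty) vocabulary yet.
print(tokenizer.token_to_id("<unk>"))  # None

# Explicitly registering it as a special token adds it to the vocabulary.
tokenizer.add_special_tokens(["<unk>"])
print(tokenizer.token_to_id("<unk>"))  # 0
```

From the message, I assume the problem is that `<unk>` is not part of the vocabulary the untrained BPE model is built with, but I have not figured out where it needs to be added.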
How does SentencePieceBPETokenizer choose tokens from the dataset during training?
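For reference, the kind of training call I have in mind looks like this (a sketch; the file path and the vocab_size / min_frequency values are placeholders):

```python
from tokenizers import SentencePieceBPETokenizer

tokenizer = SentencePieceBPETokenizer(unk_token="<unk>")

# Train on the raw sentence file(s). vocab_size caps the size of the final
# vocabulary, and min_frequency is the minimum count a pair needs in the
# data before it can be merged into a new token.
tokenizer.train(
    files=["cantonese_sentences.txt"],   # placeholder path
    vocab_size=32000,                    # placeholder value
    min_frequency=2,
    special_tokens=["<unk>", "<pad>", "<s>", "</s>"],
)
```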
I have a dataset of approximately 30 million Cantonese (a variant of Chinese) sentences. In my implementation of SentencePieceBPETokenizer, I have added a pre-tokenizer that splits Cantonese characters into words. Unlike English, Cantonese and Chinese do not use spaces to separate words; instead, multiple Cantonese/Chinese characters combine to form a word (a vocabulary item).
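This is the kind of pre-tokenization I mean (a sketch; the `Split` pre-tokenizer and the CJK character range in the regex are my own illustrative choices, not necessarily the right ones):

```python
from tokenizers import Regex, pre_tokenizers

# Isolate every CJK character as its own unit, since the sentences contain
# no spaces; BPE merges can then build multi-character words from these units.
cjk_splitter = pre_tokenizers.Split(Regex(r"[\u4e00-\u9fff]"), behavior="isolated")

# Each character becomes its own piece: 我 / 哋 / 去 / 飲 / 茶
print(cjk_splitter.pre_tokenize_str("我哋去飲茶"))
```

Whether feeding character-level units into BPE training this way is the right approach is part of what I am unsure about, which is why I want to understand how the tokens are chosen during training.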