How does SentencePieceBPETokenizer choose tokens from the dataset during training?

I have a dataset containing approximately 30 million Cantonese (a variety of Chinese) sentences. In my setup of SentencePieceBPETokenizer, I have added a pre-tokenizer that segments Cantonese text into words. Unlike English, Cantonese and other varieties of Chinese do not use spaces to separate words; instead, a word is formed from one or more consecutive characters.
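
For reference, here is a minimal sketch of this kind of setup using the Hugging Face `tokenizers` library, following its documented custom pre-tokenizer pattern. The `segment()` helper is a hypothetical stand-in for a real Cantonese word segmenter (e.g. pycantonese), and the file path and vocabulary size are placeholders, not values from the question:

```python
from tokenizers import SentencePieceBPETokenizer, NormalizedString, PreTokenizedString
from tokenizers.pre_tokenizers import PreTokenizer


def segment(text: str) -> list:
    """Hypothetical Cantonese word segmenter -- replace with a real one
    (e.g. pycantonese.segment). Falls back to one character per 'word'."""
    return list(text)


class CantonesePreTokenizer:
    def split_words(self, i: int, normalized_string: NormalizedString) -> list:
        # Map each segmented word back to character offsets in the original
        # string so the tokenizer keeps correct alignments.
        text = str(normalized_string)
        splits, cursor = [], 0
        for word in segment(text):
            start = text.index(word, cursor)
            cursor = start + len(word)
            splits.append(normalized_string[start:cursor])
        return splits

    def pre_tokenize(self, pretok: PreTokenizedString) -> None:
        pretok.split(self.split_words)


tokenizer = SentencePieceBPETokenizer()
# This replaces the implementation's default (Metaspace) pre-tokenizer.
# Note: a tokenizer carrying a custom Python pre-tokenizer cannot be
# serialized with tokenizer.save().
tokenizer.pre_tokenizer = PreTokenizer.custom(CantonesePreTokenizer())


def sentences():
    # Stream the corpus line by line rather than loading ~30M sentences at once.
    with open("cantonese_sentences.txt", encoding="utf-8") as f:  # placeholder path
        for line in f:
            yield line.strip()


tokenizer.train_from_iterator(
    sentences(),
    vocab_size=32000,   # placeholder; tune for your corpus
    min_frequency=2,
    special_tokens=["<unk>"],
)
```

With this setup, BPE merges are learned only within the word boundaries produced by the pre-tokenizer, so the quality of the word segmentation directly shapes which tokens the trainer can choose.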