How is SentencePiece trained?

I know that SentencePiece is an open-source tokenization framework from Google that implements algorithms such as BPE and Unigram. Algorithms like BPE, WordPiece, and Unigram typically require a pre-tokenization step on the raw text during training. However, SentencePiece claims this step is unnecessary: it simply replaces spaces with a special meta symbol (▁, U+2581) and treats the text as a continuous sequence of Unicode characters.
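
To make this concrete, here is a minimal sketch using the `sentencepiece` Python package. The corpus file name `corpus.txt` and the other parameters are placeholder assumptions, not anything from the original post; the point is only that the trainer consumes raw text directly, with no pre-tokenization step.

```python
# Minimal sketch: training SentencePiece directly on raw text, no pre-tokenization.
# Assumptions: a plain-text file "corpus.txt" (one sentence per line) and the
# `sentencepiece` pip package installed.
import sentencepiece as spm

# Train on raw sentences; SentencePiece handles whitespace itself by
# replacing spaces with the meta symbol "▁" (U+2581).
spm.SentencePieceTrainer.train(
    input="corpus.txt",        # raw text, not pre-tokenized
    model_prefix="sp_demo",    # writes sp_demo.model and sp_demo.vocab
    vocab_size=8000,
    model_type="unigram",      # could also be "bpe", "char", or "word"
)

# Load the trained model and tokenize a sentence into subword pieces.
sp = spm.SentencePieceProcessor(model_file="sp_demo.model")
print(sp.encode("Hello world", out_type=str))  # e.g. ['▁Hello', '▁world'], depending on the corpus
```

Note that the pieces carry the ▁ marker where a space used to be, so the original text can be recovered losslessly by concatenating the pieces and mapping ▁ back to a space.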