I know that SentencePiece is an open-source tokenization framework from Google that includes tokenization algorithms such as BPE and Unigram. Algorithms such as BPE, WordPiece, and Unigram typically require a pre-tokenization step on the raw text during training. However, SentencePiece claims that this step is unnecessary: it simply replaces spaces with a special meta symbol (▁, U+2581) and treats the text as a continuous sequence of Unicode characters.
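To make this concrete, here is roughly how I am calling the library (the corpus file, model prefix, and vocab size are just placeholder values I picked for illustration):

```python
import sentencepiece as spm

# Train a BPE model directly on raw text -- no pre-tokenization step.
spm.SentencePieceTrainer.train(
    input='corpus.txt',       # plain raw text, one sentence per line (placeholder file)
    model_prefix='bpe_demo',  # arbitrary prefix for the output .model/.vocab files
    vocab_size=1000,
    model_type='bpe',
)

sp = spm.SentencePieceProcessor(model_file='bpe_demo.model')

# Spaces appear as the meta symbol '\u2581' inside the pieces.
print(sp.encode('Hello world', out_type=str))
# e.g. ['▁He', 'llo', '▁world']  (actual pieces depend on the training corpus)
```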
My question is: since SentencePiece doesn't perform pre-tokenization, how does it carry out BPE training? I have gone through many tutorials and blog posts, and while they show how to use the tool, they don't explain SentencePiece's actual processing flow.
Thank you in advance for your help.