From what I understand of the Hugging Face docs, we should use a pretrained model together with its own tokenizer for good performance. My doubt is that BERT uses WordPiece, while RoBERTa (again a BERT-style architecture) uses BPE as its tokenization approach.

- Can we mix and match any model and tokenizer if we are pretraining the model from scratch, as in the case of RoBERTa?
- Can I pretrain a BERT/DistilBERT model from scratch using a BPE/Unigram tokenizer?
- Does the rule of using the same tokenizer as the model apply only to fine-tuning or inference?
- Is the architecture of each model tied to its tokenization approach, or not?
Specifically, I am trying to train a DistilBERT model from scratch using the Unigram tokenization approach, roughly as sketched below.
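Here is a minimal sketch of what I have in mind, assuming the `tokenizers` and `transformers` libraries; the file name `corpus.txt` and the vocabulary size are placeholders for my own data and settings:

```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers
from transformers import PreTrainedTokenizerFast, DistilBertConfig, DistilBertForMaskedLM

# 1. Train a Unigram tokenizer on my own corpus (corpus.txt is a placeholder)
tokenizer = Tokenizer(models.Unigram())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.UnigramTrainer(
    vocab_size=30522,  # placeholder vocab size
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    unk_token="[UNK]",
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# 2. Wrap the trained tokenizer so it can be used with transformers
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    pad_token="[PAD]",
    unk_token="[UNK]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

# 3. Build a DistilBERT model from scratch with a matching vocab size
config = DistilBertConfig(vocab_size=hf_tokenizer.vocab_size)
model = DistilBertForMaskedLM(config)
```

Is this kind of pairing (DistilBERT architecture + Unigram tokenizer trained from scratch) valid, or am I breaking an assumption the model depends on?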