I have a BertTokenizer to which I have added some tokens:
<code>from transformers import BertTokenizer, BartForConditionalGeneration
tokenizer = BertTokenizer.from_pretrained("raptorkwok/wordseg-tokenizer")
print(len(tokenizer)) # 245289
print(tokenizer.vocab_size) # 51271
</code>
The 51,271 base tokens are stored in vocab.txt, while the added tokens are stored in added_tokens.json. If I want to make use of these new tokens in a model, e.g. BartForConditionalGeneration, how can I apply them to the model?
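To illustrate the relationship between vocab.txt and added_tokens.json, here is a minimal, self-contained sketch using a toy vocabulary file in place of the actual wordseg-tokenizer (the file contents and token names are made up for demonstration):

```python
# Illustrative only: a toy BertTokenizer built from a minimal vocab file,
# standing in for "raptorkwok/wordseg-tokenizer".
import os
import tempfile
from transformers import BertTokenizer

tmpdir = tempfile.mkdtemp()
vocab_path = os.path.join(tmpdir, "vocab.txt")
with open(vocab_path, "w", encoding="utf-8") as f:
    f.write("\n".join(["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "hello", "world"]))

tokenizer = BertTokenizer(vocab_file=vocab_path)
print(tokenizer.vocab_size)  # counts only the entries in vocab.txt (7 here)

# add_tokens extends the tokenizer beyond vocab.txt; these additions are
# what end up in added_tokens.json when the tokenizer is saved
num_added = tokenizer.add_tokens(["newtok1", "newtok2"])
print(len(tokenizer))               # vocab_size + num_added (9 here)
print(tokenizer.get_added_vocab())  # includes 'newtok1' and 'newtok2'
```

This is why `len(tokenizer)` (245,289) and `tokenizer.vocab_size` (51,271) disagree in my case: `vocab_size` ignores the added tokens.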
For example, I called the model with the codes:
<code>model = BartForConditionalGeneration.from_pretrained("fnlp/bart-base-chinese")
model.resize_token_embeddings(len(tokenizer))
</code>
The model has a new size of 245,289.
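For context, here is a sketch of what `resize_token_embeddings` does, using a tiny randomly initialised BART configuration in place of fnlp/bart-base-chinese (all hyperparameters below are made-up toy values):

```python
# Illustrative only: a tiny randomly initialised BART model standing in
# for "fnlp/bart-base-chinese", to show the effect of resizing.
from transformers import BartConfig, BartForConditionalGeneration

config = BartConfig(
    vocab_size=100, d_model=16,
    encoder_layers=1, decoder_layers=1,
    encoder_attention_heads=2, decoder_attention_heads=2,
    encoder_ffn_dim=32, decoder_ffn_dim=32,
)
model = BartForConditionalGeneration(config)
print(model.get_input_embeddings().weight.shape)  # torch.Size([100, 16])

# resize keeps the existing embedding rows and appends freshly
# (randomly) initialised rows for the new token ids
model.resize_token_embeddings(120)
print(model.get_input_embeddings().weight.shape)  # torch.Size([120, 16])
```

As I understand it, the rows appended for the new tokens are randomly initialised rather than learned, which is what prompts my question below.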
What is my next step to let the model adopt the new tokens? Should I re-train the model?