I use a BertTokenizer and add my custom tokens with the add_tokens() function. Minimal sample code:
from transformers import (
    BartForConditionalGeneration,
    BertTokenizer,
    EarlyStoppingCallback,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

checkpoint = 'fnlp/bart-base-chinese'
tokenizer = BertTokenizer.from_pretrained(checkpoint)
tokenizer.add_tokens(["Token1", "Token2"])  # just some samples, I added a million tokens
model = BartForConditionalGeneration.from_pretrained(checkpoint, output_attentions=True, output_hidden_states=True)
training_args = Seq2SeqTrainingArguments(
    output_dir=output_model,
    evaluation_strategy="epoch",
    optim="adamw_torch",
    eval_steps=1000,
    save_strategy="epoch",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=1,
    num_train_epochs=30,
    predict_with_generate=True,
    remove_unused_columns=True,
    fp16=True,
    metric_for_best_model="bleu",
    load_best_model_at_end=True,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    tokenizer=tokenizer,  # I use the tokenizer with added tokens here
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
trainer.push_to_hub(output_model, private=True)
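A side note on the setup above: the sample calls add_tokens() but does not show the model's embedding matrix being resized afterwards. New token ids normally start past the end of the original vocabulary, so the usual companion step is model.resize_token_embeddings(len(tokenizer)). This is not necessarily related to the exception described in this question; it is only a sanity check worth ruling out. The sketch below assumes transformers 4.x with PyTorch installed, and uses a made-up toy vocabulary and a tiny randomly initialised BART config so it runs offline. The names toy_vocab, base_size, num_added and embedding_rows are illustrative only.

```python
import os
import tempfile

from transformers import BartConfig, BartForConditionalGeneration, BertTokenizer

# Toy vocabulary (an assumption for this sketch); a real run would use
# BertTokenizer.from_pretrained(checkpoint) instead.
toy_vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "hello", "world", "##s"]

with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "vocab.txt")
    with open(vocab_path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(toy_vocab) + "\n")
    # The vocabulary is read into memory here, so the temp file can go away.
    tokenizer = BertTokenizer(vocab_file=vocab_path)

base_size = len(tokenizer)
num_added = tokenizer.add_tokens(["Token1", "Token2"])

# Tiny random-weight model standing in for the pretrained checkpoint.
config = BartConfig(
    vocab_size=base_size,
    d_model=16,
    encoder_layers=1,
    decoder_layers=1,
    encoder_attention_heads=2,
    decoder_attention_heads=2,
    encoder_ffn_dim=32,
    decoder_ffn_dim=32,
    max_position_embeddings=32,
)
model = BartForConditionalGeneration(config)

# Grow the embedding matrix so every tokenizer id has a row.
model.resize_token_embeddings(len(tokenizer))
embedding_rows = model.get_input_embeddings().weight.shape[0]
print(base_size, num_added, embedding_rows)
```

In the training script above, the equivalent call would sit right after the model is loaded and before the Seq2SeqTrainer is created.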
Training completed without any problems. But when I use the new model in a pipeline, it frequently fails with this exception: PanicException: AddedVocabulary bad split. Here is the pipeline code:
from transformers import pipeline, BertTokenizer
text = "Words to translate"
hf_model_name = "my_huggingface_username/" + output_model
translator = pipeline("translation", model=hf_model_name, max_length=200)
print(translator(text)[0]['translation_text'].replace(' ', ''))
I cannot find a pattern in when the exception happens, or what causes it. How can I resolve this PanicException problem?
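One thing that may be worth trying, offered as a sketch rather than a confirmed fix: a PanicException is raised from the Rust-backed "fast" tokenizers, and pipeline() loads a fast tokenizer by default when it can, whereas training here used the slow BertTokenizer. Loading the slow tokenizer explicitly and passing it through the pipeline's tokenizer argument keeps inference on the same Python code path as training. The helper name build_translator and its parameters are hypothetical, and the example assumes transformers 4.x. The second half is an offline check, using a made-up toy vocabulary, that a saved slow tokenizer reloads as a slow tokenizer with its added tokens intact.

```python
import os
import tempfile

from transformers import BertTokenizer, pipeline


def build_translator(model_name, max_length=200):
    """Hypothetical helper: build the pipeline around an explicitly slow tokenizer."""
    slow_tokenizer = BertTokenizer.from_pretrained(model_name)
    return pipeline(
        "translation",
        model=model_name,
        tokenizer=slow_tokenizer,  # bypass the default fast-tokenizer lookup
        max_length=max_length,
    )


# Offline round trip with a toy vocabulary (an assumption for this sketch).
toy_vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "hello", "world"]

with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "vocab.txt")
    with open(vocab_path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(toy_vocab) + "\n")

    original = BertTokenizer(vocab_file=vocab_path)
    original.add_tokens(["Token1", "Token2"])

    saved_dir = os.path.join(tmp_dir, "saved")
    original.save_pretrained(saved_dir)

    # Loading through the BertTokenizer class yields the slow implementation.
    reloaded = BertTokenizer.from_pretrained(saved_dir)

reloaded_is_fast = reloaded.is_fast
sizes_match = len(reloaded) == len(original)
print(reloaded_is_fast, sizes_match)
```

With the code from the question, usage would look like translator = build_translator(hf_model_name), followed by the same translator(text) call as before.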