I am trying to do machine translation from Hindi to Sanskrit using the NLLB model, but I get the warning below and the training does not progress:
UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector. warnings.warn('Was asked to gather along dimension 0, but all '
- The warning appears when training the pretrained NLLB model `facebook/nllb-200-1.3B` (loaded as sketched below).
- The input data is ~40k Hindi sentences.
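A rough sketch of how the model and tokenizer are loaded (not my exact code, but essentially this):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = 'facebook/nllb-200-1.3B'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)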
Detailed warning and log output:
/home//.conda/envs/dict/lib/python3.8/site-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
0%| | 0/4968 [00:10<?, ?it/s]
/home//.conda/envs/dict/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
0%| | 1/9936 [00:03<9:49:55, 3.56s/it]
As you can see above, training does not progress past 0% and just stays there. The terminal hangs and Ctrl-C does not work.
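The traceback path (torch/nn/parallel/_functions.py) suggests the Trainer is wrapping the model in torch.nn.DataParallel because more than one GPU is visible, and the warning is emitted when the per-GPU scalar losses are gathered. A quick sanity check, independent of the model:

import torch

# If this prints more than 1, the Trainer wraps the model in DataParallel
# by default, which is where the "gather along dimension 0" warning comes from.
print(torch.cuda.device_count())

# Pinning the run to a single GPU (e.g. CUDA_VISIBLE_DEVICES=0 in the shell
# before launching the script) would rule DataParallel out.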
The preprocessing code for the data:
def preprocess_function(examples):
    # Source side: sentence + </s> + source-language tag; target side: target-language tag + sentence + </s>
    inputs = [example + ' </s>' + f' <2{s_lang}>' for example in examples[source_lang]]
    targets = [f'<2{t_lang}> ' + example + ' </s>' for example in examples[target_lang]]
    model_inputs = tokenizer.batch_encode_plus(inputs, max_length=max_input_length,
                                               truncation=True, padding='max_length')
    # Tokenise the targets in target-tokenizer mode and store them as labels
    with tokenizer.as_target_tokenizer():
        labels = tokenizer.batch_encode_plus(targets, max_length=max_input_length,
                                             truncation=True, padding='max_length')
    model_inputs['labels'] = labels['input_ids']
    return model_inputs
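The function is applied with a batched map (raw_dataset here is just a placeholder name for the untokenised DatasetDict):

# Batched so the function receives lists of sentences; the text columns are
# dropped, leaving only input_ids, attention_mask and labels.
dataset = raw_dataset.map(preprocess_function, batched=True,
                          remove_columns=[source_lang, target_lang])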
Data after preprocessing and tokenisation:
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 39729
    })
    val: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2210
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2214
    })
})
The training arguments and training code:
training_args = Seq2SeqTrainingArguments(
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    auto_find_batch_size=True,
    output_dir="./output_dir",
    weight_decay=0.01,
    save_total_limit=1,
    num_train_epochs=4,
    predict_with_generate=True,
    fp16=False,
    push_to_hub=False,
    remove_unused_columns=False,
)
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=dataset['train'],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
print("nStarting trainingn")
# torch.cuda.empty_cache()
print(trainer.train())
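For completeness, compute_metrics follows the standard sacrebleu pattern from the Transformers translation examples (a rough sketch, not my exact code):

import numpy as np
import evaluate

metric = evaluate.load('sacrebleu')

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # -100 marks ignored label positions; replace before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = metric.compute(predictions=decoded_preds,
                            references=[[label] for label in decoded_labels])
    return {'bleu': result['score']}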
Any idea why this warning appears and why the training is not progressing?