I want to classify text paragraphs into one of three categories with a RobBERT-v2-dutch-base classification model. The labels are stored as float64 values and are either 0, 1, or 2. I’m getting a ValueError when I call .train() on my Trainer object.
The data is within a DatasetDict, with a “train”, “validation” and “test” dataset. First, I tokenize by calling:
def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)
and mapping it onto the dataset:
data_encoded = data.map(tokenize, batched=True, batch_size=None)
Then, I convert the dataset to a torch format:
data_encoded.set_format("torch", columns=["input_ids", "attention_mask", "label"])
I set up the trainer like so:
from transformers import Trainer, TrainingArguments
batch_size = 16
logging_steps = len(data_encoded["train"]) // batch_size
model_name = f"{model_ckpt}-finetuned-sentiment"
training_args = TrainingArguments(output_dir=model_name,
                                  num_train_epochs=2,
                                  learning_rate=2e-5,
                                  per_device_train_batch_size=batch_size,
                                  per_device_eval_batch_size=batch_size,
                                  weight_decay=0.01,
                                  evaluation_strategy="epoch",
                                  disable_tqdm=False,
                                  logging_steps=logging_steps,
                                  push_to_hub=False,
                                  log_level="error")
trainer = Trainer(model=model, args=training_args,
                  compute_metrics=compute_metrics,
                  train_dataset=data_encoded["train"],
                  eval_dataset=data_encoded["validation"],
                  tokenizer=tokenizer)
trainer.train();
When I run this I get the following error:
ValueError Traceback (most recent call last)
<ipython-input-12-126d3830a6aa> in <cell line: 8>()
6 eval_dataset=data_encoded["validation"],
7 tokenizer=tokenizer)
----> 8 trainer.train();
10 frames
/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py in binary_cross_entropy_with_logits(input, target, weight, size_average, reduce, reduction, pos_weight)
3222
3223 if not (target.size() == input.size()):
-> 3224 raise ValueError(f"Target size ({target.size()}) must be the same as input size ({input.size()})")
3225
3226 return torch.binary_cross_entropy_with_logits(input, target, weight, pos_weight, reduction_enum)
ValueError: Target size (torch.Size([16])) must be the same as input size (torch.Size([16, 3]))
I already looked at the compute_metrics function, but removing it didn’t change the outcome. I’m currently at a loss and don’t know how to troubleshoot this.
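One thing that may be worth checking, given that the labels are float64: when config.problem_type is not set, the Hugging Face sequence-classification heads choose their loss from num_labels and the dtype of the labels tensor. With more than one label and a non-integer dtype, they fall back to multi-label classification and use BCEWithLogitsLoss, which expects a target of the same shape as the logits. That would be consistent with the [16] vs. [16, 3] mismatch in the traceback. The sketch below restates that selection logic in plain Python; the helper names (infer_problem_type, target_shape_ok, LOSS_FOR) are hypothetical and dtypes are represented as strings so it runs without torch. It is a simplified illustration, not the library's actual code.

```python
# Simplified, hypothetical restatement of how a *ForSequenceClassification head
# picks its loss when config.problem_type is left unset.
# Dtypes are plain strings here so the sketch runs without torch installed.

INTEGER_DTYPES = {"int32", "int64"}  # stand-ins for torch.int and torch.long

LOSS_FOR = {
    "regression": "MSELoss",
    "single_label_classification": "CrossEntropyLoss",
    "multi_label_classification": "BCEWithLogitsLoss",
}


def infer_problem_type(num_labels: int, label_dtype: str) -> str:
    """Mirror the dtype-based fallback: float labels with >1 class mean multi-label."""
    if num_labels == 1:
        return "regression"
    if label_dtype in INTEGER_DTYPES:
        return "single_label_classification"
    return "multi_label_classification"


def target_shape_ok(problem_type: str, logits_shape: tuple, target_shape: tuple) -> bool:
    """Check the target shape each loss expects for logits of shape (batch, num_labels)."""
    if problem_type == "multi_label_classification":
        # BCEWithLogitsLoss: target must have exactly the same shape as the logits.
        return tuple(target_shape) == tuple(logits_shape)
    # CrossEntropyLoss takes one class index per example; regression takes one value.
    return tuple(target_shape) == (logits_shape[0],)


if __name__ == "__main__":
    for dtype in ("float64", "int64"):
        kind = infer_problem_type(num_labels=3, label_dtype=dtype)
        ok = target_shape_ok(kind, logits_shape=(16, 3), target_shape=(16,))
        print(dtype, kind, LOSS_FOR[kind], "shapes ok" if ok else "shape mismatch")
```

If this is the cause, casting the label column to an integer type before training (for example with the datasets cast_column method and Value("int64"), assuming the column is named "label") would be one thing to try, so that the model selects CrossEntropyLoss instead.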