Purpose:
I tried to build a text classification pipeline using PyTorch and the Huggingface transformers library. The plan was to tokenize text data and combine it with numerical features for training a neural network model. I used custom datasets and a data collator to handle the data, and a Trainer class to manage the training loop.
I expected the model to train successfully, with the Trainer handling the loss computation and the training loop seamlessly. However, during training I encountered a TypeError: cross_entropy_loss(): argument 'target' (position 2) must be Tensor, not NoneType. The error indicated that the 'labels' key was missing from the input batch, which was unexpected since the dataset and data collator were supposed to include it.
Objective:
The pipeline aims to:
- Tokenize text data using a pre-trained tokenizer.
- Feed tokenized text and numerical features to a neural network model.
- Train and evaluate the model using the Huggingface Trainer class.
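For a single example, the intended flow is roughly the following (a minimal sketch using the same tokenizer checkpoint and feature values as the reproduction code further down):

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")

# One text becomes fixed-length input_ids / attention_mask tensors of length 128
tokens = tokenizer("This is a sample text", truncation=True, padding='max_length',
                   max_length=128, return_tensors='pt')

# The numerical features for the same example
features = torch.tensor([[0.5, 0.3]], dtype=torch.float32)

print(tokens['input_ids'].shape)       # torch.Size([1, 128])
print(tokens['attention_mask'].shape)  # torch.Size([1, 128])
print(features.shape)                  # torch.Size([1, 2])

The fixed max_length of 128 plus the 2 numerical features is where the 130-dimensional input of the model's first linear layer comes from.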
Implementation:
Dataset:
- A custom dataset class (MinimalTextDataset) is created, which returns tokenized text data along with numerical features and labels.
Data Collator:
- A custom data collator (CustomDataCollator) is used to stack tensors and collate batches of data. It ensures that batches contain input_ids, attention_mask, features, and labels.
Model:
- The model (MinimalModel) is a simple neural network that combines the tokenized text features and numerical features to make predictions.
Trainer:
- A custom trainer (MinimalTrainer) is used to compute the loss and handle the training loop.
Problem:
When running the training pipeline, a TypeError: cross_entropy_loss(): argument 'target' (position 2) must be Tensor, not NoneType is raised. The error occurs in the compute_loss function, where the 'labels' key is expected but not found in the input batch. The error suggests one of the following:
- The dataset might not be returning the expected 'labels' key.
- The data collator might not be collating the batches correctly.
- The data might be modified unintentionally before it reaches the compute_loss function.
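Each of these can be checked in isolation with a few prints. A minimal sketch, assuming the objects from the reproduction code below (tokenizer, train_dataset, CustomDataCollator, and trainer) are already defined:

# 1. Does the dataset itself return a 'labels' key?
print(train_dataset[0].keys())

# 2. Does the collator keep 'labels' when it builds a batch?
collator = CustomDataCollator(tokenizer)
print(collator([train_dataset[0], train_dataset[1]]).keys())

# 3. Which collator does the Trainer actually use to build its batches?
print(type(trainer.data_collator))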
Code:
Here’s the code that reproduces the error:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, TrainingArguments
from transformers import Trainer
# Minimal Dataset
class MinimalTextDataset(Dataset):
    def __init__(self, texts, features, labels, tokenizer):
        self.texts = texts
        self.features = features
        self.labels = labels
        self.tokenizer = tokenizer

    def __getitem__(self, idx):
        text = self.texts[idx]
        features = torch.tensor(self.features[idx], dtype=torch.float32)
        labels = torch.tensor(self.labels[idx], dtype=torch.int64)
        tokens = self.tokenizer(text, truncation=True, padding='max_length', max_length=128, return_tensors='pt')
        tokens = {key: value.squeeze(0) for key, value in tokens.items()}
        tokens['features'] = features
        tokens['labels'] = labels
        return tokens

    def __len__(self):
        return len(self.labels)
# Custom Data Collator
class CustomDataCollator:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(self, batch):
        features = torch.stack([item['features'] for item in batch])
        labels = torch.tensor([item['labels'] for item in batch], dtype=torch.int64)
        input_ids = torch.stack([item['input_ids'] for item in batch])
        attention_mask = torch.stack([item['attention_mask'] for item in batch])
        return {
            'features': features,
            'labels': labels,
            'input_ids': input_ids,
            'attention_mask': attention_mask
        }
# Minimal Model
class MinimalModel(nn.Module):
    def __init__(self):
        super(MinimalModel, self).__init__()
        self.fc1 = nn.Linear(130, 64)
        self.fc2 = nn.Linear(64, 2)

    def forward(self, input_ids, attention_mask, features):
        x = torch.cat((input_ids.float(), features.float()), dim=1)
        x = nn.ReLU()(self.fc1(x))
        x = self.fc2(x)
        return x
# Minimal Trainer
class MinimalTrainer(Trainer):
    def compute_loss(self, model, inputs):
        print(f"Debug Inputs: {inputs.keys()}")
        labels = inputs.get('labels')
        assert labels is not None, "Labels missing from inputs"
        input_ids = inputs['input_ids']
        attention_mask = inputs['attention_mask']
        features = inputs['features']
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, features=features)
        logits = outputs
        loss_fct = nn.CrossEntropyLoss()
        loss = loss_fct(logits, labels)
        return loss
# Prepare data
train_texts = ["This is a sample text", "Another sample text"]
train_labels = [0, 1]
train_features = [[0.5, 0.3], [0.2, 0.7]]
val_texts = ["Validation text"]
val_labels = [1]
val_features = [[0.6, 0.4]]
# Initialize tokenizer, dataset, and dataloader
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
train_dataset = MinimalTextDataset(train_texts, train_features, train_labels, tokenizer)
val_dataset = MinimalTextDataset(val_texts, val_features, val_labels, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=2, collate_fn=CustomDataCollator(tokenizer))
val_loader = DataLoader(val_dataset, batch_size=1, collate_fn=CustomDataCollator(tokenizer))
# Train the model
model = MinimalModel()
trainer = MinimalTrainer(
    model=model,
    args=TrainingArguments(output_dir='./results', per_device_train_batch_size=2, per_device_eval_batch_size=1),
    train_dataset=train_loader.dataset,
    eval_dataset=val_loader.dataset
)
trainer.train()
Error Message:
When I run the above code, the following error appears, and I cannot figure out what is wrong:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[281], line 100
93 model = MinimalModel()
94 trainer = MinimalTrainer(
95 model=model,
96 args=TrainingArguments(output_dir='./results', per_device_train_batch_size=2, per_device_eval_batch_size=1),
97 train_dataset=train_loader.dataset,
98 eval_dataset=val_loader.dataset
99 )
--> 100 trainer.train()
File c:\Users\bsedef\Desktop\simay_hanim_v2.env\lib\site-packages\transformers\trainer.py:1859, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1857 hf_hub_utils.enable_progress_bars()
1858 else:
-> 1859 return inner_training_loop(
1860 args=args,
1861 resume_from_checkpoint=resume_from_checkpoint,
1862 trial=trial,
1863 ignore_keys_for_eval=ignore_keys_for_eval,
1864 )
File c:\Users\bsedef\Desktop\simay_hanim_v2.env\lib\site-packages\transformers\trainer.py:2203, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
2200 self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
2202 with self.accelerator.accumulate(model):
-> 2203 tr_loss_step = self.training_step(model, inputs)
2205 if (
2206 args.logging_nan_inf_filter
2207 and not is_torch_xla_available()
2208 and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
2209 ):
2210 # if loss is nan or inf simply add the average of previous logged losses
2211 tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)
File c:\Users\bsedef\Desktop\simay_hanim_v2.env\lib\site-packages\transformers\trainer.py:3138, in Trainer.training_step(self, model, inputs)
3135 return loss_mb.reduce_mean().detach().to(self.args.device)
3137 with self.compute_loss_context_manager():
-> 3138 loss = self.compute_loss(model, inputs)
3140 if self.args.n_gpu > 1:
3141 loss = loss.mean() # mean() to average on multi-gpu parallel training
Cell In[281], line 72
70 logits = outputs
71 loss_fct = nn.CrossEntropyLoss()
---> 72 loss = loss_fct(logits, labels)
73 return loss
File c:\Users\bsedef\Desktop\simay_hanim_v2.env\lib\site-packages\torch\nn\modules\module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
1530 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1531 else:
-> 1532 return self._call_impl(*args, **kwargs)
File c:\Users\bsedef\Desktop\simay_hanim_v2.env\lib\site-packages\torch\nn\modules\module.py:1541, in Module._call_impl(self, *args, **kwargs)
1536 # If we don't have any hooks, we want to skip the rest of the logic in
1537 # this function, and just call forward.
1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1539 or _global_backward_pre_hooks or _global_backward_hooks
1540 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541 return forward_call(*args, **kwargs)
1543 try:
1544 result = None
File c:\Users\bsedef\Desktop\simay_hanim_v2.env\lib\site-packages\torch\nn\modules\loss.py:1185, in CrossEntropyLoss.forward(self, input, target)
1184 def forward(self, input: Tensor, target: Tensor) -> Tensor:
-> 1185 return F.cross_entropy(input, target, weight=self.weight,
1186 ignore_index=self.ignore_index, reduction=self.reduction,
1187 label_smoothing=self.label_smoothing)
File c:\Users\bsedef\Desktop\simay_hanim_v2.env\lib\site-packages\torch\nn\functional.py:3086, in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction, label_smoothing)
3084 if size_average is not None or reduce is not None:
3085 reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 3086 return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
TypeError: cross_entropy_loss(): argument 'target' (position 2) must be Tensor, not NoneType
What should I fix in this code to make it work? I would appreciate any help.
Despite my efforts to debug by printing the inputs in the compute_loss function, I couldn’t identify the exact root cause of the issue. The labels seemed to be correctly included when testing the dataset and data collator separately, but they were missing during training.
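For reference, the standalone check looked roughly like this (a sketch, not the exact cell I ran):

# Pull one batch from the DataLoader that uses CustomDataCollator and inspect its keys
batch = next(iter(train_loader))
print(batch.keys())  # dict_keys(['features', 'labels', 'input_ids', 'attention_mask'])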