Purpose:
I tried to build a text classification pipeline using PyTorch and the Huggingface transformers library. The plan was to tokenize text data and combine it with numerical features for training a neural network model. I used custom datasets and a data collator to handle the data, and a Trainer class to manage the training loop.
I expected the model to train successfully, with the Trainer handling the loss computation and the training loop seamlessly. However, during training I encountered a TypeError: cross_entropy_loss(): argument 'target' (position 2) must be Tensor, not NoneType. The error indicated that the 'labels' key was missing from the input batch, which was unexpected since the dataset and data collator were supposed to include it.
Objective:
The pipeline aims to:
- Tokenize text data using a pre-trained tokenizer.
- Feed tokenized text and numerical features to a neural network model.
- Train and evaluate the model using the Huggingface Trainer class.
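For a single example, the intended flow is roughly the following (a minimal sketch using the same tokenizer checkpoint and feature values as the reproduction code further down):

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")

# One text becomes fixed-length input_ids / attention_mask tensors of length 128
tokens = tokenizer("This is a sample text", truncation=True, padding='max_length',
                   max_length=128, return_tensors='pt')

# The numerical features for the same example
features = torch.tensor([[0.5, 0.3]], dtype=torch.float32)

print(tokens['input_ids'].shape)       # torch.Size([1, 128])
print(tokens['attention_mask'].shape)  # torch.Size([1, 128])
print(features.shape)                  # torch.Size([1, 2])

The fixed max_length of 128 plus the 2 numerical features is where the 130-dimensional input of the model's first linear layer comes from.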
Implementation:
Dataset:
- A custom dataset class (MinimalTextDataset) is created, which returns tokenized text data along with numerical features and labels.
Data Collator:
- A custom data collator (CustomDataCollator) is used to stack tensors and collate batches of data. It ensures that batches contain input_ids, attention_mask, features, and labels.
Model:
- The model (MinimalModel) is a simple neural network that combines the tokenized text features and numerical features to make predictions.
Trainer:
- A custom trainer (MinimalTrainer) is used to compute the loss and handle the training loop.
Problem:
When running the training pipeline, a TypeError: cross_entropy_loss(): argument 'target' (position 2) must be Tensor, not NoneType is raised. The error occurs in the compute_loss function, where the 'labels' key is expected but not found in the input batch. The error suggests one of the following:
- The dataset might not be returning the expected 'labels' key.
- The data collator might not be collating the batches correctly.
- The data might be modified unintentionally before it reaches the compute_loss function.
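Each of these can be checked in isolation with a few prints. A minimal sketch, assuming the objects from the reproduction code below (tokenizer, train_dataset, CustomDataCollator, and trainer) are already defined:

# 1. Does the dataset itself return a 'labels' key?
print(train_dataset[0].keys())

# 2. Does the collator keep 'labels' when it builds a batch?
collator = CustomDataCollator(tokenizer)
print(collator([train_dataset[0], train_dataset[1]]).keys())

# 3. Which collator does the Trainer actually use to build its batches?
print(type(trainer.data_collator))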
Code:
Here’s the code that reproduces the error:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, TrainingArguments
from transformers import Trainer
# Minimal Dataset
class MinimalTextDataset(Dataset):
    def __init__(self, texts, features, labels, tokenizer):
        self.texts = texts
        self.features = features
        self.labels = labels
        self.tokenizer = tokenizer

    def __getitem__(self, idx):
        text = self.texts[idx]
        features = torch.tensor(self.features[idx], dtype=torch.float32)
        labels = torch.tensor(self.labels[idx], dtype=torch.int64)
        tokens = self.tokenizer(text, truncation=True, padding='max_length', max_length=128, return_tensors='pt')
        tokens = {key: value.squeeze(0) for key, value in tokens.items()}
        tokens['features'] = features
        tokens['labels'] = labels
        return tokens

    def __len__(self):
        return len(self.labels)
# Custom Data Collator
class CustomDataCollator:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(self, batch):
        features = torch.stack([item['features'] for item in batch])
        labels = torch.tensor([item['labels'] for item in batch], dtype=torch.int64)
        input_ids = torch.stack([item['input_ids'] for item in batch])
        attention_mask = torch.stack([item['attention_mask'] for item in batch])
        return {
            'features': features,
            'labels': labels,
            'input_ids': input_ids,
            'attention_mask': attention_mask
        }
# Minimal Model
class MinimalModel(nn.Module):
    def __init__(self):
        super(MinimalModel, self).__init__()
        self.fc1 = nn.Linear(130, 64)
        self.fc2 = nn.Linear(64, 2)

    def forward(self, input_ids, attention_mask, features):
        x = torch.cat((input_ids.float(), features.float()), dim=1)
        x = nn.ReLU()(self.fc1(x))
        x = self.fc2(x)
        return x
# Minimal Trainer
class MinimalTrainer(Trainer):
    def compute_loss(self, model, inputs):
        print(f"Debug Inputs: {inputs.keys()}")
        labels = inputs.get('labels')
        assert labels is not None, "Labels missing from inputs"
        input_ids = inputs['input_ids']
        attention_mask = inputs['attention_mask']
        features = inputs['features']
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, features=features)
        logits = outputs
        loss_fct = nn.CrossEntropyLoss()
        loss = loss_fct(logits, labels)
        return loss
# Prepare data
train_texts = ["This is a sample text", "Another sample text"]
train_labels = [0, 1]
train_features = [[0.5, 0.3], [0.2, 0.7]]
val_texts = ["Validation text"]
val_labels = [1]
val_features = [[0.6, 0.4]]
# Initialize tokenizer, dataset, and dataloader
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
train_dataset = MinimalTextDataset(train_texts, train_features, train_labels, tokenizer)
val_dataset = MinimalTextDataset(val_texts, val_features, val_labels, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=2, collate_fn=CustomDataCollator(tokenizer))
val_loader = DataLoader(val_dataset, batch_size=1, collate_fn=CustomDataCollator(tokenizer))
# Train the model
model = MinimalModel()
trainer = MinimalTrainer(
    model=model,
    args=TrainingArguments(output_dir='./results', per_device_train_batch_size=2, per_device_eval_batch_size=1),
    train_dataset=train_loader.dataset,
    eval_dataset=val_loader.dataset
)
trainer.train()
Error Message:
When I run the above code, the following error appears, and I cannot figure out what is wrong:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[281], line 100
93 model = MinimalModel()
94 trainer = MinimalTrainer(
95 model=model,
96 args=TrainingArguments(output_dir='./results', per_device_train_batch_size=2, per_device_eval_batch_size=1),
97 train_dataset=train_loader.dataset,
98 eval_dataset=val_loader.dataset
99 )
--> 100 trainer.train()
File c:\Users\bsedef\Desktop\simay_hanim_v2.env\lib\site-packages\transformers\trainer.py:1859, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1857 hf_hub_utils.enable_progress_bars()
1858 else:
-> 1859 return inner_training_loop(
1860 args=args,
1861 resume_from_checkpoint=resume_from_checkpoint,
1862 trial=trial,
1863 ignore_keys_for_eval=ignore_keys_for_eval,
1864 )
File c:\Users\bsedef\Desktop\simay_hanim_v2.env\lib\site-packages\transformers\trainer.py:2203, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
2200 self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
2202 with self.accelerator.accumulate(model):
-> 2203 tr_loss_step = self.training_step(model, inputs)
2205 if (
2206 args.logging_nan_inf_filter
2207 and not is_torch_xla_available()
2208 and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
2209 ):
2210 # if loss is nan or inf simply add the average of previous logged losses
2211 tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)
File c:\Users\bsedef\Desktop\simay_hanim_v2.env\lib\site-packages\transformers\trainer.py:3138, in Trainer.training_step(self, model, inputs)
3135 return loss_mb.reduce_mean().detach().to(self.args.device)
3137 with self.compute_loss_context_manager():
-> 3138 loss = self.compute_loss(model, inputs)
3140 if self.args.n_gpu > 1:
3141 loss = loss.mean() # mean() to average on multi-gpu parallel training
Cell In[281], line 72
70 logits = outputs
71 loss_fct = nn.CrossEntropyLoss()
---> 72 loss = loss_fct(logits, labels)
73 return loss
File c:\Users\bsedef\Desktop\simay_hanim_v2.env\lib\site-packages\torch\nn\modules\module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
1530 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1531 else:
-> 1532 return self._call_impl(*args, **kwargs)
File c:\Users\bsedef\Desktop\simay_hanim_v2.env\lib\site-packages\torch\nn\modules\module.py:1541, in Module._call_impl(self, *args, **kwargs)
1536 # If we don't have any hooks, we want to skip the rest of the logic in
1537 # this function, and just call forward.
1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1539 or _global_backward_pre_hooks or _global_backward_hooks
1540 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541 return forward_call(*args, **kwargs)
1543 try:
1544 result = None
File c:\Users\bsedef\Desktop\simay_hanim_v2.env\lib\site-packages\torch\nn\modules\loss.py:1185, in CrossEntropyLoss.forward(self, input, target)
1184 def forward(self, input: Tensor, target: Tensor) -> Tensor:
-> 1185 return F.cross_entropy(input, target, weight=self.weight,
1186 ignore_index=self.ignore_index, reduction=self.reduction,
1187 label_smoothing=self.label_smoothing)
File c:\Users\bsedef\Desktop\simay_hanim_v2.env\lib\site-packages\torch\nn\functional.py:3086, in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction, label_smoothing)
3084 if size_average is not None or reduce is not None:
3085 reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 3086 return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
TypeError: cross_entropy_loss(): argument 'target' (position 2) must be Tensor, not NoneType
What should I fix in this code to make it work? I would appreciate any help.
Despite my efforts to debug by printing the inputs in the compute_loss function, I couldn’t identify the exact root cause of the issue. The labels seemed to be correctly included when testing the dataset and data collator separately, but they were missing during training.
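For reference, the standalone check looked roughly like this (a sketch, not the exact cell I ran):

# Pull one batch from the DataLoader that uses CustomDataCollator and inspect its keys
batch = next(iter(train_loader))
print(batch.keys())  # dict_keys(['features', 'labels', 'input_ids', 'attention_mask'])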