I am trying to train a model for image segmentation using PyTorch Lightning. I created a custom Dataset used by a LightningDataModule. The custom Dataset loads a dataset from HuggingFace that contains images with associated masks (targets), and I mostly use this Dataset to apply the transformations I want for the model.
The issue I have is that when training the model using the Trainer of PyTorch Lightning, it seems that the Dataset cannot apply its transformations, because __getitem__ directly receives batches of the size given to the DataLoader. Indeed, when printing an item, I get a list of PIL images instead of a single PIL image:
[<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=600x784 at 0x782ACE8C77C0>, <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=1972x2877 at 0x782ACE8C6BF0>, <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=602x825 at 0x782ACE8C4FA0>, <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=636x882 at 0x782ACE8C41C0>]
The length of this list is equal to the batch size I give my DataLoader.
My Dataset accesses its elements simply by indexing an array defined as an attribute in __init__. I do not understand how, but the DataLoader seems to be able to collate and create batches from this array (called self.dataset in the custom Dataset) even though it should only be an attribute.
It doesn't seem that the __getitem__ of the Dataset is being used at all, since the transformations defined in that function should convert the PIL images into tensors.
I could probably work around the problem by making my __getitem__ process items as batches, looping over the list instead of processing single items as normal. Another way to fix this would probably be to apply the transformations in the collate_fn, but it feels like bad practice, and I would like to be able to use the DataLoader properly.
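For clarity, this is roughly what I mean by the collate_fn workaround — a toy sketch with dummy numeric pairs standing in for my real (image, mask) items, and a trivial float conversion standing in for the real transforms:

```python
import torch
from torch.utils.data import DataLoader

# Dummy stand-ins for the raw (image, mask) pairs; the real items are PIL images.
raw_pairs = [(i, i % 2) for i in range(8)]

def transform_collate(batch):
    # batch is a list of (input, target) pairs straight from __getitem__;
    # the per-item "transforms" are applied here instead (the workaround
    # I would rather avoid).
    inputs = torch.stack([torch.tensor(x, dtype=torch.float32) for x, _ in batch])
    targets = torch.stack([(torch.tensor(t) > 0).float() for _, t in batch])
    return inputs, targets

loader = DataLoader(raw_pairs, batch_size=4, collate_fn=transform_collate)
inputs, targets = next(iter(loader))
```

It works, but it moves logic that conceptually belongs to the Dataset into the DataLoader, which is why it feels wrong to me.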
The official documentation for Dataset and DataLoader applies the transformation in __getitem__, and can do so because it accesses each element of the Dataset by its index.
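To illustrate, here is a minimal sketch (toy data, illustrative names) of the per-item behaviour I expected: __getitem__ receives a single integer index, applies the transform to one item, and the DataLoader collates afterwards:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Minimal map-style dataset: __getitem__ gets a single int index."""
    def __init__(self, n=8):
        self.data = list(range(n))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        assert isinstance(idx, int)  # one index per call, never a batch
        # per-item "transform": convert the raw item to a tensor
        return torch.tensor(self.data[idx], dtype=torch.float32)

loader = DataLoader(ToyDataset(), batch_size=4)
batch = next(iter(loader))  # collation happens after __getitem__
```

This is the behaviour I want in my SegmentationDataset, but instead __getitem__ seems to receive whole batches.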
This is my implementation of the custom Dataset. I simply put the Dataset in a DataLoader in the train_dataloader function of the LightningDataModule:
from datasets import load_dataset
from torch.utils.data import Dataset, random_split

class SegmentationDataset(Dataset):
    def __init__(self, datacard="skytnt/anime-segmentation", directory="imgs-masks", split="train", transforms=None):
        load_data = load_dataset(datacard, directory, trust_remote_code=True)
        train_set, valid_set, test_set = random_split(load_data["train"], [0.7, 0.2, 0.1])
        self.transforms = transforms
        if split == "train":
            self.dataset = train_set
        if split == "valid":
            self.dataset = valid_set
        if split == "test":
            self.dataset = test_set

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        input, target = self.dataset[idx]["image"], self.dataset[idx]["mask"]
        print(input)       # This gives a list of PIL images instead of a single PIL image
        print(len(input))  # The length is equal to the batch size
        # transforms
        if self.transforms:
            input, target = self.transforms(input), self.transforms(target)
        # binarize masks
        target = (target > 0).float()
        return input, target
Thank you in advance