I have a dataset with high class imbalance. To create proper training and test sets, I decided to use StratifiedShuffleSplit for the split and a weighted sampler to improve training.
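import numpy as np
import torch
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.utils.class_weight import compute_class_weight
from torch.utils.data import DataLoader, Subset, WeightedRandomSampler
from torchvision.datasets import ImageFolder

dataset = ImageFolder(root='/path/to/directory', transform=transform)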
#stratified split
ss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, train_size=0.5)
# ss.get_n_splits() #arguments - input features, target labels
for train_idx, test_idx in ss.split(np.zeros(len(dataset.targets)), dataset.targets):
    train_dataset = Subset(dataset, train_idx)
    test_dataset = Subset(dataset, test_idx)
#sampler
targets_tensor = torch.tensor(dataset.targets, device = device)
targets_array = targets_tensor.cpu().numpy()
class_weights = compute_class_weight('balanced', classes = np.unique(targets_array), y=targets_array)
class_weights = torch.tensor(class_weights, dtype=torch.float32, device = device)
sample_weights = np.zeros(len(targets_array))
for idx, label in enumerate(targets_array):
    sample_weights[idx] = class_weights[label]
sample_weights = torch.from_numpy(sample_weights).to(device)
sampler = WeightedRandomSampler(weights=sample_weights, num_samples=len(sample_weights), replacement=True)
train_loader = DataLoader(train_dataset, batch_size=32, sampler = sampler)
test_loader = DataLoader(test_dataset, batch_size=32, sampler = sampler)
Now, when I start the training loop, I get this error:
for epoch in range(num_epochs):
3 # running_loss = 0.0
4 i = 0
----> 5 for images,labels in train_loader:
IndexError: index 6057 is out of bounds for axis 0 with size 4544
When I load the entire dataset instead of the subset, it works perfectly.
I think I understand the cause: the loader looks up samples by index in train_dataset, but the indices it receives refer to the parent dataset. What I want is for the sampler to work with the labels of the samples that are actually in the training subset, taken from the parent dataset.
Is there any way I can do this?
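One possible direction, as a minimal sketch rather than a definitive fix: WeightedRandomSampler draws positions in range(len(weights)), so weights built from all of dataset.targets can yield an index such as 6057 even though train_dataset holds only 4544 items. Building the weights from the training subset's labels, in subset order, keeps every sampled position valid. The helper name subset_sample_weights and the toy data below are assumptions made for illustration, and the weights follow the same "balanced" formula as compute_class_weight, n_samples / (n_classes * class_count), computed over the subset only.

```python
from collections import Counter


def subset_sample_weights(targets, subset_indices):
    """Hypothetical helper: one weight per item of the subset, in subset order.

    Uses the 'balanced' heuristic n_samples / (n_classes * class_count),
    where all counts are taken over the subset, not the parent dataset.
    """
    # Look up each subset item's label in the parent dataset's targets.
    subset_labels = [targets[i] for i in subset_indices]
    counts = Counter(subset_labels)
    n_samples = len(subset_labels)
    n_classes = len(counts)
    return [n_samples / (n_classes * counts[label]) for label in subset_labels]


# Toy data (assumed): 6 samples in the parent dataset, 4 in the training subset.
targets = [0, 0, 0, 1, 0, 1]
train_idx = [5, 0, 2, 4]

weights = subset_sample_weights(targets, train_idx)
# len(weights) matches the subset size, so position k in the sampler
# corresponds to train_dataset[k] and never exceeds the subset's bounds.
```

With the code from the question, this would presumably be called as subset_sample_weights(dataset.targets, train_idx), and the result passed to WeightedRandomSampler(weights=..., num_samples=len(train_idx), replacement=True) for train_loader only. The test loader usually does not need a weighted sampler at all, since reweighting the test set changes what the evaluation measures.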