I want to create a function that takes data instances, labels, and a target proportion.
The function should determine the current proportion of each class in the given dataset/labels and resample the data to the target proportion, using imblearn.over_sampling.SMOTE for classes that must be over-sampled to reach their target proportion and imblearn.under_sampling.RandomUnderSampler for classes that must be under-sampled to reach it.
For example, given:
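from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, n_features=10, n_classes=5,
                           n_informative=4, weights=[0.3, 0.125, 0.239, 0.153, 0.188])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)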
Initially, the proportion is:
class 0: 0.3
class 1: 0.125
class 2: 0.239
class 3: 0.153
class 4: 0.188
And we want to get the following target proportion:
class 0: 0.519
class 1: 0.373
class 2: 0.226
class 3: 0.053
class 4: 0.164
The function should determine when to use SMOTE and when to use RandomUnderSampler.
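To make the over/under decision concrete, here is a minimal sketch of that step on its own, without any resampling. Note that the target proportions above sum to 1.335 rather than 1, so they cannot all be fractions of the same final dataset. The sketch assumes each target is a fraction of the original number of samples; that reading, and the helper name plan_resampling, are assumptions for illustration only.

```python
from collections import Counter


def plan_resampling(y, target_proportion):
    """Split classes into over- and under-sampling targets (absolute counts).

    Assumption: each target proportion is taken relative to the ORIGINAL
    number of samples, so the resampled total may differ from len(y).
    """
    counts = Counter(y)
    total = len(y)
    over, under = {}, {}
    for label, prop in target_proportion.items():
        target_count = round(prop * total)
        if target_count > counts[label]:
            over[label] = target_count    # class needs more samples
        elif target_count < counts[label]:
            under[label] = target_count   # class needs fewer samples
        # equal counts: class is left untouched
    return over, under


# Toy labels with 10 samples: class 0 x5, class 1 x3, class 2 x2
y_toy = [0] * 5 + [1] * 3 + [2] * 2
over, under = plan_resampling(y_toy, {0: 0.3, 1: 0.3, 2: 0.4})
print(over)   # {2: 4}
print(under)  # {0: 3}
```

Working with absolute counts rather than ratios keeps the decision per class independent of how the other classes end up being resampled.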
My code:
from collections import Counter

import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

def resample_to_proportion(X, y, target_proportion):
    # Calculate the current proportion of each class
    class_counts = Counter(y)
    total_samples = len(y)
    current_proportion = {label: count / total_samples for label, count in class_counts.items()}

    # Initialize resampling strategies
    resampling_strategies = {}
    for label, target_prop in target_proportion.items():
        if target_prop > current_proportion[label]:
            resampling_strategies[label] = SMOTE(sampling_strategy=target_prop)
        elif target_prop < current_proportion[label]:
            resampling_strategies[label] = RandomUnderSampler(sampling_strategy=target_prop)

    # Resample each class based on the difference between current and target proportion
    X_resampled = []
    y_resampled = []
    for label, strategy in resampling_strategies.items():
        mask = y == label
        X_class = X[mask]
        y_class = y[mask]
        X_resampled_class, y_resampled_class = strategy.fit_resample(X_class, y_class)
        X_resampled.append(X_resampled_class)
        y_resampled.append(y_resampled_class)

    # Concatenate resampled data
    X_resampled = np.concatenate(X_resampled)
    y_resampled = np.concatenate(y_resampled)
    return X_resampled, y_resampled
But calling it raises an error:
target_proportion = {0: 0.519, 1: 0.373, 2: 0.226, 3: 0.053, 4: 0.164}
X_resampled, y_resampled = resample_to_proportion(X_train, y_train, target_proportion)
ValueError: The target 'y' needs to have more than 1 class. Got 1 class instead