How can I use SMOTE oversampling in an mlr3 pipeline when the task contains categorical variables that are used only for stratification?
I am working on a binary classification model using an mlr3 pipeline.
The data set is imbalanced and I am using SMOTE to compensate for this.
The predictors are all numeric, but the task also contains two categorical variables (site and hiv) that are used only for stratification: together with the outcome variable, they define the strata for the train/test split and for the nested cross-validation splits.
The problem I have encountered is that the SMOTE step fails when it tries to add the synthetic rows to the task while these categorical columns are present, even though SMOTE doesn’t use them.
The error I get is:
Error: Cannot rbind data to task ‘diagnostic’, missing the following mandatory columns: site, hiv
This happened PipeOp smote’s $train()
Execution halted
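For reference, here is a minimal sketch of the kind of setup described above. The toy data, the predictor names x1 and x2, the outcome levels, and the choice of classif.rpart as the learner are all made up for illustration; only the task id and the stratification columns site and hiv come from the error message. The training call is left commented out because that is the step where the error is reported.

```r
library(mlr3)
library(mlr3pipelines)

# Hypothetical toy data: numeric predictors, an imbalanced binary outcome,
# and two categorical columns that are meant for stratification only.
set.seed(1)
n = 200
dat = data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  outcome = factor(sample(c("neg", "pos"), n, replace = TRUE, prob = c(0.85, 0.15))),
  site = factor(sample(c("A", "B"), n, replace = TRUE)),
  hiv = factor(sample(c("no", "yes"), n, replace = TRUE))
)

task = as_task_classif(dat, target = "outcome", id = "diagnostic", positive = "pos")

# Take site and hiv out of the feature set and use them as strata only.
task$set_col_roles(c("site", "hiv"), roles = "stratum")
# Also stratify on the outcome, while keeping it as the target.
task$set_col_roles("outcome", add_to = "stratum")

# Only the numeric predictors remain as features.
print(task$feature_names)
print(task$col_roles$stratum)

# SMOTE followed by a learner (classif.rpart is an arbitrary stand-in).
graph = po("smote") %>>% lrn("classif.rpart")
glrn = as_learner(graph)

# Training needs the smotefamily package; in my setup this is the call
# that produces the rbind error shown above.
# glrn$train(task)
```

The point of the sketch is that site and hiv are still columns of the task (in the stratum role) but are not features, so the rows generated by SMOTE contain no values for them.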