I am working on a binary classification model using an mlr3 pipeline.
The data set is imbalanced and I am using SMOTE to compensate for this.
The predictors are all numeric. Two categorical variables, site and hiv, are not predictors: they are used only for stratification, together with the outcome variable, in the train/test split and the nested cross-validation splits.
The problem I have encountered is that SMOTE cannot append its synthetic rows to the task while these categorical variables are present, even though it doesn’t use them.
The error I get is:
Error: Cannot rbind data to task ‘diagnostic’, missing the following mandatory columns: site, hiv
This happened PipeOp smote’s $train()
Execution halted
Is there some way around this without removing the stratification variables completely, or is this simply a limitation of the current implementation of SMOTE in mlr3?
Selecting only the predictor variables before the SMOTE PipeOp does not help, as the problem seems to emerge once SMOTE tries to add the new data to the existing training data.
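For reference, the setup can be sketched roughly as follows. This is a minimal sketch under assumptions, not the actual pipeline: the data is simulated, the predictor names (x1, x2), the outcome column name and the rpart learner are placeholders, and only the task id and the site and hiv columns are taken from the error message above. Whether it reproduces the error may depend on the installed mlr3 and mlr3pipelines versions.

```r
library(mlr3)
library(mlr3pipelines)  # po("smote") additionally requires the smotefamily package

set.seed(1)
n = 200

# Simulated stand-in data (assumption): two numeric predictors, an imbalanced
# binary outcome, and two categorical columns used only for stratification.
df = data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  site = factor(sample(c("A", "B"), n, replace = TRUE)),
  hiv = factor(sample(c("neg", "pos"), n, replace = TRUE)),
  outcome = factor(sample(c("case", "control"), n, replace = TRUE,
                          prob = c(0.15, 0.85)))
)

task = as_task_classif(df, target = "outcome", id = "diagnostic")

# site and hiv are removed from the features and kept only as strata
task$set_col_roles(c("site", "hiv"), roles = "stratum")
# the outcome is also stratified on, while remaining the target
task$set_col_roles("outcome", add_to = "stratum")

# SMOTE followed by a placeholder learner
graph = po("smote") %>>% lrn("classif.rpart")
learner = as_learner(graph)

# Training is the step where the error is reported; the message is captured
# here instead of halting the script.
res = tryCatch(learner$train(task), error = function(e) conditionMessage(e))
print(res)
```

With the column roles set this way, site and hiv no longer appear in task$feature_names, so SMOTE never sees them, yet they remain part of the task that the synthetic rows have to be bound to.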