I’m working on an ML project: I want to train some classification models and then make predictions on the test data frame.
How do I train a model that can use every available observation, regardless of whether it has missing values? And how do I then make predictions?
Most of the observations from my df have at least one missing value.
To train my models I’m using the caret library.
For example, given this model:
control <- trainControl(method="cv", number=10, search="grid", summaryFunction = twoClassSummary, classProbs = TRUE)
tunegrid <- expand.grid(.mtry=c(1:6))
rf <- train(Target~., data=train, method="rf", metric="ROC", tuneGrid=tunegrid, ntree=100,trControl=control)
and then I make predictions this way:
test$pred<-predict(rf,test,'prob')[,2]
While training, I’ve already tried these na.action options: na.omit; na.exclude; na.pass.
The first two work fine, but if I use na.pass I get this error:
Something is wrong; all the ROC metric values are missing:
ROC Sens Spec
Min. : NA Min. : NA Min. : NA
1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
Median : NA Median : NA Median : NA
Mean :NaN Mean :NaN Mean :NaN
3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
Max. : NA Max. : NA Max. : NA
NA's :6 NA's :6 NA's :6
Error: Stopping
warnings()
1: model fit failed for Fold01: mtry=1 Error in randomForest.default(x, y, mtry = param$mtry, ...) :
NA not permitted in predictors
And if I use one of the first two, I get errors similar to this one when I make predictions:
Error in set(x, j = name, value = value) :
Supplied 199 items to be assigned to 5425 item
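The mismatch between 199 and 5425 in this error would be consistent with incomplete rows being dropped at prediction time, so that fewer predictions come back than there are rows in the test set (as far as I can tell, caret's predict method for train objects has an na.action argument that defaults to na.omit). Below is a minimal sketch of that situation using base R only. The data frame `toy`, its columns, and the vector `preds` are hypothetical stand-ins, not the original `test` data or real model output.

```r
# Toy stand-in for the test set; `toy` and its columns are hypothetical.
toy <- data.frame(
  x1 = c(1.0, NA, 3.0, 4.0, 5.0),
  x2 = c(2.0, 5.0, NA, 8.0, 1.0)
)

# Rows that na.omit would keep (no missing predictor values).
ok <- complete.cases(toy)

# Placeholder for model output: one value per complete row only.
# In the real workflow this would be the vector returned by predict().
preds <- seq(0.1, by = 0.1, length.out = sum(ok))

# The lengths differ, which is why assigning preds as a whole column fails.
same_length <- length(preds) == nrow(toy)

# Fill the column with NA first, then assign only to the complete rows.
toy$pred <- NA_real_
toy$pred[ok] <- preds
```

This only avoids the length mismatch. Rows with missing values still receive no prediction, so it does not address the question of how to train on and predict for every observation.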