I’ve started learning about ML recently so I’m not really good at it.
My ML project is centered around a df that contains demographic, social, and other indicators about every country in the world. My target variable is binary, Yes or No.
Since I’m considering every year from 1960 to 2022 for each country, there are some missing values.
I’ve trained some ML classification models, but I’m not able to predict on the test dataset.
Given that my train and test df are stratified by Country, like this:
test <- stratified(data5_scaled, "Country", size = 0.4)
train <- anti_join(data5_scaled, test)
For example, I’ve trained a random forest like this:
ctrl_rf <- trainControl(method="cv", number=10, search="grid", summaryFunction = twoClassSummary, classProbs = TRUE)
tunegrid_rf <- expand.grid(.mtry=c(1:6))
rf <- train(Target~., data=train, method="rf", metric="ROC", tuneGrid=tunegrid_rf, ntree=500,
trControl=ctrl_rf,na.action=na.exclude)
and when I try to predict on the test set
test$rf <- predict(rf, test, "prob")[, 1]
I get this error:
Error in set(x, j = name, value = value) :
Supplied 199 items to be assigned to 5425 items of column 'rf'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.
The test df has exactly 199 complete rows out of 5425, so predict() seems to return predictions only for the rows without missing values.
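To make the mismatch concrete, here is a minimal sketch on made-up data. Everything in it is an assumption for illustration: the names toy, toy_train, toy_test, fit, p and ok are hypothetical, and a base R glm stands in for my caret random forest. Passing na.action = na.omit to predict() is meant to mimic what I believe caret's predict does by default, namely dropping rows with missing predictors. The last lines show the kind of result I would be happy with: predictions for the complete rows and NA for the rest.

```r
set.seed(42)

# Made-up data standing in for data5_scaled (hypothetical columns x1, x2)
n <- 40
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$Target <- factor(ifelse(toy$x1 + rnorm(n) > 0, "Yes", "No"))

toy_train <- toy[1:30, ]
toy_test  <- toy[31:40, ]
toy_test$x2[c(2, 5, 9)] <- NA   # three test rows with a missing predictor

# Stand-in model: logistic regression instead of the caret random forest
fit <- glm(Target ~ x1 + x2, data = toy_train, family = binomial)

# na.omit drops the incomplete rows, so fewer predictions come back
p <- predict(fit, newdata = toy_test, type = "response", na.action = na.omit)
length(p)        # 7 predictions
nrow(toy_test)   # 10 rows, hence the length mismatch on assignment

# Possible workaround: fill only the rows whose predictors are complete
ok <- complete.cases(toy_test[, c("x1", "x2")])
toy_test$pred <- NA_real_
toy_test$pred[ok] <- p
```

In my real data the complete.cases() check would have to cover exactly the predictor columns the model uses, which I have not verified yet.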
I know I could always impute the missing values, but I don’t think that’s a good idea given the type of data, and I’ve already imputed as much as I reasonably could.