I’m working on a machine learning project. My dataset contains variables about social, demographic, and economic aspects of 218 countries, covering 1960 to 2022. The data has very few missing values, most of them related to categorical variables. The target variable is binary (Yes or No) and indicates whether the country had at least one coup d’état attempt in a given year, so I used machine learning models that can handle multilevel data.
I used these models:
library(glmmTMB)
logistic <- glmmTMB(Target ~ Year + V3 + V5 + V11 + V14 + V15 + V16 + V19 + V27 + V28 + V29 +
                      V33 + V34 + V37 + V38 + V41 + V42 + V45 + V46 + V49 + V59 + V61 + V62 +
                      V63 + V67 + V68 + V69 + V72 + V73 + V74 + V75 + V78 + V79 + V81 + V105 +
                      V106 + (1 | Country),
                    data = under, family = "binomial")
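When the logistic model fails to converge, one thing I could try (a minimal sketch, not my actual code) is rescaling the continuous predictors and refitting a reduced version of the model on the scaled data; the column names and the shortened formula below are just an illustration taken from the first few predictors above.

library(glmmTMB)

# Rescale Year and some numeric predictors so the optimizer works on
# comparable magnitudes (Year alone spans 1960-2022); the names here
# are only the first few from the full formula, as an illustration.
under_sc <- under
num_cols <- c("Year", "V3", "V5", "V11")  # ...and the other numeric V* columns
under_sc[num_cols] <- lapply(under_sc[num_cols], function(x) as.numeric(scale(x)))

# Refit a reduced version of the model on the scaled data
logistic_sc <- glmmTMB(Target ~ Year + V3 + V5 + V11 + (1 | Country),
                       data = under_sc, family = "binomial")

# Recent glmmTMB versions ship diagnose(), which reports likely causes
# of non-convergence (extreme coefficients, near-singular random effects)
diagnose(logistic_sc)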
library(rpart)
tree_model <- rpart(Target ~ ., data = under, method = "class")
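For the tree, a quick check I could do (sketch only) is to look at the complexity-parameter table to see whether it splits at all on such imbalanced data, and then prune:

library(rpart)

# Check whether the tree actually splits and how the cross-validated
# error (xerror) changes with the complexity parameter
printcp(tree_model)

# Prune back to the cp with the lowest cross-validated error
best_cp <- tree_model$cptable[which.min(tree_model$cptable[, "xerror"]), "CP"]
tree_pruned <- prune(tree_model, cp = best_cp)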
library(randomForest)
rf_model <- randomForest(Target ~ ., data = under, ntree = 500)
Error in randomForest.default(m, y, ...) :
  Can not handle categorical predictors with more than 53 categories.
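The error seems to come from Country, which has 218 levels, well above randomForest’s 53-category limit for factors. A minimal sketch of what I could try, assuming Country is the only factor over the limit:

library(randomForest)

# Country has 218 levels, which trips randomForest's 53-category limit
# for categorical predictors, so drop it from the formula
rf_model <- randomForest(Target ~ . - Country, data = under, ntree = 500)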
I also tried tuning a random forest with caret:

library(caret)
metric <- "ROC"
control <- trainControl(method = "cv", number = 10, search = "grid",
                        summaryFunction = twoClassSummary, classProbs = TRUE)
tunegrid <- expand.grid(.mtry = c(1:6))
rf_model <- train(Target ~ ., data = under, method = "rf", metric = metric,
                  tuneGrid = tunegrid, ntree = 100, trControl = control)
None of these models is really working: the logistic model doesn’t converge at all, randomForest fails with the error above, and the others don’t perform very well.
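One more thing I’m considering: since each country contributes many yearly rows, plain 10-fold CV can put the same country in both the training and validation folds and inflate the ROC. A sketch using caret’s groupKFold() (the object names rf_grouped and control_grp are just mine for the example):

library(caret)

# Keep all rows of a given country in the same fold so the resampled
# ROC is not inflated by country-level leakage
set.seed(1)
folds <- groupKFold(under$Country, k = 10)

control_grp <- trainControl(method = "cv", index = folds,
                            summaryFunction = twoClassSummary,
                            classProbs = TRUE)

rf_grouped <- train(Target ~ . - Country, data = under, method = "rf",
                    metric = "ROC", tuneGrid = expand.grid(.mtry = 1:6),
                    ntree = 100, trControl = control_grp)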