I am trying to build a decision tree model predicting an outcome variable (named: Results) from predictor variables. I have applied one-hot encoding to some of the variables with more than two levels to expand the number of predictors a bit [My data].
I first explored the data, then split it into an 80/20 train/test split and ran the model, but the model fit on the training data ends up with only a single node and no branches. Looking at similar posts, I suspected that my data is imbalanced: checking prop.table of the class assignments of the Results variable shows a clear majority of negatives over positives. Any suggestions for growing a correct tree on this data?
Here is my code:
# Splitting the data into train (80%) and test (20%) sets
set.seed(1234)
pd <- sample(2, nrow(data_hum_mod), replace = TRUE, prob = c(0.8, 0.2))
data_hum_train <- data_hum_mod[pd == 1, ]
data_hum_test  <- data_hum_mod[pd == 2, ]
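As a side note, sample() with prob = c(0.8, 0.2) gives only an approximate 80/20 split and does not stratify by class. A minimal alternative sketch, assuming the caret package is available, that keeps the class proportions of Results similar in train and test:

library(caret)
set.seed(1234)
# createDataPartition samples within each level of Results,
# so train and test keep roughly the same class proportions
idx <- createDataPartition(data_hum_mod$Results, p = 0.8, list = FALSE)
data_hum_train <- data_hum_mod[idx, ]
data_hum_test  <- data_hum_mod[-idx, ]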
# Data exploration after splitting
# Check the data dimensions
dim(data_hum_train); dim(data_hum_test)
# Make sure the split data have similar proportions of each outcome class (i.e. positive/negative toxo)
prop.table(table(data_hum_train$Results)) * 100
prop.table(table(data_hum_test$Results)) * 100
# Check for missing values
anyNA(data_hum_mod)
# Make sure none of the variables have zero or near-zero variance (nzv() comes from the caret package)
library(caret)
nzv(data_hum_mod)
# Building the model (using the party package)
install.packages('party')
library(party)
data_human_train_tree <- ctree(Results ~ ., data = data_hum_train,
                               controls = ctree_control(mincriterion = 0.1))
data_human_train_tree
plot(data_human_train_tree)
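Before attributing the single node entirely to imbalance, it may be worth relaxing the minimum node sizes as well, since ctree also refuses splits that would produce nodes smaller than its defaults (minsplit = 20, minbucket = 7). A sketch with illustrative, untuned values:

# Lower minsplit/minbucket so ctree may split smaller nodes (values are illustrative)
tree_relaxed <- ctree(Results ~ ., data = data_hum_train,
                      controls = ctree_control(mincriterion = 0.1,
                                               minsplit = 10,
                                               minbucket = 5))
plot(tree_relaxed)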
With this code I obtained a plot showing a single root node with no branches.
I got the same result using other packages such as C50 and rpart.
Could you advise on this? I have also read about subsampling the majority class (here, the negative Results); how can one implement this in R?
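For reference, a minimal base-R downsampling sketch; the level names "negative" and "positive" are assumptions based on the description above (caret::downSample does the same in one call):

set.seed(1234)
neg <- data_hum_train[data_hum_train$Results == "negative", ]  # assumed majority label
pos <- data_hum_train[data_hum_train$Results == "positive", ]  # assumed minority label
# Keep a random subset of negatives equal in size to the positives
neg_down <- neg[sample(nrow(neg), nrow(pos)), ]
data_hum_train_bal <- rbind(neg_down, pos)
prop.table(table(data_hum_train_bal$Results)) * 100  # should now be roughly 50/50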
Thanks