I trained a random forest model as below:
library(mlbench)
library(caret)

data(Sonar)
x <- Sonar[, 1:60]
y <- Sonar[, 61]

set.seed(1234)
cv_folds <- createFolds(y, k = 10, returnTrain = TRUE)

ctrl <- trainControl(method = "cv",
                     number = 10,
                     search = "grid",
                     classProbs = TRUE,
                     savePredictions = TRUE,
                     index = cv_folds,
                     summaryFunction = twoClassSummary)

tuneGrid <- expand.grid(.mtry = 1:10)

set.seed(123)
rf_model <- train(Class ~ .,
                  data = Sonar,
                  method = "rf",
                  importance = TRUE,
                  metric = "ROC",
                  tuneGrid = tuneGrid,
                  trControl = ctrl,
                  ntree = 1500,
                  nodesize = 5)

rf_model$bestTune  ## mtry = 2

## cross-validation hold-out predictions for the best mtry, ordered by row index
best_pred <- rf_model$pred[which(rf_model$pred$mtry == 2), ]
best_pred <- best_pred[order(best_pred$rowIndex), ]

df <- data.frame(M1 = best_pred$M,
                 R1 = best_pred$R,
                 M2 = rf_model$finalModel$votes[, 1],
                 R2 = rf_model$finalModel$votes[, 2])
> head(df)
M1 R1 M2 R2
X1 0.5280000 0.4720000 0.5257937 0.4742063
X2 0.5666667 0.4333333 0.5408922 0.4591078
X3 0.6033333 0.3966667 0.5660036 0.4339964
X4 0.4473333 0.5526667 0.4807339 0.5192661
X5 0.4986667 0.5013333 0.4819048 0.5180952
X6 0.4246667 0.5753333 0.4332724 0.5667276
So why are df$M1 and df$R1 different from df$M2 and df$R2? I only know that df$M1 and df$R1 are the out-of-fold predictions generated by the 10-fold cross-validation (and shouldn't be used to evaluate the training-set performance), but I am not sure what exactly rf_model$finalModel$votes is. I thought the final model was built with the rf_model$bestTune parameters (i.e. mtry = 2), so (df$M1 and df$M2) and (df$R1 and df$R2) should be the same.
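For context, here is a quick sketch of how I have been inspecting the two objects (it uses only the rf_model fit above; the comments just reflect my current understanding):

## out-of-fold predictions collected by caret during the 10-fold CV
## (one row per held-out sample per mtry value, because savePredictions = TRUE)
head(rf_model$pred)
nrow(rf_model$pred)

## the final model refit on the full Sonar data with the best mtry (2)
rf_model$finalModel
dim(rf_model$finalModel$votes)     # 208 x 2, one row per training sample
head(rf_model$finalModel$votes)    # these are the M2/R2 columns above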
Can I use rf_model$finalModel$votes to calculate an ROC/AUC to further evaluate the training-set performance? Previously I tried:
rf_pred <- predict(rf_model, newdata = Sonar, type = "prob")
df <- data.frame(pred = rf_pred$R, origin = ifelse(Sonar$Class == "R", 1, 0))
pred_obj <- ROCR::prediction(df[["pred"]], df[["origin"]])
auc <- ROCR::performance(pred_obj, measure = "auc")
auc@y.values  # 1
But an AUC of 1 is too perfect to be true, so I am looking for a more reasonable estimate of the training-set performance.
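If using rf_model$finalModel$votes for this is legitimate, I assume the calculation would be the same ROCR workflow as above with the vote fractions swapped in, something like the sketch below (the oob_* names are just placeholders of mine):

## candidate: ROC/AUC from the final model's vote fractions for class "R"
oob_votes <- rf_model$finalModel$votes[, "R"]
oob_df    <- data.frame(pred = oob_votes,
                        origin = ifelse(Sonar$Class == "R", 1, 0))
oob_obj   <- ROCR::prediction(oob_df[["pred"]], oob_df[["origin"]])
oob_auc   <- ROCR::performance(oob_obj, measure = "auc")
oob_auc@y.values

Would that give a more trustworthy number than the AUC of 1 above?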