I am trying to train a Random Forest model in R with this piece of code:
# Load necessary libraries
library(randomForest)
library(readxl)
library(caret)
library(e1071)
# Load the original data frame with labels
df <- read_excel('~/Downloads/tfidf_r.xlsx')
df$label...2 <- as.factor(df$label...2)
# Split the data into training and testing sets
set.seed(42)
train_indices <- createDataPartition(df$review_id, p = 0.7, list = FALSE)
train_data <- df[train_indices, ]
test_data <- df[-train_indices, ]
# Train the Random Forest classifier with 10-fold cross-validation
random_forest_model <- train(label...2 ~ ., data = train_data, method = "rf",
                             trControl = trainControl(method = "cv", number = 10))
# Print the model
print(random_forest_model)
# Make predictions on the test data and evaluate them
y_pred <- predict(random_forest_model, newdata = test_data)
confusionMatrix(y_pred, test_data$label...2)
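(For what it's worth, I know caret tunes mtry on its own when no grid is given. If I wanted to control that myself, I could pass an explicit tuneGrid; a minimal sketch assuming the same train_data, with purely illustrative mtry values:)

# Hypothetical variant: fix the mtry candidates instead of letting caret choose
rf_pinned <- train(label...2 ~ ., data = train_data, method = "rf",
                   tuneGrid = data.frame(mtry = c(2, 50, 100)),
                   trControl = trainControl(method = "cv", number = 10))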
I don't see any reason why this code should overfit, and I haven't tuned any hyperparameters myself. Yet I get this output:
Accuracy : 1
95% CI : (0.9757, 1)
No Information Rate : 0.6067
P-Value [Acc > NIR] : < 2.2e-16
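(If I understand correctly, the No Information Rate is just the majority-class share of the test set, which I could verify with something like this, assuming the split above:)

# The largest class proportion in the test labels should match the NIR (~0.6067)
max(prop.table(table(test_data$label...2)))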
This makes no sense; the model can't really have an accuracy of 1.
It also doesn't seem right that mtry ends up at 5477:
mtry Accuracy Kappa
2 0.6050980 0.0000000
104 0.9341923 0.8576043
5477 0.9942017 0.9877979
Accuracy was used to select the optimal model using
the largest value.
The final value used for the model was mtry = 5477.
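(As far as I know, mtry can be at most the number of predictor columns, so mtry = 5477 would mean the formula label...2 ~ . is using about 5477 columns, review_id included. A quick check I could run, assuming df is loaded as above:)

# Total number of columns in the data frame
ncol(df)
# read_excel() repairs duplicate column names to label...1, label...2, etc.,
# so a second label column may be sitting among the predictors
grep("label", names(df), value = TRUE)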
Why does this happen, and where am I going wrong?