I am trying to train a Random Forest model in R for sentiment analysis.
The model works with a TF-IDF matrix and learns from it how to classify a review as positive or negative.
Positive reviews are labeled 1 and negative ones are labeled 0.
I wrote R code that converts the two labels to factors and then splits the data into training and test sets.
The same logic worked well with naive Bayes, but with Random Forest something seems to be wrong.
I want to know whether the problem is in the logic of my code, because I am new to R and I am not sure I have written it correctly.
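For reference, the data frame has one TF-IDF column per term plus an ID column and the label column. The layout below is only an illustration (the term column names are made up; the real file has far more columns):

# Illustrative layout of the data frame (term_good / term_bad are made-up column names)
df_example <- data.frame(
  review_id = 1:4,
  label...2 = factor(c(1, 0, 1, 0)),    # 1 = positive, 0 = negative
  term_good = c(0.12, 0.00, 0.08, 0.00),
  term_bad  = c(0.00, 0.21, 0.00, 0.15)
)

Here is the code I am using: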
# Load necessary libraries
library(randomForest)
library(readxl)
library(caret)
library(e1071)
# Load the original DataFrame with labels
df <- read_excel('~/Downloads/tfidf_r.xlsx')
df$label...2 <- as.factor(df$label...2)
# Split the data into training and testing sets
set.seed(42)
train_indices <- createDataPartition(df$review_id, p = 0.7, list = FALSE)
train_data <- df[train_indices, ]
test_data <- df[-train_indices, ]
# Initialize and train the Random Forest classifier with 10-fold cross-validation
random_forest_model <- train(label...2 ~ ., data = train_data, method = "rf",
                             trControl = trainControl(method = "cv", number = 10))
# Print the model
print(random_forest_model)
# Make predictions on the test data
y_pred <- predict(random_forest_model, newdata = test_data)
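The accuracy figures below come from comparing these predictions with the test labels, roughly like this (the exact call I used may differ slightly):

# Compare predictions with the held-out labels (caret's confusionMatrix)
confusionMatrix(y_pred, test_data$label...2)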
What makes me suspicious is that I don't see any hyperparameters being set anywhere in this code.
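From what I have read, caret tunes mtry over a small default grid even when nothing is specified explicitly. If I wanted to set the grid myself, I understand it would look roughly like this (an untested sketch; the mtry values are arbitrary):

# Sketch of explicit hyperparameter tuning: try a few mtry values via tuneGrid
tuned_model <- train(label...2 ~ ., data = train_data, method = "rf",
                     tuneGrid = expand.grid(mtry = c(2, 5, 10)),
                     trControl = trainControl(method = "cv", number = 10))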
Here you can see the output:
Accuracy : 1
95% CI : (0.9757, 1)
No Information Rate : 0.6067
P-Value [Acc > NIR] : < 2.2e-16