I have a random forest model that I’m trying to understand better.
For the sake of the example, let’s say we have a grove of blueberry bushes. What we’re interested in is predicting whether a specific bush produces rotten blueberries, using the harvest records of each individual bush.
Each bush has an identifying name: bush_name, such as ‘bush001’, and we want predictions based on each individual bush. For example, I want to know if bush025 produced a rotten berry on 2/2/22.
The inputs are in a data frame with the following dummy structure:
train_data <- data.frame(
  date = c("2022-01-01", "2022-01-07", "2022-02-09", "2022-05-01", "2022-11-01", "2022-11-02"),
  bush_name = c("bush001", "bush001", "bush001", "bush043", "bush043", "bush043"),
  bugs = c(2, 0, 1, 0, 3, 1),
  # factor outcome, since the model is fit in classification mode
  has_rotten_berry = factor(c(1, 0, 0, 1, 1, 0)),
  berry_count = c(12, 1, 7, 100, 14, 4),
  weather = c(1, 0, 2, 0, 1, 1)
)
I have set up the random forest model roughly as follows:
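library(agua)
library(parsnip)
library(recipes)
library(workflows)
library(h2o)
h2o.init(nthreads = -1)
# Preprocessing steps belong in a recipe; they cannot be piped after fit()
rec <- recipe(has_rotten_berry ~ ., data = train_data) %>%
  step_dummy(bush_name) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())
# Model specification
rf_spec <- rand_forest(mtry = 10, trees = 100) %>%
  set_engine("h2o") %>%
  set_mode("classification")
# Bundle the recipe and the model, then fit both on the training data
model_fit <- workflow() %>%
  add_recipe(rec) %>%
  add_model(rf_spec) %>%
  fit(data = train_data)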
What I want to know is:
When I predict on new data with the trained model, it seems I can only supply test data whose bush_name values were already in the training set. Am I correct in assuming that this model is creating bush-specific predictions, and that I would therefore have to include a new bush’s data in training before I can get a prediction for that bush?
Example: I plant a new bush, bush700, which was not present in the original training data set. If I try to predict on the new bush’s data, I get a message that there are new levels in the data. So I’m assuming that the predictions are bush-specific and that we can’t get predictions for newly added bushes.
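To make the example concrete, here is a minimal base R sketch of the situation. The values in new_data are made up for illustration, and train_bushes simply repeats the bush_name column from the dummy training data; neither comes from my real data set.

```r
# Bush names seen during training (copied from the dummy train_data)
train_bushes <- c("bush001", "bush001", "bush001", "bush043", "bush043", "bush043")

# Hypothetical observation for a bush planted after training (values invented)
new_data <- data.frame(
  date = "2022-12-01",
  bush_name = "bush700",
  bugs = 1,
  berry_count = 9,
  weather = 1
)

# Bush names that appear in new_data but were never seen in training
unseen_bushes <- setdiff(unique(new_data$bush_name), unique(train_bushes))
print(unseen_bushes)

# The call that reports new levels for me is along these lines:
# predict(model_fit, new_data = new_data)
```

Here unseen_bushes contains bush700, which is the level the message complains about.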
Is this correct to assume?
Thank you, and I apologize for the potentially confusing metaphor. I’m also open to any other feedback you might have on the model.