I have trained a glmnet model with the mlr3 package, using data that include a categorical variable as a predictor.
I am trying to make predictions using the glmnet model on new data.
The categorical variable refers to habitat types, and the response variable corresponds to species presence/absence data.
The new dataset contains habitat types that are not included in the training data.
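The unseen levels can be listed with a quick check like the sketch below. It uses small made-up data frames (`train_toy`, `new_toy`, and the habitat names are hypothetical, not my real data); only the column name `habitat_2` comes from my data.

```r
# Toy stand-ins for the real data (hypothetical values)
train_toy <- data.frame(habitat_2 = factor(c("forest", "grassland", "forest")))
new_toy   <- data.frame(habitat_2 = c("forest", "tundra", "grassland", "bog"))

# Habitat types present in the new data but absent from the training data
unseen_levels <- setdiff(unique(new_toy$habitat_2), levels(train_toy$habitat_2))
print(unseen_levels)
```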
However, I don’t understand how the model can return predicted probabilities > 0.5 for habitat types that occur in the new dataset (e.g., habitats in the northernmost part of the study area) but were not present in the training data.
I would like the predictions for habitat types that appear in the new dataset but not in the training dataset to be set to 0. I thought I could achieve this with:
mlr3pipelines::po("imputeconstant", param_vals = list(constant = 0, affect_columns = mlr3pipelines::selector_grep("habitat_2")))
but it doesn’t seem to work.
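To make the outcome I am after concrete, here is a minimal base R sketch with made-up values (the data frame `new_toy`, the vector `train_levels`, and the probabilities are all hypothetical). It overwrites the predicted probability with 0 wherever the habitat type was not seen in training, which is the result I was hoping the pipeline step would produce.

```r
# Habitat types seen during training (hypothetical)
train_levels <- c("forest", "grassland")

# Hypothetical new data with predicted presence probabilities
new_toy <- data.frame(
  habitat_2   = c("forest", "tundra", "grassland", "bog"),
  predictions = c(0.80, 0.65, 0.30, 0.55)
)

# Flag rows whose habitat type was not in the training data
is_unseen <- !(new_toy$habitat_2 %in% train_levels)

# Desired result: probability forced to 0 for unseen habitat types
new_toy$predictions[is_unseen] <- 0
print(new_toy)
```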
Here is my code:
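library(mlr3pipelines)    # provides %>>% and the selector_*() helpers
library(mlr3learners)     # registers the "classif.glmnet" learner
library(mlr3spatiotempcv) # registers the "spcv_coords" resampling
set.seed(1, kind = "Mersenne-Twister", normal.kind = "Inversion")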
data <- read.csv("C:/Users/Downloads/data_species_1.csv")
new_data <- read.csv("C:/Users/Downloads/new_data_species_1.csv")
data$presence <- as.factor(data$presence)
data$habitat_2 <- as.factor(data$habitat_2)
table(data$habitat_2)
classif_task_sp <- mlr3spatial::as_task_classif_st(
  id = "A1",
  x = data[, !(names(data) %in% c("ID", "year"))],
  target = "presence", positive = "1",
  coordinate_names = c("x", "y"), crs = "EPSG:4326", coords_as_features = FALSE
)
classif_task_sp$set_col_roles("presence", roles = c("target", "stratum"))
partition_classif_task_sp <- mlr3::partition(classif_task_sp, ratio = 0.67)
factor_encoding <- mlr3pipelines::po("fixfactors", id = "po_factor_alignment") %>>%
  ## mlr3pipelines::po("imputeconstant", param_vals = list(constant = 0, affect_columns = selector_grep("level_2_ecoregion_type"))) %>>%
  ## mlr3pipelines::po("imputeoor", affect_columns = selector_type("factor"), id = "po_factor_imputation") %>>%
  ## mlr3pipelines::po("imputesample", affect_columns = selector_type(c("ordered", "factor"))) %>>%
  mlr3pipelines::po("encodeimpact", affect_columns = selector_cardinality_greater_than(10), id = "high_cardinality_encoding") %>>%
  mlr3pipelines::po("encode", method = "one-hot", affect_columns = selector_cardinality_greater_than(2), id = "low_cardinality_encoding") %>>%
  mlr3pipelines::po("encode", method = "treatment", affect_columns = selector_type("factor"), id = "binary_encoding") %>>%
  mlr3pipelines::po("imputeconstant", param_vals = list(constant = 0, affect_columns = selector_grep("habitat_2")))
## print(factor_encoding)
learner_glmnet <- mlr3tuningspaces::lts(mlr3::lrn("classif.glmnet", predict_type = "prob", standardize = FALSE))
learner_glmnet_factor_encoding <- mlr3::as_learner(factor_encoding %>>% learner_glmnet)
tuning <- mlr3tuning::auto_tuner(
  tuner = mlr3tuning::tnr("grid_search", resolution = 5, batch_size = 10),
  learner = learner_glmnet_factor_encoding,
  resampling = mlr3::rsmp("spcv_coords", folds = 2),
  measure = mlr3::msr("classif.prauc"),
  terminator = mlr3tuning::trm("evals", n_evals = 2, k = 0)
)
run_training <- tuning$train(classif_task_sp, row_ids = partition_classif_task_sp$train)
predictions <- run_training$predict_newdata(newdata = new_data)
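## probability of the positive class ("1", i.e. presence)
new_data$predictions <- predictions$prob[, "1"]
## columns 1:2 (x, y) are the coordinates; the remaining column becomes a raster layer
test <- tidyterra::as_spatraster(tibble::as_tibble(new_data[, c("x", "y", "predictions")]), xycols = 1:2, crs = "EPSG:4326")
terra::plot(test$predictions)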
Here are the datasets “data_species_1.csv” and “new_data_species_1.csv”:
https://www.dropbox.com/scl/fo/7psn2wgr955tjvb8d6nm9/AFPgbkMhoo9Glb2JX1MttBs?rlkey=cp6u29ivmz9ad1u1090xv5nfx&st=uqvabdg5&dl=0