I am running a series of elastic net models with nested cross-validation on my training data, using the nestedcv package in R. I am trying to extract performance metrics to compare models with different feature sets, but I'm unclear on which summary function I should call.
Quoting from the package vignette:
“Use summary() to see full information from the nested model fitting. (…) For comparison, performance metrics from the left-out inner CV test folds can be viewed using innercv_summary(). Performance metrics on the outer training folds can be viewed with train_summary(), provided the argument outer_train_predict was set to TRUE in the original call to either nestcv.glmnet(), nestcv.train() or outercv().”
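For reference, this is how I read those three calls (just a sketch of the API the vignette describes; fit stands for a fitted nestcv.train()/nestcv.glmnet() object like the one below):
summary(fit)          # full information from the nested model fitting
innercv_summary(fit)  # metrics from the left-out inner CV test folds
train_summary(fit)    # metrics on the outer training folds; requires
                      # outer_train_predict = TRUE in the original call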
From this explanation and from the package documentation, I don't understand the difference between calling summary() and train_summary() on my data. I thought summary() returned performance metrics averaged across the outer training folds, but the quote makes it sound like that is what train_summary() gives, and indeed the two outputs differ slightly (see the example below). What metrics does summary() return, then? And which ones do I want?
# Packages:
library(nestedcv)
library(caret)
library(dplyr)

# Select features:
features <- mtcars %>%
  select(cyl, disp, vs, am) %>%
  data.matrix()

# Define outcome column:
outcome <- mtcars %>%
  select(mpg) %>%
  data.matrix()

# Set inner-loop CV parameters:
myControl <- trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 5)

# Define tuning grid:
myGrid <- expand.grid(alpha = seq(0.1, 0.9, length = 10),
                      lambda = seq(0.1, 0.9, length = 10))

# Tune both alpha and lambda:
set.seed(123, "L'Ecuyer-CMRG")  # for reproducibility
fit <- nestcv.train(
  x = features,
  y = outcome[, 1],
  method = "glmnet",
  outer_method = "cv",
  n_outer_folds = 5,
  trControl = myControl,
  tuneGrid = myGrid,
  metric = "RMSE",
  outer_train_predict = TRUE
)
> summary(fit)
Nested cross-validation with caret
Method: glmnet
No filter
Outer loop: 5-fold cv
Inner loop: 5-fold repeatedcv
32 observations, 4 predictors
       alpha lambda n.filter
Fold 1   0.1    0.9        4
Fold 2   0.1    0.9        4
Fold 3   0.1    0.9        4
Fold 4   0.1    0.9        4
Fold 5   0.1    0.9        4
Final parameters:
 alpha lambda
   0.1    0.9
Result:
   RMSE Rsquared    MAE
 3.1044   0.7266 2.4648
> train_summary(fit)
   RMSE Rsquared    MAE
 2.7834   0.7833 2.2428
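For what it's worth, the per-fold predictions that presumably feed these summaries live in the fields below (the same ones I access in my loops further down; just an inspection sketch, not verified against the package internals):
str(fit$outer_result[[1]]$preds)        # outer test-fold predictions (testy, predy)
str(fit$outer_result[[1]]$train_preds)  # outer training-fold predictions (ytrain, predy)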
I have read the package documentation, but I still struggle to understand this, so a noob-friendly explanation would be greatly appreciated. I also tried calculating the metrics manually from the nestcv.train object, but none of my results match either the summary() or the train_summary() output:
> # mean of each fold's best-tune row from outer_result[[i]]$fit$results (first two outer folds only)
> df <- data.frame(matrix(ncol = 8, nrow = 0))
> names(df) <- names(fit$outer_result[[1]]$fit$results)
>
> for(i in 1:2) {
+ new <- fit$outer_result[[i]]$fit$results %>%
+ filter(alpha == fit$outer_result[[i]]$fit$bestTune[[1]],
+ lambda == fit$outer_result[[i]]$fit$bestTune[[2]])
+ df <- bind_rows(df, new)
+ }
>
> print(mean(df$RMSE, na.rm = T), digits = 4)
[1] 3.323
>
>
> # final_fit RMSE
> print(fit$final_fit$results %>%
+ filter(alpha == fit$final_fit$bestTune[[1]],
+ lambda == fit$final_fit$bestTune[[2]]) %>%
+ .$RMSE,
+ digits = 4)
[1] 3.022
>
>
> # performance calculated manually: avg. across outer test folds (first two folds only)
> metrics_all <- c()
>
> for(i in 1:2) {
+ pred <- fit$outer_result[[i]]$preds$predy
+ obs <- fit$outer_result[[i]]$preds$testy
+ rmse <- sqrt(mean((pred - obs)^2, na.rm = T))
+ mae <- mean(abs(pred - obs), na.rm = T)
+ rss <- sum((pred - obs)^2, na.rm = T)
+ tss <- sum((obs - mean(obs))^2, na.rm = T)
+ Rsq <- 1 - rss/tss
+ metrics_all <- rbind(metrics_all, c(rmse, Rsq, mae))
+ }
>
> print(colMeans(metrics_all), digits = 4)
[1] 2.5196 0.8216 2.2331
>
> # performance calculated manually: avg. across outer train folds (first two folds only)
> metrics_all_train <- c()
>
> for(i in 1:2) {
+ pred <- fit$outer_result[[i]]$train_preds$predy
+ obs <- fit$outer_result[[i]]$train_preds$ytrain
+ rmse <- sqrt(mean((pred - obs)^2, na.rm = T))
+ mae <- mean(abs(pred - obs), na.rm = T)
+ rss <- sum((pred - obs)^2, na.rm = T)
+ tss <- sum((obs - mean(obs))^2, na.rm = T)
+ Rsq <- 1 - rss/tss
+ metrics_all_train <- rbind(metrics_all_train, c(rmse, Rsq, mae))
+ }
>
> print(colMeans(metrics_all_train), digits = 4)
[1] 2.952 0.752 2.351
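Finally, in case summary() pools all outer test-fold predictions before computing metrics (rather than averaging per-fold metrics, as I did above), here is a sketch of that calculation as well. This is an unverified assumption on my part; the predy/testy column names are the same ones used in my loops above:
# Pool the outer test-fold predictions across all folds,
# then compute the metrics once on the pooled vectors
# (unverified guess at what summary() might be doing)
pooled <- do.call(rbind, lapply(fit$outer_result, function(f) f$preds))
pred <- pooled$predy
obs  <- pooled$testy
rmse_pooled <- sqrt(mean((pred - obs)^2, na.rm = TRUE))
mae_pooled  <- mean(abs(pred - obs), na.rm = TRUE)
rsq_pooled  <- 1 - sum((pred - obs)^2, na.rm = TRUE) /
                   sum((obs - mean(obs))^2, na.rm = TRUE)
print(c(RMSE = rmse_pooled, Rsquared = rsq_pooled, MAE = mae_pooled), digits = 4)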