I am running a series of elastic net models with nested cross-validation on my training data, using the nestedcv package in R. I am trying to extract performance metrics to compare models with different feature sets, but I'm unclear on which summary function I should call.
Quoting from the package vignette:
“Use summary() to see full information from the nested model fitting. (…) For comparison, performance metrics from the left-out inner CV test folds can be viewed using innercv_summary(). Performance metrics on the outer training folds can be viewed with train_summary(), provided the argument outer_train_predict was set to TRUE in the original call to either nestcv.glmnet(), nestcv.train() or outercv().”
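For reference, this is how I read those three calls (just a sketch of the API the vignette describes; fit stands for a fitted nestcv.train()/nestcv.glmnet() object like the one below):
summary(fit)          # full information from the nested model fitting
innercv_summary(fit)  # metrics from the left-out inner CV test folds
train_summary(fit)    # metrics on the outer training folds; requires
                      # outer_train_predict = TRUE in the original call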
From this explanation and from the package documentation, I don't understand the difference between calling summary() and train_summary() on my data. I thought summary() returned performance metrics averaged across the outer training folds, but the quote makes it sound like that is what train_summary() gives, and indeed the two outputs differ slightly (see the example below). What metrics does summary() return, then? And which ones do I want?
# Packages:
library(nestedcv)
library(caret)
library(dplyr)

# Select features:
features <- mtcars %>%
  select(cyl, disp, vs, am) %>%
  data.matrix()

# Define outcome column:
outcome <- mtcars %>%
  select(mpg) %>%
  data.matrix()

# Set inner-loop CV parameters:
myControl <- trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 5)

# Define tuning grid:
myGrid <- expand.grid(alpha = seq(0.1, 0.9, length = 10),
                      lambda = seq(0.1, 0.9, length = 10))

# Tune both alpha and lambda:
set.seed(123, "L'Ecuyer-CMRG")  # for reproducibility
fit <- nestcv.train(
  x = features,
  y = outcome[, 1],
  method = "glmnet",
  outer_method = "cv",
  n_outer_folds = 5,
  trControl = myControl,
  tuneGrid = myGrid,
  metric = "RMSE",
  outer_train_predict = TRUE
)
> summary(fit)
Nested cross-validation with caret
Method: glmnet
No filter
Outer loop: 5-fold cv
Inner loop: 5-fold repeatedcv
32 observations, 4 predictors
       alpha lambda n.filter
Fold 1   0.1    0.9        4
Fold 2   0.1    0.9        4
Fold 3   0.1    0.9        4
Fold 4   0.1    0.9        4
Fold 5   0.1    0.9        4
Final parameters:
 alpha lambda
   0.1    0.9
Result:
   RMSE Rsquared    MAE
 3.1044   0.7266 2.4648
> train_summary(fit)
   RMSE Rsquared    MAE
 2.7834   0.7833 2.2428
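For what it's worth, the per-fold predictions that presumably feed these summaries live in the fields below (the same ones I access in my loops further down; just an inspection sketch, not verified against the package internals):
str(fit$outer_result[[1]]$preds)        # outer test-fold predictions (testy, predy)
str(fit$outer_result[[1]]$train_preds)  # outer training-fold predictions (ytrain, predy)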
I have read the package documentation, but I still struggle to understand this, so a noob-friendly explanation would be greatly appreciated. I also tried calculating the metrics manually from the nestcv.train object, but none of my results match either the summary() or the train_summary() output:
> # mean of each fold's best-tune row from outer_result[[i]]$fit$results (first two outer folds only)
> df <- data.frame(matrix(ncol = 8, nrow = 0))
> names(df) <- names(fit$outer_result[[1]]$fit$results)
>
> for(i in 1:2) {
+ new <- fit$outer_result[[i]]$fit$results %>%
+ filter(alpha == fit$outer_result[[i]]$fit$bestTune[[1]],
+ lambda == fit$outer_result[[i]]$fit$bestTune[[2]])
+ df <- bind_rows(df, new)
+ }
>
> print(mean(df$RMSE, na.rm = T), digits = 4)
[1] 3.323
>
>
> # final_fit RMSE
> print(fit$final_fit$results %>%
+ filter(alpha == fit$final_fit$bestTune[[1]],
+ lambda == fit$final_fit$bestTune[[2]]) %>%
+ .$RMSE,
+ digits = 4)
[1] 3.022
>
>
> # performance calculated manually: avg. across outer test folds (first two folds only)
> metrics_all <- c()
>
> for(i in 1:2) {
+ pred <- fit$outer_result[[i]]$preds$predy
+ obs <- fit$outer_result[[i]]$preds$testy
+ rmse <- sqrt(mean((pred - obs)^2, na.rm = T))
+ mae <- mean(abs(pred - obs), na.rm = T)
+ rss <- sum((pred - obs)^2, na.rm = T)
+ tss <- sum((obs - mean(obs))^2, na.rm = T)
+ Rsq <- 1 - rss/tss
+ metrics_all <- rbind(metrics_all, c(rmse, Rsq, mae))
+ }
>
> print(colMeans(metrics_all), digits = 4)
[1] 2.5196 0.8216 2.2331
>
> # performance calculated manually: avg. across outer train folds (first two folds only)
> metrics_all_train <- c()
>
> for(i in 1:2) {
+ pred <- fit$outer_result[[i]]$train_preds$predy
+ obs <- fit$outer_result[[i]]$train_preds$ytrain
+ rmse <- sqrt(mean((pred - obs)^2, na.rm = T))
+ mae <- mean(abs(pred - obs), na.rm = T)
+ rss <- sum((pred - obs)^2, na.rm = T)
+ tss <- sum((obs - mean(obs))^2, na.rm = T)
+ Rsq <- 1 - rss/tss
+ metrics_all_train <- rbind(metrics_all_train, c(rmse, Rsq, mae))
+ }
>
> print(colMeans(metrics_all_train), digits = 4)
[1] 2.952 0.752 2.351
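Finally, in case summary() pools all outer test-fold predictions before computing metrics (rather than averaging per-fold metrics, as I did above), here is a sketch of that calculation as well. This is an unverified assumption on my part; the predy/testy column names are the same ones used in my loops above:
# Pool the outer test-fold predictions across all folds,
# then compute the metrics once on the pooled vectors
# (unverified guess at what summary() might be doing)
pooled <- do.call(rbind, lapply(fit$outer_result, function(f) f$preds))
pred <- pooled$predy
obs  <- pooled$testy
rmse_pooled <- sqrt(mean((pred - obs)^2, na.rm = TRUE))
mae_pooled  <- mean(abs(pred - obs), na.rm = TRUE)
rsq_pooled  <- 1 - sum((pred - obs)^2, na.rm = TRUE) /
                   sum((obs - mean(obs))^2, na.rm = TRUE)
print(c(RMSE = rmse_pooled, Rsquared = rsq_pooled, MAE = mae_pooled), digits = 4)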