To compute the variable importance of a random forest model with a continuous outcome, I want to compute the SHAP values on training and test sets at each fold of a k-fold cross-validation process and average them. This has been done by Scheda and al. and seems to be easily done in Python. However, there seems to be no easy way to accomplish this in R. Here is a schema of what I’m trying to do, based on Scheda and al.
Here is where I’m at, with a replicable example:
- usual for-loop for k-fold cross-validation
library(iml)
library(caret)
library(tidyverse)
#upload car data
data(cars)
#create the folds
folds.samp = createFolds(cars$Price,k=4,list=F)
#create the for loop prediction on k-fold
rf_preds = list()#empty list for out-of-sample predictions
rf_models= list()#list with the models
#for loop: this gives me trained and validated models (one per fold) on training data and tested on unseen data
for(j in 1:4){
rf = caret::train(Price~.,cars[folds.samp!=j,],
method = "rf",
trControl=trainControl(method="cv"),
ntree=150)#ntrees usually set at 2000 for robust variable importance, but set as 150 for the purpose of the example
rf_models=c(rf_models,list(rf))
rf_preds[[j]]=predict(rf,cars[folds.samp==j,])
}
- I tried to incorporate another for-loop in the k-fold for-loop to extract the training SHAP values of each row at each fold, but this doesn’t work. I know the code is messy, but I just don’t know any other way to do this.
#create the for loop prediction on k-fold
rf_preds = list()#empty list for out-of-sample predictions
rf_models= list()#list with the models
shap_vals_matrix_list=list()
#for loop: this gives me trained and validated models (one per fold) on training data and tested on unseen data
for(j in 1:5){
rf = caret::train(Price~.,cars[folds.samp!=j,],
method = "rf",
trControl=trainControl(method="cv"),
ntree=150)#ntrees usually set at 2000 for robust variable importance, but set as 150 for the purpose of the example
rf_models=c(rf_models,list(rf))
rf_preds[[j]]=predict(rf,cars[folds.samp==j,])
###Shapley value part of the loop###
mod = Predictor$new(rf_models[[j]]$finalModel, data = cars)#create predictor object
# Create an empty matrix to collect the results
shap_vals <- matrix(NA,nrow(cars[folds.samp!=j,]),ncol(cars)-1)
# training data for predictors
X <- cars[folds.samp!=j, -which(names(cars) == "Price")]
# Create a Predictor object
predictor <- Predictor$new(model = rf_models[[j]]$finalModel, data = X, y = cars[folds.samp!=j,"Price"])
#create an inner loop to compute the actual SHAP values
for(i in 1:nrow(cars[folds.samp!=j,])){
shapley <- Shapley$new(mod, x.interest = cars[i,-1])
shap_vals[i,] = abs(shapley$results$phi)[1:(ncol(cars)-1)]
mean_shap_values <- colMeans(shap_vals, na.rm = TRUE)#mean values per row
}
shap_vals_matrix_list=c(shap_vals_matrix_list,list(mean_shap_values))
}
I would like some help to either make the code work or to make it way more simpler!