I have several large GAM models to fit (lots of data, various residual distributions, and random-effect smooths). Currently I use `foreach` with a cluster of CPUs created via the `parallel` package to run each model with the `gam` function on its own node. This is an improvement, but I am wondering whether it is possible to use parallelization within the `bam` function itself while at the same time running each model via `foreach`. Ideally, I would like the results of the model runs to be as close as possible to those of the original `gam` function. Should I use the `cluster` argument to `bam`, or the `nthreads` argument?

Example code using the `cluster` argument would be (my actual code is different than this, but the idea is the same):
```r
library(parallel)
library(doParallel)
library(foreach)
library(mgcv)

nprocs <- 16
cl <- makeCluster(nprocs)
registerDoParallel(cl)

results <- foreach(i = 1:6, .packages = "mgcv") %dopar% {
  gam_model <- bam(n ~ te(lon, lat) + te(year, month) + s(vessel, bs = "re"),
                   data = D, family = "tw", cluster = cl)
  summary(gam_model)
}

parallel::stopCluster(cl)
```
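One variant I have considered (only a sketch; the 2-cores-per-model split is an arbitrary choice of mine) is to create a small cluster inside each `foreach` iteration, since the outer `cl` is a list of socket connections and connections cannot be serialized and shipped to the workers:

```r
library(parallel)
library(doParallel)
library(foreach)
library(mgcv)

cl <- makeCluster(6)   # one outer worker per model
registerDoParallel(cl)

results <- foreach(i = 1:6, .packages = "mgcv") %dopar% {
  # Build the inner cluster on the worker itself, so bam() gets
  # live connections rather than a deserialized (dead) cluster object.
  inner_cl <- parallel::makeCluster(2)
  on.exit(parallel::stopCluster(inner_cl), add = TRUE)
  gam_model <- bam(n ~ te(lon, lat) + te(year, month) + s(vessel, bs = "re"),
                   data = D, family = "tw", cluster = inner_cl)
  summary(gam_model)
}

parallel::stopCluster(cl)
```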
Alternatively, using the `nthreads` argument, I could try something like:
```r
library(parallel)
library(doParallel)
library(foreach)
library(mgcv)

nprocs <- 16
cl <- makeCluster(nprocs)
registerDoParallel(cl)

nthreads <- floor(nprocs / 6)  # split the 16 CPUs among the 6 models

results <- foreach(i = 1:6, .packages = "mgcv") %dopar% {
  gam_model <- bam(n ~ te(lon, lat) + te(year, month) + s(vessel, bs = "re"),
                   data = D, family = "tw", nthreads = nthreads, discrete = TRUE)
  summary(gam_model)
}

parallel::stopCluster(cl)
```
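For testing, I have been using simulated data along these lines (the structure is made up and just matches the variable names in my formulas, so the examples are self-contained):

```r
# Hypothetical stand-in for my real data frame D: a count response `n`
# with spatial, temporal, and vessel random-effect covariates.
set.seed(1)
N <- 5000
D <- data.frame(
  lon    = runif(N, -10, 10),
  lat    = runif(N, 40, 60),
  year   = sample(2000:2020, N, replace = TRUE),
  month  = sample(1:12, N, replace = TRUE),
  vessel = factor(sample(letters[1:20], N, replace = TRUE))
)
D$n <- rpois(N, exp(0.1 * sin(D$lon) + 0.05 * D$month))
```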
Which, if either, of these is the better approach? The fundamental blocker is that I do not understand how `foreach` and `bam` will interact (e.g., how the calculations will be dispatched to the available processors), nor what the differences will be if I use `cluster` versus `nthreads`.
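For reference, this is the kind of check I plan to run to see how close `bam` gets to `gam` on a given model (a sketch with toy simulated data, not my real model):

```r
library(mgcv)

# Toy data: a single smooth covariate and a nonnegative response,
# simulated purely for illustration.
set.seed(1)
D <- data.frame(lon = runif(2000, -10, 10))
D$n <- rpois(2000, exp(0.3 * sin(D$lon)))

# Fit the same model both ways and compare coefficient vectors.
m_gam <- gam(n ~ s(lon), data = D, family = tw())
m_bam <- bam(n ~ s(lon), data = D, family = tw(), discrete = TRUE)
max(abs(coef(m_gam) - coef(m_bam)))  # small value means the fits agree closely
```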