I have a large number of RDS files that I need to open, modify, and save in place. One iteration takes ~1.8s. I have about 40K files to modify, so I attempted to run in parallel. Using 28 processors, it seems like it should take less than an hour to complete, but instead it is taking 4-5x that long. What can be done to fix this? Each file is read and written by exactly one worker, so there should not be any locking going on. I tried chunking the work into blocks of 100 files, but that doesn't help either. I would expect some overhead from the parallel computation, but this seems way out of line to me.
Here is some sample code:
library(parallel)
library(pbapply)

f <- function(x) {
  y <- readRDS(x)
  # modify something in y
  saveRDS(y, x)
}

files <- list.files("C:/my-dir", full.names = TRUE)

cl <- makeCluster(28)
result <- pbsapply(files, f, cl = cl)
stopCluster(cl)
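The chunked variant I tried looked roughly like this (a sketch, not the exact code that ran; the chunk size of 100 and the `split()`-based chunking are how I grouped the files):

```r
library(parallel)
library(pbapply)

f <- function(x) {
  y <- readRDS(x)
  # modify something in y
  saveRDS(y, x)
}

files <- list.files("C:/my-dir", full.names = TRUE)

# split the file paths into blocks of 100; each worker
# processes one whole block per task instead of one file
chunks <- split(files, ceiling(seq_along(files) / 100))

cl <- makeCluster(28)
result <- pbsapply(chunks, function(chunk) lapply(chunk, f), cl = cl)
stopCluster(cl)
```

The idea was to cut per-task scheduling overhead by sending 100 paths at a time, but the total runtime was about the same.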