I need to standardize the columns of several large matrices in R (roughly 300,000 rows by 10,000-20,000 columns), i.e. subtract each column’s mean and divide by its standard deviation.
The process has been very slow, so I tried to speed it up with foreach/doParallel, but that doesn’t seem to be helping.
Some example code with a smaller matrix:
library('doParallel')
n.cores <- 7
clust <- makeCluster(n.cores, type = 'FORK')
registerDoParallel(cl = clust)
rows <- 300000
cols <- 1500
big_matrix <- matrix(runif(rows * cols), nrow = rows, ncol = cols,
                     dimnames = list(paste0('r', 1:rows), paste0('c', 1:cols)))
# Attempt 1: No parallelization. Took 52.3 seconds.
system.time( stand1 <- scale(big_matrix, center = TRUE, scale = TRUE) )
# Attempt 2: Each column separately with foreach. Took 168.9 seconds.
system.time(
  stand2 <- foreach(mchunk = big_matrix, .combine = 'cbind') %dopar% {
    (mchunk - mean(mchunk)) / sd(mchunk)
  }
)
# Attempt 3: In chunks with foreach. Took 52.7 seconds.
stand_chunks <- function(mat, n.cores) {
  # A list of column indices, splitting the matrix into n.cores chunks
  mchunks <- split(1:ncol(mat), cut(1:ncol(mat), n.cores))
  stand_mat <- foreach(mc = mchunks, .combine = 'cbind') %dopar% {
    scale(mat[, mc], center = TRUE, scale = TRUE)
  }
  return(stand_mat)
}
system.time( stand3 <- stand_chunks(big_matrix, n.cores) )
parallel::stopCluster(cl=clust)
I can understand why Attempt 2 was slow: the overhead of dispatching each column to a worker as its own task, just to perform one quick little operation, outweighs any benefit from the parallelization.
I don’t understand why Attempt 3 wasn’t any faster than Attempt 1. Attempt 3 is also very memory-inefficient: it has the issue described here as “giving workers more data than they need”. If I watch the worker processes with the “top” command, I can see each one’s memory usage climbing as the entire big matrix is copied to it. A 300,000 × 5,000 matrix of doubles is already roughly 12 GB, so by the time I increase the matrix to 5,000 columns this copying is wasteful enough that the job runs out of memory on an HPC compute node with 80 GB of RAM.
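For what it’s worth, here is the kind of chunked iteration I would have expected to avoid the copying: an iterator that hands each worker only its own block of columns. I’m assuming itertools::isplitCols works that way, and I haven’t verified that it actually avoids the copies I’m seeing:
library('itertools') # assumption: isplitCols() iterates over a matrix in blocks of columns
# Hypothetical Attempt 4: the iterator builds each block of columns in the master
# and (I assume) sends only that block to the worker, instead of the whole matrix.
stand4 <- foreach(mc = isplitCols(big_matrix, chunks = n.cores), .combine = 'cbind') %dopar% {
  scale(mc, center = TRUE, scale = TRUE)
}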
I have two questions:
- What’s a more efficient way to standardize these large matrices (with foreach or via some other method)? One direction I’m wondering about is sketched after this list.
- If I want to do “something” (standardization or some other task) to the columns of a large matrix using foreach, what’s a better way to do it? Attempt 2 is slow because doing one column at a time has too much per-task overhead, and Attempt 3 is memory-inefficient because going in chunks means copying the entire matrix to each worker. What’s a better way to process a large matrix with foreach?
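For reference, the sketch I mentioned under the first question: a single-threaded pass that computes the column means and standard deviations once and then sweeps them out. I’m assuming the matrixStats package here (specifically matrixStats::colSds), and I haven’t benchmarked it, so I don’t know whether it actually beats scale():
library('matrixStats') # assumption: colSds() returns column standard deviations (n-1 denominator, like sd())
standardize_cols <- function(mat) {
  cm  <- colMeans(mat)                # column means (base R)
  csd <- matrixStats::colSds(mat)     # column standard deviations
  centered <- sweep(mat, 2, cm, '-')  # subtract each column's mean
  sweep(centered, 2, csd, '/')        # divide each column by its sd
}
system.time( stand5 <- standardize_cols(big_matrix) )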