I have a large matrix (approximately 35,000 x 35,000) and I'm preparing a distance object in R for hierarchical clustering. The base R function dist() is too slow, so I'm using the distances() function from the distances package (https://cran.r-project.org/web/packages/distances/distances.pdf). I have also added parallel processing to speed up the computation, but it still takes around 10 hours to run. Below is the code I am currently using; the resulting distance matrix is then passed to hclustgeo() from the ClustGeo package.
Is there any way I can speed this up, given that I have an even bigger matrix (48,000 x 48,000) to run next?
If I cannot make this faster in R, should I switch to Python for better performance?
Additionally, I have already tried the parDist() function from the parallelDist package, and it is not faster than distances().
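For reference, this is roughly how I invoked parDist(); the method and thread count shown here are illustrative assumptions (Euclidean is the package default):

library(parallelDist)
## parDist() returns a 'dist' object directly; threads = NULL would use all available cores
d <- parDist(dat, method = "euclidean", threads = parallel::detectCores())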
### Calculate the distance object
library(distances)
library(parallel)
library(doParallel)   # provides registerDoParallel()
library(ClustGeo)
### My data 'dat' is a 35k x 35k numeric matrix
start <- Sys.time()
cl <- makeCluster(detectCores())
registerDoParallel(cl)
## distances() builds the distance object; distance_matrix() materializes the full matrix
dist_mat <- distance_matrix(distances(dat))
stopCluster(cl)
end <- Sys.time()
end - start ### this takes about 10 hours for the 35K x 35K matrix
### Hierarchical clustering using hclustgeo()
tree <- hclustgeo(dist_mat)
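One alternative I am considering for the Euclidean case is to compute the squared distances with matrix algebra, so that a multithreaded BLAS (e.g. OpenBLAS or MKL linked to R) does the heavy lifting. This is only a sketch: it assumes the distances are Euclidean and that a few extra 35K x 35K double matrices (roughly 10 GB each) fit in RAM; dist_mat2 and tree2 are placeholder names:

### Sketch: Euclidean distances via ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 * x_i . x_j
sq_norms <- rowSums(dat^2)                                   # squared norm of each row
d2 <- outer(sq_norms, sq_norms, "+") - 2 * tcrossprod(dat)   # all pairwise squared distances
d2[d2 < 0] <- 0                                              # clamp tiny negatives from rounding
dist_mat2 <- as.dist(sqrt(d2))                               # 'dist' object accepted by hclustgeo()
tree2 <- hclustgeo(dist_mat2)

Since tcrossprod() is a single BLAS call, this approach scales with the number of BLAS threads rather than with R-level parallelism, which may matter more here than makeCluster().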