In order to analyze genome-wide association data with a particular modeling tool, I need a design matrix X that has columns for both the non-genomic factors for which I want to adjust and the genomic features. To create such a matrix, I need a function that acts like cbind(): in particular, I need to combine, column-wise, the in-memory matrix A of non-genomic predictors with the filebacked matrix B of genomic predictors. Note that B will have type = 'raw' (coming from an upstream function in the script), as all values in B are 0, 1, or 2. I need something like C <- cbind(A, B), where C is a filebacked matrix with type = 'double'.
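To make the intended behavior concrete, here is a tiny, fully in-memory sketch of the result I am after (the *_small objects are illustrative only; my real B is filebacked and far too large to hold in memory):
# illustrative only: the combined design matrix I want, on toy in-memory data
A_small <- matrix(rnorm(5 * 2), nrow = 5, ncol = 2)              # non-genomic predictors
B_small <- matrix(sample(0:2, 5 * 3, replace = TRUE), nrow = 5)  # genotype counts (0/1/2)
C_small <- cbind(A_small, B_small)  # base cbind() coerces everything to double here
dim(C_small)                        # 5 rows, ncol(A_small) + ncol(B_small) columns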
Here is the draft of my function:
#' A cbind() function for high-dimensional data
#'
#' @param A in-memory data (a numeric matrix)
#' @param B file-backed data (a filebacked.big.matrix with type = 'raw')
#' @param C file-backed placeholder for the combined data (a filebacked.big.matrix with type = 'double')
#' @param quiet Logical; if FALSE, show a progress bar
#'
#' @return C, filled in with all column values of A and B combined
big_cbind <- function(A, B, C, quiet) {
  if (!quiet) {
    pb <- txtProgressBar(min = 0, max = ncol(A) + ncol(B), style = 3)
  }
  # fill in the first few columns of C with A
  for (j in 1:ncol(A)) {
    C[, j] <- A[, j, drop = TRUE]
    if (!quiet) setTxtProgressBar(pb, j)
  }
  # fill in the rest of the columns of C with B
  for (j in 1:ncol(B)) {
    C[, j + ncol(A)] <- as.numeric(B[, j, drop = TRUE])
    if (!quiet) setTxtProgressBar(pb, ncol(A) + j)
  }
  if (!quiet) close(pb)
  return(C)
}
The example below shows that this function is both (1) slow and (2) using up memory in the R session. Problem (1) is a challenge I anticipated; problem (2) is counterintuitive to me. At what point in the example below is data from a filebacked object being read into or stored in memory?
library(bigstatsr)
library(bigmemory)
# n = number of observations
n <- 2000
# p = number of features (e.g., # of SNPs)
# Note: this is unrealistically small 'p' for GWAS - I have about 500,000 SNPs in
# my real data
p <- 50000
# matrix of non-genetic predictors (e.g., age, sex, ...)
A <- matrix(rnorm(n*2), n, 2)
# matrix of genotype data from PLINK files
fake_bin_data <- sample(0:2, size = n*p, replace = TRUE, prob = c(0.5, 0.3, 0.2))
B <- matrix(fake_bin_data, n, p) |> bigstatsr::as_FBM(type = 'raw')
B <- B$bm() # need this to be a big.matrix for my current implementation
C <- bigstatsr::FBM(nrow = nrow(B),
                    ncol = ncol(A) + ncol(B),
                    type = 'double')
C <- C$bm()
Look at memory at this point:
sort(sapply(ls(), function(x){object.size(get(x))}))
Now observe how much time/memory big_cbind() is using:
system.time(
C <- big_cbind(A, B, C, quiet = FALSE)
)
C[1:20,1:10] # desired outcome: design matrix representing all predictors
# see how much memory is used now
sort(sapply(ls(), function(x){object.size(get(x))}))
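For what it is worth, here is a rough sketch of how I try to check peak memory around the big_cbind() call itself, using base R's gc(): reset the counters first, then look at the 'max used' column afterwards. This is only an approximation, not a formal profile.
gc(reset = TRUE)                        # reset the 'max used' counters
C <- big_cbind(A, B, C, quiet = TRUE)   # re-run the step of interest
gc()                                    # 'max used' shows the peak allocation during the call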
Moreover, my RStudio environment says that current memory usage is over 1.5 GB. As I scale up to GWAS-sized data, this becomes a real issue. Any ideas on what is going on in my function that makes memory usage increase like this? Or suggestions on how I could make it more efficient?