I am looking for an algorithm (pseudocode or R code) for my problem here. This looks like a lot to read but is actually straightforward; I am just being verbose.
I have a list (say coords) with 8 million elements. They look like this:
> head(coords)
cstart cend
1 35085 35094
2 35089 35098
3 35374 35383
4 35512 35521
5 35513 35522
6 35514 35523
...
98 12600309 12600318
These elements represent coordinates on a DNA string, i.e., 35374 refers to a position on the DNA string. I don't care whether it's an A, C, G, or T, for what it's worth. ALL coords are length 10, so the cend field can be calculated. (I don't actually use cend in my code.)
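For example, since every site is exactly 10 bases, cend can be derived in one vectorized step (assuming inclusive coordinates, as the sample rows suggest):

coords$cend <- coords$cstart + 9  ## inclusive end of a 10-base site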
I have a secondary list (say alignments, ~100 elements) whose elements also have a start and end coordinate. Each one defines an area of interest. It looks like this:
chr_name chr_strand start_index end_index
"chr1" "+" "12600311" "12600325"
The columns chr_name and chr_strand are not important. In this case, our "area of interest" could be larger than 10.
My problem
Now, what I want to do is go through all 8 million records and see if each record is contained in the alignments list (up to 80% overlap, which I will explain). For the above example, the first 6 elements are nowhere "near" the area of interest.
Row 98 starts 2 "base pairs" before our area of interest, BUT overlaps the area of interest by 80%, i.e., 80% of the "site" is contained in the area of interest.
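As an aside, that overlap fraction can be computed arithmetically for a single site, without checking each base. A minimal sketch, assuming inclusive integer coordinates and the fixed site length of 10:

overlap_fraction <- function(cstart, aoe_start, aoe_end) {
  ## bases of the site [cstart, cstart + 9] inside [aoe_start, aoe_end]
  overlap <- min(cstart + 9, aoe_end) - max(cstart, aoe_start) + 1
  max(overlap, 0) / 10
}
overlap_fraction(12600309, 12600311, 12600325)  ## 0.8, matching row 98 above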
Result
Basically, I am looking for a list (I can preallocate it) where each element is TRUE/FALSE. So from the above I'll have a list like the following:
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ..., TRUE (98th element, for row 98)
My current R code
I have documented R code that is very easy to read:
## mdply is a function that acts essentially like a loop.
## .data is the data I am working with: the 8 million records.
## For each row, it executes a function whose signature
## matches the column names.
## The best part about this... IS THAT IT'S PARALLELIZED,
## and I can use up to 24 cores to do this.
library(plyr)

b <- mdply(.data = coords,
           function(cstart, cend) {
             ## the function is executed for each row;
             ## loop through our "areas of interest"
             for (j in 1:nrow(alignments)) {
               chr_info <- extract_information_from_alignment(alignments[j, ])
               interesting <- 0
               uninteresting <- 0
               hits_index_start <- cstart
               hits_length <- 10  ## the length of every site
               aoe_start <- as.numeric(chr_info[["start_index"]])
               aoe_end <- as.numeric(chr_info[["end_index"]])
               ## go through EACH INDEX AND CALCULATE THE OVERLAP
               for (k in hits_index_start:(hits_index_start + hits_length - 1)) {
                 if ((aoe_start <= k) && (k <= aoe_end)) {
                   interesting <- interesting + 1
                 } else {
                   uninteresting <- uninteresting + 1
                 }
               }  ## end for - overlap calc
               substr_score <- interesting / (interesting + uninteresting)
               if (substr_score >= substr_score_threshold) {
                 return(TRUE)  ## one matching area of interest is enough
               }
             }  ## end for over alignments
             return(FALSE)  ## no area of interest matched this site
           }, .parallel = TRUE)
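One setup detail worth noting (an assumption on my part, since that code isn't shown above): plyr's .parallel = TRUE only actually runs in parallel if a foreach backend has been registered first, e.g. via the doMC package:

library(doMC)
registerDoMC(cores = 24)  ## register a foreach backend so .parallel = TRUE can use 24 cores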
I ran this code last night for 13 hours and it wasn't complete. I estimate it would take about 16 hours.
Ideas
I think the bottleneck is that for each row there is a for loop over the ~100 alignments, and within that loop a second loop that calculates the overlap.
But I am thinking I don't need to go through EACH alignment. I should first test whether my cstart is even close to an area of interest.
Maybe some sort of binary search?
As you've noticed, nested loops can get expensive. 8 million iterations of the outer loop, times 100 iterations of the middle loop, times 10 iterations of the inner loop (the site length) adds up to a lot of processing.
I think you're on the right track though: try to eliminate as many of those 8 million coords as possible before doing any processing on them. Binary search is definitely your friend.
The approach I'd take would look roughly like this:
sort the coords list                      // done once, use a library function, fast
create your output array, initialize all entries to 'false'
for each alignment in alignments          // this loop iterates ~100 times
    // the smallest coord value that could still give 80% coverage is
    // alignment.start_index - 2 (since all coords are of length 10
    // and you need an 80% overlap)
    let searchStart = alignment.start_index - 2
    binary-search for searchStart in the sorted coords list
    // you'll either find the first qualifying coord, or the position
    // in the list where it would go
    // do the same for the search end (the alignments appear to be
    // variable length, and a second binary search is likely faster than
    // iterating over the coords from the starting point)
    let searchEnd = alignment.end_index - 7
    binary-search for searchEnd in the sorted coords list
    // mark every coord between the two positions as 'true'
I would expect this to run fairly quickly even non-parallel – certainly not requiring hours.
All it's doing is calling binary search on an 8-million-entry list, 100 or 200 times (depending on whether you binary-search for the end index or iterate to it).
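Here is a minimal R sketch of that idea, assuming coords and alignments look like the structures above (start_index/end_index stored as character), that the 80% rule means at least 8 of the site's 10 bases fall inside the area of interest, and that every area of interest is at least 10 bases long. findInterval() performs the binary searches on the sorted vector:

site_len <- 10
min_hit  <- ceiling(0.8 * site_len)   ## 8 of the 10 bases must overlap

ord     <- order(coords$cstart)       ## sort once
cstarts <- coords$cstart[ord]

result <- logical(nrow(coords))       ## preallocated, all FALSE

for (j in seq_len(nrow(alignments))) {
  s <- as.numeric(alignments[j, "start_index"])
  e <- as.numeric(alignments[j, "end_index"])

  ## a site starting at c overlaps this area of interest by >= min_hit bases
  ## exactly when  s - (site_len - min_hit) <= c <= e - (min_hit - 1)
  lo <- s - (site_len - min_hit)      ## start_index - 2
  hi <- e - (min_hit - 1)             ## end_index - 7

  ## findInterval() binary-searches the sorted integer cstarts
  first <- findInterval(lo - 1, cstarts) + 1  ## first index with cstart >= lo
  last  <- findInterval(hi, cstarts)          ## last index with cstart <= hi

  if (first <= last) {
    result[ord[first:last]] <- TRUE           ## map back to original row order
  }
}

The sort plus two findInterval() calls per alignment replace the per-base inner loop entirely, so the total work is roughly one sort and ~200 binary searches.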