I am looking for an algorithm (pseudocode or R code) for my problem here. This looks like a lot to read but is actually straightforward; I am just being verbose.
I have a list (say coords) with 8 million elements. They look like this:
> head(coords)
cstart cend
1 35085 35094
2 35089 35098
3 35374 35383
4 35512 35521
5 35513 35522
6 35514 35523
...
98 12600309 12600318
These elements represent coordinates on a DNA string, i.e., 35374 refers to a position on the DNA string. I don't care whether it's an A, C, G, or T, for what it's worth. ALL coords are length 10, so the cend field can be calculated. (I don't actually use cend in my code.)
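For example, since every site is exactly 10 bases, cend can be derived in one vectorized step (assuming inclusive coordinates, as the sample rows suggest):

coords$cend <- coords$cstart + 9  ## inclusive end of a 10-base site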
I have a secondary list (say alignments, ~100 elements) whose elements also have a start and end coordinate. Each one defines an area of interest. It looks like this:
chr_name chr_strand start_index end_index
"chr1" "+" "12600311" "12600325"
The columns chr_name and chr_strand are not important. In this case, our "area of interest" could be larger than 10.
My problem
Now, what I want to do is go through all 8 million records and see if each record is contained in the alignments list (up to 80% overlap, which I will explain). For the above example, the first 6 elements are nowhere "near" the area of interest.
Row 98 starts 2 "base pairs" before our area of interest, BUT overlaps the area of interest by 80%, i.e., 80% of the "site" is contained in the area of interest.
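As an aside, that overlap fraction can be computed arithmetically for a single site, without checking each base. A minimal sketch, assuming inclusive integer coordinates and the fixed site length of 10:

overlap_fraction <- function(cstart, aoe_start, aoe_end) {
  ## bases of the site [cstart, cstart + 9] inside [aoe_start, aoe_end]
  overlap <- min(cstart + 9, aoe_end) - max(cstart, aoe_start) + 1
  max(overlap, 0) / 10
}
overlap_fraction(12600309, 12600311, 12600325)  ## 0.8, matching row 98 above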
Result
Basically, I am looking for a list (I can preallocate it) where each element is TRUE/FALSE. So from the above I'll have a list like the following:
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ..., TRUE (98th element, for row 98)
My current R code
I have documented R code that is very easy to read:
## mdply is a function that acts essentially like a loop.
## .data is the data I am working with: the 8 million records.
## For each row, it executes a function whose signature
## matches the column names.
## The best part about this... IS THAT IT'S PARALLELIZED,
## and I can use up to 24 cores to do this.
library(plyr)

b <- mdply(.data = coords,
           function(cstart, cend) {
             ## the function is executed for each row;
             ## loop through our "areas of interest"
             for (j in 1:nrow(alignments)) {
               chr_info <- extract_information_from_alignment(alignments[j, ])
               interesting <- 0
               uninteresting <- 0
               hits_index_start <- cstart
               hits_length <- 10  ## the length of every site
               aoe_start <- as.numeric(chr_info[["start_index"]])
               aoe_end <- as.numeric(chr_info[["end_index"]])
               ## go through EACH INDEX AND CALCULATE THE OVERLAP
               for (k in hits_index_start:(hits_index_start + hits_length - 1)) {
                 if ((aoe_start <= k) && (k <= aoe_end)) {
                   interesting <- interesting + 1
                 } else {
                   uninteresting <- uninteresting + 1
                 }
               }  ## end for - overlap calc
               substr_score <- interesting / (interesting + uninteresting)
               if (substr_score >= substr_score_threshold) {
                 return(TRUE)  ## one matching area of interest is enough
               }
             }  ## end for over alignments
             return(FALSE)  ## no area of interest matched this site
           }, .parallel = TRUE)
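One setup detail worth noting (an assumption on my part, since that code isn't shown above): plyr's .parallel = TRUE only actually runs in parallel if a foreach backend has been registered first, e.g. via the doMC package:

library(doMC)
registerDoMC(cores = 24)  ## register a foreach backend so .parallel = TRUE can use 24 cores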
I ran this code last night for 13 hours and it wasn't complete. I estimate it would take about 16 hours.
Ideas
I think the bottleneck is that for each row there is a for loop over the ~100 alignments, and within that loop a second loop that calculates the overlap.
But I am thinking I don't need to go through EACH alignment. I should first test whether my cstart is even close to an area of interest.
Maybe some sort of binary search?
As you've noticed, nested loops can get expensive. 8 million iterations of the outer loop, times 100 iterations of the middle loop, times 10 iterations of the inner loop (the site length) adds up to a lot of processing.
I think you're on the right track though: try to eliminate as many of those 8 million coords as possible before doing any processing on them. Binary search is definitely your friend.
The approach I'd take would look roughly like this:
sort the coords list                      // done once, use a library function, fast
create your output array, initialize all entries to 'false'
for each alignment in alignments          // this loop iterates ~100 times
    // the smallest coord value that could still give 80% coverage is
    // alignment.start_index - 2 (since all coords are of length 10
    // and you need an 80% overlap)
    let searchStart = alignment.start_index - 2
    binary-search for searchStart in the sorted coords list
    // you'll either find the first qualifying coord, or the position
    // in the list where it would go
    // do the same for the search end (the alignments appear to be
    // variable length, and a second binary search is likely faster than
    // iterating over the coords from the starting point)
    let searchEnd = alignment.end_index - 7
    binary-search for searchEnd in the sorted coords list
    // mark every coord between the two positions as 'true'
I would expect this to run fairly quickly even non-parallel – certainly not requiring hours.
All it's doing is calling binary search on an 8-million-entry list, 100 or 200 times (depending on whether you binary-search for the end index or iterate to it).
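Here is a minimal R sketch of that idea, assuming coords and alignments look like the structures above (start_index/end_index stored as character), that the 80% rule means at least 8 of the site's 10 bases fall inside the area of interest, and that every area of interest is at least 10 bases long. findInterval() performs the binary searches on the sorted vector:

site_len <- 10
min_hit  <- ceiling(0.8 * site_len)   ## 8 of the 10 bases must overlap

ord     <- order(coords$cstart)       ## sort once
cstarts <- coords$cstart[ord]

result <- logical(nrow(coords))       ## preallocated, all FALSE

for (j in seq_len(nrow(alignments))) {
  s <- as.numeric(alignments[j, "start_index"])
  e <- as.numeric(alignments[j, "end_index"])

  ## a site starting at c overlaps this area of interest by >= min_hit bases
  ## exactly when  s - (site_len - min_hit) <= c <= e - (min_hit - 1)
  lo <- s - (site_len - min_hit)      ## start_index - 2
  hi <- e - (min_hit - 1)             ## end_index - 7

  ## findInterval() binary-searches the sorted integer cstarts
  first <- findInterval(lo - 1, cstarts) + 1  ## first index with cstart >= lo
  last  <- findInterval(hi, cstarts)          ## last index with cstart <= hi

  if (first <= last) {
    result[ord[first:last]] <- TRUE           ## map back to original row order
  }
}

The sort plus two findInterval() calls per alignment replace the per-base inner loop entirely, so the total work is roughly one sort and ~200 binary searches.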