Algorithm for a fast search

I am looking for an algorithm (pseudocode, R code) for my problem here. This looks like a lot to read but is actually very very easy to read, and I am just trying to be verbose.

I have a list (say coords) with 8 million elements. They look like this:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>> head(coords)
cstart cend
1 35085 35094
2 35089 35098
3 35374 35383
4 35512 35521
5 35513 35522
6 35514 35523
...
98 12600309 12600318
</code>
<code>> head(coords) cstart cend 1 35085 35094 2 35089 35098 3 35374 35383 4 35512 35521 5 35513 35522 6 35514 35523 ... 98 12600309 12600318 </code>
> head(coords)
  cstart  cend
1  35085 35094
2  35089 35098
3  35374 35383
4  35512 35521
5  35513 35522
6  35514 35523
...
98 12600309 12600318 

What these elements represent are coordinates on a DNA string. Ie, 35374 refers to the position on the DNA string. I don’t care whether its a A C G T for what its worth. AND ALL coords are length 10. So the cend field can be calculated. (I don’t actually use this cend in my code)

I have a secondary list (say alignments) (~100 elements) that also have a start and end coordinate. This defines are area of interest. It looks like this

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code> chr_name chr_strand start_index end_index
"chr1" "+" "12600311" "12600325"
</code>
<code> chr_name chr_strand start_index end_index "chr1" "+" "12600311" "12600325" </code>
 chr_name       chr_strand      start_index        end_index
  "chr1"              "+"       "12600311"          "12600325"

The columns chr_name, chr_strand is not important. In this case, our “area of interest” could be larger than 10.


My problem

Now, what I want to do is go through all 8 million records and see if each record is contained in the interests list (up to 80% overlap, I will explain this). For the above example, the first 6 elements are nowhere “near” the area of interest.

Row 98 starts 2 “base pairs” before our area of interest, BUT overlaps the area of interest up to 80%. Ie, 80% of the “site” is contained in the area of interest.

Result

Basically, I am looking for a list (I can preallocate it), where each element is a TRUE/FALSE. So from above I’ll have a list like the following

FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ..., TRUE (97'th index)

My current R code
I have documented R code that is very easy to read

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code> ## mdply is a function that acts essentially like a loop
## the .data is the data I am working with.. the 8 million records
## for each row, it executes a function where the signature of the function
## matches the column names.
## the best part about this .. IS THAT ITS PARALLELIZED
## and I can use up to 24 cores to do this.
b <- mdply(.data = coords,
function(cstart, cend){
#the function is executed for each row.
#loop through our "area of interests"
for(j in 1:nrow(alignments)){
chr_info <- extract_information_from_alignment(alignments[j, ])
interesting <- 0
uninteresting <- 0
hits_index_start <- cstart
hits_length <- 10 ## the length of our matrix
aoe_start <- as.numeric(chr_info[["start_index"]])
aoe_end <- chr_info[["end_index"]]
## go through EACH INDEX AND CALCULATE THE OVERLAP
for(j in hits_index_start:(hits_index_start + hits_length-1)){
if((aoe_start <= j) && (j <= aoe_end)){
interesting <- interesting + 1
} else {
uninteresting <- uninteresting + 1
}
} #end for - interesting calc
substr_score = interesting/sum(interesting + uninteresting)
if(substr_score >= substr_score_threshold){
return(TRUE)
} else {
return(FALSE)
}
} #end for
}, .parallel = TRUE)
</code>
<code> ## mdply is a function that acts essentially like a loop ## the .data is the data I am working with.. the 8 million records ## for each row, it executes a function where the signature of the function ## matches the column names. ## the best part about this .. IS THAT ITS PARALLELIZED ## and I can use up to 24 cores to do this. b <- mdply(.data = coords, function(cstart, cend){ #the function is executed for each row. #loop through our "area of interests" for(j in 1:nrow(alignments)){ chr_info <- extract_information_from_alignment(alignments[j, ]) interesting <- 0 uninteresting <- 0 hits_index_start <- cstart hits_length <- 10 ## the length of our matrix aoe_start <- as.numeric(chr_info[["start_index"]]) aoe_end <- chr_info[["end_index"]] ## go through EACH INDEX AND CALCULATE THE OVERLAP for(j in hits_index_start:(hits_index_start + hits_length-1)){ if((aoe_start <= j) && (j <= aoe_end)){ interesting <- interesting + 1 } else { uninteresting <- uninteresting + 1 } } #end for - interesting calc substr_score = interesting/sum(interesting + uninteresting) if(substr_score >= substr_score_threshold){ return(TRUE) } else { return(FALSE) } } #end for }, .parallel = TRUE) </code>
  ## mdply is a function that acts essentially like a loop
  ## the .data is the data I am working with.. the 8 million records
  ## for each row, it executes a function where the signature of the function
  ## matches the column names. 
  ## the best part about this .. IS THAT ITS PARALLELIZED
  ## and I can use up to 24 cores to do this.
  b <- mdply(.data = coords, 
             function(cstart, cend){   

               #the function is executed for each row. 
               #loop through our "area of interests"
               for(j in 1:nrow(alignments)){
                 chr_info <- extract_information_from_alignment(alignments[j, ])
                 interesting <- 0
                 uninteresting <- 0
                 hits_index_start <- cstart
                 hits_length <- 10 ## the length of our matrix               
                 aoe_start <- as.numeric(chr_info[["start_index"]])
                 aoe_end <- chr_info[["end_index"]]    

                 ## go through EACH INDEX AND CALCULATE THE OVERLAP
                 for(j in hits_index_start:(hits_index_start + hits_length-1)){
                   if((aoe_start <= j) && (j <= aoe_end)){
                     interesting <- interesting + 1
                   } else {
                     uninteresting <- uninteresting + 1
                   }          
                 } #end for - interesting calc
                 substr_score = interesting/sum(interesting + uninteresting)
                 if(substr_score >= substr_score_threshold){
                   return(TRUE)
                 } else { 
                   return(FALSE)
                 }      
               } #end for
             }, .parallel = TRUE)

I ran this code last night for 13 hours and it wasn’t completed. I estimate it takes about 16 hours..

Ideas
I think the bottleneck is that for each row, it has a for loop (100 elements) and within that for loop, there is a second one that calculates the overlap.

But I am thinking, I don’t need to go through EACH alignment. I should first test if my cstart even is close to the area of interest.

Maybe some sort of binary search?

As you’ve noticed, nested loops can get to be expensive. 8 million iterations of the outer loop times 100 iterations of the middle loop times (I’m not sure) iterations of the inner loop leads to a lot of processing.

I think you’re on the right track though – try to eliminate as many of those 8 million coords as possible before doing any processing on them. Binary Search is definitely your friend.

The approach I’d take would look roughly like

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>sort the coords list // (done once, use a library function, happens fast)
create your output array, initialize all to 'false'
for each alignment in alignments // this loop should iterate 100x
// the first alignment value that would fit your 80% coverage
// would be alignment.start_index - 2 (since all coords are of length 10
// and you need an 80% overlap)
let searchStart = alignment.start_index - 2
call a binary search function for searchStart in the coords list
// you'll either find the first area of interest, or the location
// in the list where that first one would go.
// do the same for the search end (since the alignment appears to be variable length)
// and because iterating over the list of coords from the starting point
// would probably be slower
</code>
<code>sort the coords list // (done once, use a library function, happens fast) create your output array, initialize all to 'false' for each alignment in alignments // this loop should iterate 100x // the first alignment value that would fit your 80% coverage // would be alignment.start_index - 2 (since all coords are of length 10 // and you need an 80% overlap) let searchStart = alignment.start_index - 2 call a binary search function for searchStart in the coords list // you'll either find the first area of interest, or the location // in the list where that first one would go. // do the same for the search end (since the alignment appears to be variable length) // and because iterating over the list of coords from the starting point // would probably be slower </code>
sort the coords list // (done once, use a library function, happens fast)
create your output array, initialize all to 'false'
for each alignment in alignments // this loop should iterate 100x
  // the first alignment value that would fit your 80% coverage
  // would be alignment.start_index - 2 (since all coords are of length 10
  // and you need an 80% overlap)
  let searchStart = alignment.start_index - 2
  call a binary search function for searchStart in the coords list
  // you'll either find the first area of interest, or the location
  // in the list where that first one would go.  

  // do the same for the search end (since the alignment appears to be variable length)
  // and because iterating over the list of coords from the starting point
  // would probably be slower

I would expect this to run non-parallel fairly quickly – certainly not requiring hours.

All it’s doing is calling binary Search on an 8 million entry list, 100 or 200 times (depending on if you search for the end_index or iterate to it).

0

Trang chủ Giới thiệu Sinh nhật bé trai Sinh nhật bé gái Tổ chức sự kiện Biểu diễn giải trí Dịch vụ khác Trang trí tiệc cưới Tổ chức khai trương Tư vấn dịch vụ Thư viện ảnh Tin tức - sự kiện Liên hệ Chú hề sinh nhật Trang trí YEAR END PARTY công ty Trang trí tất niên cuối năm Trang trí tất niên xu hướng mới nhất Trang trí sinh nhật bé trai Hải Đăng Trang trí sinh nhật bé Khánh Vân Trang trí sinh nhật Bích Ngân Trang trí sinh nhật bé Thanh Trang Thuê ông già Noel phát quà Biểu diễn xiếc khỉ Xiếc quay đĩa Dịch vụ tổ chức sự kiện 5 sao Thông tin về chúng tôi Dịch vụ sinh nhật bé trai Dịch vụ sinh nhật bé gái Sự kiện trọn gói Các tiết mục giải trí Dịch vụ bổ trợ Tiệc cưới sang trọng Dịch vụ khai trương Tư vấn tổ chức sự kiện Hình ảnh sự kiện Cập nhật tin tức Liên hệ ngay Thuê chú hề chuyên nghiệp Tiệc tất niên cho công ty Trang trí tiệc cuối năm Tiệc tất niên độc đáo Sinh nhật bé Hải Đăng Sinh nhật đáng yêu bé Khánh Vân Sinh nhật sang trọng Bích Ngân Tiệc sinh nhật bé Thanh Trang Dịch vụ ông già Noel Xiếc thú vui nhộn Biểu diễn xiếc quay đĩa Dịch vụ tổ chức tiệc uy tín Khám phá dịch vụ của chúng tôi Tiệc sinh nhật cho bé trai Trang trí tiệc cho bé gái Gói sự kiện chuyên nghiệp Chương trình giải trí hấp dẫn Dịch vụ hỗ trợ sự kiện Trang trí tiệc cưới đẹp Khởi đầu thành công với khai trương Chuyên gia tư vấn sự kiện Xem ảnh các sự kiện đẹp Tin mới về sự kiện Kết nối với đội ngũ chuyên gia Chú hề vui nhộn cho tiệc sinh nhật Ý tưởng tiệc cuối năm Tất niên độc đáo Trang trí tiệc hiện đại Tổ chức sinh nhật cho Hải Đăng Sinh nhật độc quyền Khánh Vân Phong cách tiệc Bích Ngân Trang trí tiệc bé Thanh Trang Thuê dịch vụ ông già Noel chuyên nghiệp Xem xiếc khỉ đặc sắc Xiếc quay đĩa thú vị
Trang chủ Giới thiệu Sinh nhật bé trai Sinh nhật bé gái Tổ chức sự kiện Biểu diễn giải trí Dịch vụ khác Trang trí tiệc cưới Tổ chức khai trương Tư vấn dịch vụ Thư viện ảnh Tin tức - sự kiện Liên hệ Chú hề sinh nhật Trang trí YEAR END PARTY công ty Trang trí tất niên cuối năm Trang trí tất niên xu hướng mới nhất Trang trí sinh nhật bé trai Hải Đăng Trang trí sinh nhật bé Khánh Vân Trang trí sinh nhật Bích Ngân Trang trí sinh nhật bé Thanh Trang Thuê ông già Noel phát quà Biểu diễn xiếc khỉ Xiếc quay đĩa
Thiết kế website Thiết kế website Thiết kế website Cách kháng tài khoản quảng cáo Mua bán Fanpage Facebook Dịch vụ SEO Tổ chức sinh nhật