I have a toy dataframe where the first 3 columns signify chromosome location, and the fourth column shows whether consecutive rows overlap.
df <- data.frame(
  chrom = c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1"),
  start = c(10, 20, 25, 30, 90, 100),
  end = c(20, 30, 38, 40, 120, 200),
  count = c("no_overlap", "overlap", "overlap", "overlap", "overlap", "overlap")
)
Now I’d like to create a new dataframe in which a region is split wherever it overlaps another region. For example, the 2nd read (20 30) becomes (20 24 1), where 1 is the count and the end is the next read's start minus 1 (24 = 25 - 1). The 3rd read (chr1 25 38 2) keeps its location, but its count becomes 2 because it overlaps both the previous read (25-30) and the subsequent read (31-38). The other reads are handled the same way.
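One way to make the splitting rule concrete is a rough base R sketch. It assumes closed, 1-based coordinates, cuts every region at each start and at each end + 1, and counts how many input reads cover each piece. The helper name `split_by_coverage` is hypothetical and not part of any package. Because this version counts strictly per base, it reproduces the pieces for the 90-200 cluster (90-99, 100-120, 121-200), but it splits the first cluster more finely than the anticipated output in this post, since positions 20 and 30 are each shared by reads that only touch at one base.

```r
# Hypothetical helper (an illustration, not an established function):
# split intervals at every start and every end + 1, then count how many
# input intervals cover each resulting piece.
# Assumes closed, 1-based coordinates and columns chrom, start, end.
split_by_coverage <- function(df) {
  pieces <- lapply(split(df, df$chrom), function(d) {
    # Each start opens a piece; each end + 1 opens the following piece
    breaks <- sort(unique(c(d$start, d$end + 1)))
    seg_start <- head(breaks, -1)
    seg_end <- tail(breaks, -1) - 1
    # Number of input intervals that fully contain each piece
    n <- vapply(seq_along(seg_start), function(i) {
      sum(d$start <= seg_start[i] & d$end >= seg_end[i])
    }, integer(1))
    data.frame(
      chrom = rep(d$chrom[1], length(seg_start)),
      start = seg_start,
      end = seg_end,
      count = n
    )
  })
  out <- do.call(rbind, pieces)
  # Drop gaps between clusters that no read covers
  out <- out[out$count > 0, ]
  rownames(out) <- NULL
  out
}
```

With the toy data above this would be called as `split_by_coverage(df)`. Merging the single-base pieces back into their neighbours would need an extra rule that the description does not fully pin down, so it is left out here.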
This is my anticipated output:
chrom start   end count
<chr> <dbl> <dbl> <int>
chr1     10    20     1
chr1     20    24     1
chr1     25    38     2
chr1     39    40     1
chr1     90    99     1
chr1    100   120     2
chr1    121   200     1
Hopefully, I’m able to convey my message properly.