Assume you have vector numeric vector x
and a data frame df
with columns start
and stop
. Is there a clever way to return a logical vector with length equal to x
indicating if x is in at least one interval defined by start
or stop
?
The actual case I’m working with has length(x)
>> nrow(df)
. The naïve way to do this would be using a for loop but I was hoping for something more elegant and that runs fast.
x <- 1:10
df <- data.frame(start = c(0, 4.5, 6), stop = c(1, 5.5, 8.5))
z <- rep(FALSE, length(x))
for(i in 1:nrow(df)){
z <- z | (df$start[i] <= x & x <= df$stop[i])
}
x[z] # 1 5 6 7 8
1
Maybe you can use outer
like below
> with(df, x[rowMeans(outer(x, start, `>=`) & outer(x, stop, `<=`)) > 0])
[1] 1 5 6 7 8
or, you can use %inrange%
from data.table
> library(data.table)
> x[x %inrange% df]
[1] 1 5 6 7 8
5
This is maybe too clever, but does solve your problem. (@ThomasIsCoding’s is very similar but slightly more compact …)
x_gt_start <- outer(x,df$start, ">=")
x_lt_stop <- outer(x,df$stop, "<=")
between <- rowSums(x_gt_start & x_lt_stop) > 0 ## or: apply(..., 1, any)
x[between]
Even faster would be to use Rcpp
to write code that looped and short-circuited (i.e., returned TRUE as soon as the condition was met once for a given x
, rather than going through all of the comparisons)
1
Edit: As user ThomasIsCoding kindly points out, there is data.table::inrange()
which is vectorised:
x = seq(10)
df = data.frame(start = c(0, 4.5, 6), stop = c(1, 5.5, 8.5))
x[data.table::inrange(x, df$start, df$stop)]
#> [1] 1 5 6 7 8
4
Here’s a solution I came up with using the findInterval()
function, which I think (the documentation doesn’t specify) uses a binary search algorithm. It assumes that your start and stop times are ordered and your stop times are less than the next start time (it might not require that last assumption; I haven’t tested it).
findInterval(x, df$start) == (findInterval(x, df$stop, left.open = T) + 1)
I ran various combinations of benchmarks compared to the countrange()
and inrange
solution, and my solution tends to range from mid to slowest, but overall they all seem to do pretty good. It’s hard to judge from arbitrary benchmarking which is best for OP because the relative performance of each will vary based on the absolute size of the objects, as well as the relative size difference between them. My solution seems to get worse as the length of the vector increases relative to the number of rows in the data table.
However, if the x
vector is sorted, then my solution will probably blow the others out of the water; findInterval()
is O(n) if x
is sorted.
In terms of memory used, it’s hard to tell which is best. I’m not aware of any benchmarking package in R that can properly capture C++ memory allocations. Plus, you have to figure out how it scales with the size of the variables. I expect findInterval()
to be very memory efficient regardless. I don’t know about the other solutions. My advice would be to go with whatever is fastest/fast enough (and maybe easy to read/understand), and if you have memory issues, try another, and so on.
4
A vectorized function to count the number of intervals of df
that contain the element of x
. It is faster than data.table::%inrange%
for large datasets.
library(data.table)
countrange <- function(x, df) {
out <- integer(length(x))
n <- nrow(df)
setorder(
data.table(
i = c(1:length(x), integer(n + n)),
r = c(x, unlist(df, 0, 0)),
s = rep.int(c(0L, 1L, -1L), c(length(x), n, n))
), r, -s
)[,s := cumsum(s)][i != 0L, {out[i] <- s; out}]
}
countrange(x, df)
#> [1] 1 0 0 0 1 1 1 1 0 0
Demonstrating on a set of potentially overlapping intervals:
x <- runif(10)
df <- data.table(start = sort(runif(1e5)), stop = sort(runif(1e5)))[
start > stop, `:=`(start = stop, stop = start)
]
countrange(x, df)
#> [1] 201 320 202 27 169 314 131 92 219 0
x %inrange% df
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
Benchmarking
We can squeeze a little more speed with collapse::setv
.
library(collapse)
countrange2 <- function(x, df) {
n <- nrow(df)
m <- length(x)
out <- integer(m)
setorder(
data.table(
i = setv(integer(n + n + m), 1:m, 1:m),
r = c(x, unlist(df, 0, 0)),
s = rep.int(c(0L, 1L, -1L), c(m, n, n))
), r, -s
)[,s := cumsum(s)][i != 0L, {out[i] <- s; out}]
}
While both versions are faster than data.table::%inrange%
for large datasets, they also use more memory.
x <- runif(3e7)
df <- as.data.frame(matrix(sort(runif(8e3)), 4e3, 2, 1))
bench::mark(
countrange = countrange(x, df),
countrange2 = countrange2(x, df),
inrange = x %inrange% df,
check = FALSE,
iterations = 10
)
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 countrange 2.62s 2.71s 0.370 2.26GB 1.55
#> 2 countrange2 2.56s 2.62s 0.381 2.38GB 1.33
#> 3 inrange 4.27s 4.57s 0.214 457.54MB 0.128
2