Just like the question says. Is there a faster way to do what is done below when the vector size is very large (> 10M entries) using base R in the general case?
The code below works, but it does not short-circuit, so when the vector size grows it becomes slow for what should be obvious reasons. In this particular example a for loop might be faster but if the first NA value is very far from the start of the vector maybe not…
set.seed(1)
x <- c(rep(NA, 3), sample(c(T,F), size=5e7, replace=T))
min(which(!is.na(x))) #4
6
An Rcpp
for loop should be fast even if the first non-NA
value is not particularly near the start. The basic idea is that you do not need to keep iterating after you have found the first value.
Rcpp::cppFunction("int which_non_na(LogicalVector x) {
R_xlen_t n = x.size();
for (R_xlen_t i = 0; i < n; ++i) {
if (!LogicalVector::is_na(x[i])) {
return i + 1; // +1 for 1-indexing in R
}
}
return NA_INTEGER; // if no non-NA values are found
}")
which_non_na(x) # 4
Benchmarks
Here are some benchmarks for some vectors of length 1e4
to 1e8
and between 1e3
and 1e6
non-NA
values. With a vector of length 10,000 and 1,000 non-NA values it is around 50 times faster than the base R approach in the question. For the shortest vectors, the approach is around the same speed (or even slightly slower than) the pure R nonNA()
function in the answer by Carl Witthoft. We see the benefits with the longest vectors and the most non-NA
values (so we can exit earlier).
With a vector of length 100 million and 1 million non-NA
values, the median time of the Rcpp
approach is around 600,000 times faster than the approach in the question. The pure R nonNA()
function falls further behind Rcpp
(although it’s still decent and only 7 times as slow). The pure R min(which.min(x), which.max(x), Inf)
approach by ThomasIsCoding is pretty close with a relative speed of 1.86
compared with Rcpp
.
set.seed(1)
nonNA <- (x) { for (jn in 1:length(x)) { if (!is.na(x[jn])) { return(jn) } }; NA }
results <- bench::press(
vec_length = 10^(4:8),
num_non_na = c(1e3, 1e5, 1e6),
{
# create vector with non-NA
if (num_non_na >= vec_length) num_non_na <- vec_length - 1
x <- sample(c(rep(c(TRUE, FALSE), length.out = num_non_na), rep(NA, vec_length - num_non_na)))
bench::mark(
relative = TRUE,
base_r = min(which(!is.na(x))),
which_min = min(which.min(x), which.max(x), Inf),
which_1 = which(!is.na(x))[1],
position = Position((x) !is.na(x), x),
non_na = nonNA(x),
rcpp = which_non_na(x)
)
}
)
Output:
expression vec_length num_non_na min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <bch:tm>
1 base_r 10000 1000 54.2 57.8 1 Inf Inf 9978 22 193.91ms
2 which_min 10000 1000 1.63 1.66 36.6 NaN NaN 10000 0 5.31ms
3 which_1 10000 1000 53.0 55.9 1.16 Inf Inf 9977 23 167.24ms
4 position 10000 1000 2.86 3.11 20.0 NaN NaN 10000 0 9.71ms
5 non_na 10000 1000 1 1 51.3 Inf NaN 10000 0 3.79ms
6 rcpp 10000 1000 1.06 1.07 47.3 NaN NaN 10000 0 4.11ms
7 base_r 100000 1000 326. 353. 1.00 Inf Inf 3449 83 397.84ms
8 which_min 100000 1000 1.73 1.73 200. NaN NaN 10000 0 5.79ms
9 which_1 100000 1000 315. 358. 1 Inf Inf 3366 82 389.22ms
10 position 100000 1000 7.68 8.26 41.4 NaN Inf 9999 1 27.96ms
11 non_na 100000 1000 2.35 2.52 137. NaN NaN 10000 0 8.42ms
12 rcpp 100000 1000 1 1 345. NaN NaN 10000 0 3.35ms
13 base_r 1000000 1000 1257. 1514. 1.00 Inf Inf 223 82 302.37ms
14 which_min 1000000 1000 2.76 2.73 629. NaN NaN 10000 0 21.6ms
15 which_1 1000000 1000 1278. 1491. 1 Inf Inf 226 76 306.97ms
16 position 1000000 1000 634. 643. 2.68 NaN Inf 917 23 463.92ms
17 non_na 1000000 1000 134. 143. 12.0 NaN Inf 3787 41 429.6ms
18 rcpp 1000000 1000 1 1 1720. NaN NaN 10000 0 7.9ms
19 base_r 10000000 1000 10613. 11683. 1 Inf Inf 25 25 518.79ms
20 which_min 10000000 1000 3.23 3.23 3493. NaN NaN 10000 0 59.41ms
21 which_1 10000000 1000 10445. 10436. 1.10 Inf Inf 27 27 510.68ms
22 position 10000000 1000 882. 921. 11.1 NaN Inf 269 22 500.75ms
23 non_na 10000000 1000 189. 205. 49.8 NaN Inf 1201 42 500.07ms
24 rcpp 10000000 1000 1 1 11100. NaN NaN 10000 0 18.7ms
25 base_r 100000000 1000 39833. 39765. 1.01 Inf Inf 3 6 552.89ms
26 which_min 100000000 1000 10.3 10.3 3955. NaN NaN 10000 0 468.35ms
27 which_1 100000000 1000 40767. 40599. 1 Inf Inf 3 6 555.68ms
28 position 100000000 1000 1020. 1095. 34.2 NaN Inf 93 21 503.13ms
29 non_na 100000000 1000 211. 230. 164. NaN Inf 444 43 500.34ms
30 rcpp 100000000 1000 1 1 40816. NaN NaN 10000 0 45.38ms
31 base_r 10000 100000 57.8 74.9 1 Inf Inf 9998 2 295.12ms
32 which_min 10000 100000 1.53 1.54 60.2 NaN NaN 10000 0 4.9ms
33 which_1 10000 100000 47.3 50.9 1.93 Inf Inf 9998 2 153ms
34 position 10000 100000 2.85 2.97 32.3 NaN NaN 10000 0 9.13ms
35 non_na 10000 100000 1 1 92.9 NaN NaN 10000 0 3.18ms
36 rcpp 10000 100000 1.06 1.07 85.3 NaN NaN 10000 0 3.46ms
37 base_r 100000 100000 573. 593. 1 Inf Inf 2640 5 489.41ms
38 which_min 100000 100000 1.55 1.55 379. NaN NaN 10000 0 4.9ms
39 which_1 100000 100000 453. 479. 1.31 Inf Inf 3453 7 488.38ms
40 position 100000 100000 2.87 2.98 202. NaN NaN 10000 0 9.16ms
41 non_na 100000 100000 1 1 578. NaN NaN 10000 0 3.21ms
42 rcpp 100000 100000 1.06 1.06 578. NaN NaN 10000 0 3.21ms
43 base_r 1000000 100000 5830. 6048. 1 Inf Inf 244 5 482.78ms
44 which_min 1000000 100000 1.48 1.49 3903. NaN NaN 10000 0 5.07ms
45 which_1 1000000 100000 5717. 5937. 1.06 Inf Inf 261 4 487.61ms
46 position 1000000 100000 13.0 13.3 463. NaN Inf 9998 2 42.69ms
47 non_na 1000000 100000 3.48 3.65 1660. NaN NaN 10000 0 11.92ms
48 rcpp 1000000 100000 1 1 5784. NaN Inf 9999 1 3.42ms
49 base_r 10000000 100000 40802. 40182. 1 Inf Inf 24 7 367.53ms
50 which_min 10000000 100000 1.85 1.83 22655. NaN NaN 10000 0 6.76ms
51 which_1 10000000 100000 40334. 40123. 1.02 Inf Inf 23 10 344.07ms
52 position 10000000 100000 92.3 91.3 471. NaN Inf 9984 16 324.88ms
53 non_na 10000000 100000 20.3 21.1 2043. NaN Inf 9993 7 74.92ms
54 rcpp 10000000 100000 1 1 42212. NaN NaN 10000 0 3.63ms
55 base_r 100000000 100000 374396. 366226. 1.05 Inf Inf 3 6 548.73ms
56 which_min 100000000 100000 4.40 4.35 88633. NaN NaN 10000 0 21.62ms
57 which_1 100000000 100000 373680. 366947. 1 Inf Inf 3 6 574.97ms
58 position 100000000 100000 359. 364. 1029. NaN Inf 2685 23 500.11ms
59 non_na 100000000 100000 75.7 80.2 4240. NaN Inf 10000 38 452.07ms
60 rcpp 100000000 100000 1 1 387345. NaN NaN 10000 0 4.95ms
61 base_r 10000 1000000 59.1 63.5 1 Inf Inf 9998 2 210.55ms
62 which_min 10000 1000000 1.55 1.55 42.9 NaN NaN 10000 0 4.91ms
63 which_1 10000 1000000 45.6 50.6 1.42 Inf Inf 9998 2 148.62ms
64 position 10000 1000000 2.85 2.94 23.3 NaN NaN 10000 0 9.03ms
65 non_na 10000 1000000 1 1 66.8 NaN NaN 10000 0 3.15ms
66 rcpp 10000 1000000 1.06 1.06 64.7 NaN NaN 10000 0 3.25ms
67 base_r 100000 1000000 538. 562. 1 Inf Inf 2801 6 489.45ms
68 which_min 100000 1000000 1.46 1.55 247. NaN NaN 10000 0 7.08ms
69 which_1 100000 1000000 432. 456. 1.22 Inf Inf 3404 7 488.07ms
70 position 100000 1000000 2.71 2.91 185. NaN NaN 10000 0 9.44ms
71 non_na 100000 1000000 1 1 520. NaN NaN 10000 0 3.36ms
72 rcpp 100000 1000000 1.06 1.07 504. NaN NaN 10000 0 3.47ms
73 base_r 1000000 1000000 5571. 5799. 1 Inf Inf 260 6 481.33ms
74 which_min 1000000 1000000 1.54 1.52 3729. NaN NaN 10000 0 4.96ms
75 which_1 1000000 1000000 4557. 4637. 1.29 Inf Inf 336 7 481.23ms
76 position 1000000 1000000 2.84 2.89 2004. NaN NaN 10000 0 9.24ms
77 non_na 1000000 1000000 1 1 5494. NaN NaN 10000 0 3.37ms
78 rcpp 1000000 1000000 1.06 1.02 5732. NaN NaN 10000 0 3.23ms
79 base_r 10000000 1000000 69156. 69066. 1 Inf Inf 17 5 377.72ms
80 which_min 10000000 1000000 1.46 1.47 43589. NaN NaN 10000 0 5.1ms
81 which_1 10000000 1000000 67884. 66517. 1.05 Inf Inf 19 5 400.27ms
82 position 10000000 1000000 4.03 4.16 15632. NaN NaN 10000 0 14.21ms
83 non_na 10000000 1000000 1.32 1.31 44276. NaN NaN 10000 0 5.02ms
84 rcpp 10000000 1000000 1 1 66408. NaN Inf 9999 1 3.35ms
85 base_r 100000000 1000000 662166. 619018. 1.04 Inf Inf 3 4 635.95ms
86 which_min 100000000 1000000 1.90 1.86 326808. NaN NaN 10000 0 6.72ms
87 which_1 100000000 1000000 589090. 557688. 1 Inf Inf 3 3 659.17ms
88 position 100000000 1000000 31.4 31.0 19264. NaN Inf 10000 4 114.06ms
89 non_na 100000000 1000000 7.14 7.59 74081. NaN Inf 10000 2 29.66ms
90 rcpp 100000000 1000000 1 1 62977. NaN Inf 10000 1 34.89ms
Plot of results (log time scale):
13
You can use which.min
and which.max
(without is.na()
), since as said in the manual:
Missing and NaN values are discarded.
set.seed(1)
x <- c(rep(NA, 3), sample(c(T, F), size = 5e7, replace = T))
f1 <- () min(which(!is.na(x)))
f2 <- () which.min(is.na(x))
f3 <- () min(which.min(x), which.max(x), Inf)
microbenchmark(
f1 = f1(),
f2 = f2(),
f3 = f3(),
unit = "relative",
check = "equal"
)
and you will see
Unit: relative
expr min lq mean median uq max neval
f1 931928.5 596652.0 12690.309 31796.371 33326.273 232.24601 100
f2 139656.2 98002.5 2770.152 5879.894 8057.321 84.39669 100
f3 1.0 1.0 1.000 1.000 1.000 1.00000 100
Remark
From the benchmarking, it seems is.na
is an expensive operator from the speed perspective, especially when the size of array scales up
9
Very late to the party but didn’t see anyone mention Position() that does what you are looking for and stops scanning as soon as a match is found:
Position((x) !is.na(x), x) # 4
0
on an Intel Mac, OS 13.5.2
Edit: retest with purely logical input
x <- as.logical(c(NA,NA,NA,rep(1,times=20), rep(NA,times=5),rep(1,times=biglen)))
Rgames> microbenchmark(which_non_na(x),nonNA(x))
Unit: nanoseconds
expr min lq mean median uq max neval
which_non_na(x) 691 735.0 1192.88 898.0 1342.5 16088 100
nonNA(x) 45624 46285.5 47195.47 46569.5 46928.5 102274 100
Dunno how c++ would fare if it ran on numeric inputs but didn’t declare “logical.
nonNA <- function(x) {
for(jn in 1:length(x)) {
if(!is.na(x[jn])) return( jn)
}
return('not found')
}
biglen=1e6
x <- c(NA,NA,NA,rep(1,times=20), rep(NA,times=5),rep(1,times=biglen))
microbenchmark(which_non_na(x),nonNA(x))
Unit: microseconds
expr min lq mean median uq max neval
which_non_na(x) 1432.809 1596.131 2398.95178 1606.1205 1615.342 11521.484 100
nonNA(x) 46.436 47.125 55.06929 47.8625 56.734 111.771 100
biglen=1e8
x <- c(NA,NA,NA,rep(1,times=20), rep(NA,times=5),rep(1,times=biglen))
microbenchmark(which_non_na(x),nonNA(x))
Unit: microseconds
expr min lq mean median uq max neval
which_non_na(x) 322243.196 327702.4310 349913.34584 353146.7080 370955.373 385839.443 100
nonNA(x) 46.642 48.3305 86.83075 100.3525 115.982 256.817 100
4
A little faster than using min
and base R
which(!is.na(x))[1]
finds the first non-NA value from 100,000,000 values in less than a second, even on a crappy cpu.
7