These are some example vectors to reproduce:
a <- c(14,26,38,64,96,127,152,152,152,152,152,152)
b <- c(4,7,9,13,13,13,13,13,13,13,13,13,13,13)
c <- c(62,297,297,297,297,297,297,297,297,297,297,297)
It is clear that at some point a certain value repeats until the end. I need the exact index where this value appears for the first time.
So in this case the output would be 7, 4, 2, since in a, 152 starts at the 7th position; in b, 13 starts at the 4th position; and in c, 297 starts at the 2nd position.
I hope this is clear.
Does anybody have a hint on how to get this automatically?
Edit: the data is always increasing, and once it starts repeating, the repetition continues until the end. In this kind of analysis, at least the last two values will always be repeated.
You could use rle() to compute the run-length encoding, then sum the lengths of every run except the final one:
get_index <- \(x) sum(head(rle(x)$lengths, -1)) + 1
sapply(list(a, b, c), get_index)
# [1] 7 4 2
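To see why this works, it may help to look at the intermediate values for a. This is only an illustrative walk-through; the names runs and idx are made up for the example.

```r
a <- c(14, 26, 38, 64, 96, 127, 152, 152, 152, 152, 152, 152)

# rle() collapses consecutive equal values into runs
runs <- rle(a)
runs$lengths                    # six runs of length 1, then one run of length 6

# Drop the final run and count how many elements come before it
before_last_run <- head(runs$lengths, -1)

# The final run starts one position after those elements
idx <- sum(before_last_run) + 1
idx
```

The sum counts every element that precedes the final run, so adding 1 lands on the first element of that run.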
Rcpp solution
If your vectors are really long and the last value is only repeated towards the end, you don’t need to check the length of every run, so the above will be inefficient. It’s better to start from the end of the vector and work backwards until you find a different value:
Rcpp::cppFunction('
int get_index2(NumericVector x) {
  int n = x.size();
  double last_value = x[n - 1];
  for (int i = n - 2; i >= 0; --i) {
    if (x[i] != last_value) {
      return i + 2; // +1 as it is next element; +1 for 1-indexing
    }
  }
  return 1; // all elements are the same
}
')
sapply(list(a,b,c), get_index2)
# [1] 7 4 2
data.table solution
Given your update to the question, another way to approach this would be:
sapply(list(a,b,c), data.table::uniqueN)
# [1] 7 4 2
This is not conceptually different from the nice answer by zx8754. With vectors of this size it is unlikely to be meaningfully faster and could even be slower; for very large vectors, however, it is faster.
If you know the last value is the repeated value, you can pass it to match(), which returns the index of the first match:
first <- \(x) match(x[length(x)], x)
sapply(list(a, b, c), first)
# 7 4 2
If you’re looking for the first pair of equal successive values, you can use diff() and which():
first_conseq <- \(x) which(diff(x) == 0)[1]
sapply(list(a, b, c), first_conseq)
# 7 4 2
By default, diff() returns the difference between successive values. If two values are the same, their difference will be 0. which() returns the indices of all TRUE values in a logical vector, so we use [1] to take the first one.
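As a rough illustration of those steps, here are the intermediate values for b. The variable names steps, is_flat and first_flat are only for this example.

```r
b <- c(4, 7, 9, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13)

# Differences between neighbours: 3 2 4, then zeros
steps <- diff(b)

# TRUE wherever an element equals the one that follows it
is_flat <- steps == 0

# Position of the first TRUE; element i of diff(b) compares b[i] with b[i + 1],
# so this is also the index in b where the repeated value first appears
first_flat <- which(is_flat)[1]
first_flat
```

Note that diff(b) is one element shorter than b, which is why no offset is needed here.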
As clarified by OP, if the data is always increasing and only the last value is duplicated, we just need to count the unique values:
lengths(lapply(list(a, b, c), unique))
# [1] 7 4 2
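This shortcut depends on that assumption. A small sketch of what happens when an earlier value also repeats (count_unique and the vector x are hypothetical names for this example):

```r
count_unique <- function(x) length(unique(x))

# Strictly increasing until the final plateau: the count equals the index
a <- c(14, 26, 38, 64, 96, 127, 152, 152, 152, 152, 152, 152)
count_a <- count_unique(a)

# An earlier repeat (the two 2s) breaks the shortcut
x <- c(1, 2, 2, 3, 4, 5, 5, 5)
count_x <- count_unique(x)          # counts 1, 2, 3, 4, 5
true_x  <- match(x[length(x)], x)   # first position of the last value

c(count_a = count_a, count_x = count_x, true_x = true_x)
```

For x the unique count is one less than the true position, so the approach should only be used when the data really is strictly increasing before the final run.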
Another base R solution:
f <- \(x) (length(x) - which.max(rev(x) != x[length(x)]) + 1L)%%length(x) + 1L
I’ll compare it to the other options along with some benchmarking. Tossing in a couple edge cases:
a <- c(14,26,38,64,96,127,152,152,152,152,152,152)
b <- c(4,7,9,13,13,13,13,13,13,13,13,13,13,13)
c <- c(62,297,297,297,297,297,297,297,297,297,297,297)
d <- numeric(12)
e <- 1:14
Testing the proposed answers, including the edge cases:
get_index <- \(x) sum(head(rle(x)$lengths, -1)) + 1L
Edward <- \(a) length(a) - min(which(diff(rev(a))!=0)) + 1L
first_conseq <- \(x) which(diff(x) == 0)[1]
sapply(list(a, b, c, d, e), f)
#> [1] 7 4 2 1 14
sapply(list(a, b, c, d, e), get_index)
#> [1] 7 4 2 1 14
sapply(list(a, b, c, d, e), Edward)
#> Warning in min(which(diff(rev(a)) != 0)): no non-missing arguments to min;
#> returning Inf
#> [1] 7 4 2 -Inf 14
sapply(list(a, b, c, d, e), first_conseq)
#> [1] 7 4 2 1 NA
And SamR’s Rcpp function (modified slightly for speed):
Rcpp::cppFunction('
int get_index2(const NumericVector& x) {
  const int n = x.size();
  const double last_value = x[n - 1];
  for (int i = n - 2; i >= 0; --i) {
    if (x[i] != last_value) {
      return i + 2; // +1 as it is next element; +1 for 1-indexing
    }
  }
  return 1; // all elements are the same
}
')
sapply(list(a, b, c, d, e), get_index2)
#> [1] 7 4 2 1 14
Only f and the get_index functions behave well with the edge cases.
Benchmarking with a larger dataset:
n <- sample(1e5, 1e3, 1)
x <- lapply(n, \(n) c(sample(1e4, n, 1), 0L, sample(1e5 - n, 1))[-1:-2])
identical(n, vapply(x, f, 0L))
#> [1] TRUE
bench::mark(
  f = vapply(x, f, 0L),
  get_index = vapply(x, get_index, 0L),
  get_index2 = vapply(x, get_index2, 0L)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 f 306.6ms 316.29ms 3.16 580.07MB 12.6
#> 2 get_index 2.46s 2.46s 0.406 4.91GB 13.8
#> 3 get_index2 62.4ms 67.14ms 14.6 404.34MB 42.0
You can try
f <- \(x) {
  length(x) - which.min(replace(rev(duplicated(x, fromLast = TRUE)), 1, TRUE)) + 2
}
such that
> lapply(list(a, b, c), f)
[[1]]
[1] 7
[[2]]
[1] 4
[[3]]
[1] 2
Since the data is always increasing and once it starts repeating it continues until the end, you can simply do:
min(which(diff(a)==0))
#[1] 7
sapply(list(a, b, c), \(x) min(which(diff(x)==0)))
# [1] 7 4 2
If the last condition is relaxed, you can reverse the vector and use diff to find the first non-zero difference.
length(a) - min(which(diff(rev(a))!=0)) + 1
# [1] 7
x <- c(1,2,2,3,4,5,5,5,5,5,5)
length(x) - min(which(diff(rev(x))!=0)) + 1
#[1] 6
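A possible guarded variant, sketched here under the assumption that a constant vector should return 1: min() on an empty which() result returns Inf with a warning, so the empty case is handled explicitly. The name last_run_start is hypothetical.

```r
last_run_start <- function(x) {
  # Positions in the reversed vector where the value changes
  changes <- which(diff(rev(x)) != 0)
  # No change at all: every element is the same, so the run starts at 1
  if (length(changes) == 0) return(1L)
  # which() returns ascending indices, so changes[1] is the first change
  length(x) - changes[1] + 1L
}

x <- c(1, 2, 2, 3, 4, 5, 5, 5, 5, 5, 5)
last_run_start(x)            # start of the final run of 5s
last_run_start(numeric(12))  # constant vector
last_run_start(1:14)         # no repetition: the last element is its own run
```

For x, the reversed vector starts with six 5s, so the first change is at position 6 and the result is 11 - 6 + 1.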
Another base R solution. Applying duplicated gives a logical vector whose first TRUE value sits at the target index plus 1; which extracts that index. I’ve added the edge cases considered by @jblood94 above. Although these cases are not part of the OP’s question, it seems the function should return NA when there are no repeats.
a <- c(14, 26, 38, 64, 96, 127, 152, 152, 152, 152, 152, 152)
b <- c(4, 7, 9, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13)
c <- c(62, 297, 297, 297, 297, 297, 297, 297, 297, 297, 297, 297)
d <- 12
e <- 1:14
pull_index <- \(x) which(duplicated(x))[1] - 1
sapply(list(a, b, c, d, e), pull_index)
# [1] 7 4 2 NA NA
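For illustration, the intermediate steps on a shortened version of c look roughly like this (v, dup, idx and none are names invented for the example):

```r
v <- c(62, 297, 297, 297, 297)

# duplicated() flags every element that already occurred earlier
dup <- duplicated(v)            # FALSE FALSE TRUE TRUE TRUE

# The first flagged element is the second copy of the repeated value,
# so stepping back by one gives its first occurrence
idx <- which(dup)[1] - 1
idx

# With no repeats, which() is empty and indexing with [1] yields NA
none <- which(duplicated(1:5))[1] - 1
none
```

Like the unique-count approach, this assumes no value repeats before the final run.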