I have vectors that index the events A, B, C. Each event occurs at least once but can occur multiple times. For example:
- Sequence 1:
c("A", "B", "C")
- Sequence 2:
c("A", "A", "B", "B", "C")
- Sequence 3:
c("B", "A", "C")
- Sequence 4:
c("C", "B", "A")
- Sequence 5:
c("A", "B", "B", "C", "A")
For each sequence, I want to identify events that occur in the following order: A-B-C. I want to index these sequences with a vector of 0/1’s indicating whether or not the matches this condition. There should only be one A or C per sequence, but multiple B’s between them are acceptable. When events are not in the order A-B-C, they do not match the condition. These rules should return the vectors:
- Sequence 1:
c(1,1,1)
- Sequence 2:
c(0,1,1,1,1)
- Sequence 3:
c(0,0,0)
- Sequence 4:
c(0,0,0)
- Sequence 5:
c(1,1,1,1,0)
I’m not sure what general type of problem this falls under so am having trouble searching for a solution. Any suggestions would be appreciated!
Edit: open to either a base R solution applied to each vector (e.g., via lapply
) or a tidyverse solution would be more appropriate so edited my question to include both list of vectors and a data frame with input/desired output
Edit^2: added an additional test case
## sequences
s1<-c("A", "B", "C")
s2<-c("A", "A", "B", "B", "C")
s3<-c("B", "A", "C")
s4<-c("C", "B", "A")
s5<-c("A", "B", "B", "C", "A")
## make into a list
s.list <- list(s1,s2,s3,s4,s5)
## s.list as a data frame
# /questions/57270001/list-to-dataframe-conversion-in-r-keeping-list-indexes
s.df <- s.list %>%
purrr::map(~as_tibble(.)) %>%
dplyr::bind_rows(.id="group")
## solution...
## desired output
# as a list
s.indexed <- list(c(1,1,1),
c(0,1,1,1,1),
c(0,0,0),
c(0,0,0),
c(1,1,1,1,0)
)
s.indexed
# as a data frame
s.df <- s.df %>%
bind_cols(index = unlist(s.indexed))
s.df
An attempt, which relies on more manual rules then I hoped to deal with the allowed duplication of B, but not A or C:
result <- lapply(s.list, (x) {
ox <- ordered(x, levels=c("A","B","C"))
os <- is.unsorted(ox)
out <- rep(0, length(ox))
if(!os) {
out[ox == "A" & (!duplicated(ox, fromLast=TRUE))] <- 1
out[ox == "C" & (!duplicated(ox))] <- 1
out[ox == "B"] <- 1
}
out
})
result
#[[1]]
#[1] 1 1 1
#
#[[2]]
#[1] 0 1 1 1 1
#
#[[3]]
#[1] 0 0 0
#
#[[4]]
#[1] 0 0 0
Checks out:
identical(result, s.indexed)
#[1] TRUE
The base logic is to use an ordered
factor with the desired A-B-C order, which allows for identifying the unsorted vectors first. Then, only the last A and the first C are marked as 1 using duplicated
flagging.
2
This uses str_locate
to find the desired characters/events. It works for one occurrence of the event per entry. It’s possible to extend this to work on any number of occurrences (by using str_locate_all
) but it would add some layers of complexity.
library(stringr)
lapply(s.list, (x){
strng <- paste0(x, collapse="")
loc <- str_locate(strng, "AB+C")
zeros <- rep(0, nchar(strng))
if(!is.na(loc[1])){
zeros[seq(loc[1], loc[2])] <- 1
}
zeros
})
[[1]]
[1] 1 1 1
[[2]]
[1] 0 1 1 1 1
[[3]]
[1] 0 0 0
[[4]]
[1] 0 0 0
[[5]]
[1] 1 1 1 1 0
Using the input list L
shown in the Note at the end this, like another answer, converts the input vectors to strings and acts on them using a regular
expression but uses different functions and arranges it in a pipeline.
The sapply
line converts each vector into a character string; the next
line replaces the target strings with the same number of 1’s; the next
line replaces all other characters with 0’s and the last two lines convert
from a character vector to a list of numeric vectors.
If a character vector result c("111", "01111", ...)
is ok then the last
two lines can be omitted.
library(gsubfn)
L |>
sapply(paste, collapse = "") |>
gsubfn("(AB+C)", ~ strrep("1",nchar(x)), x = _) |>
gsub("[^1]", "0", x = _) |>
strsplit("") |>
lapply(as.numeric)
Note
Input used
L <- list(
c("A", "B", "C") ,
c("A", "A", "B", "B", "C") ,
c("B", "A", "C") ,
c("C", "B", "A"),
c("A", "B", "B", "C", "A")
)
is_next_legal <- function(cur, next_elem) {
(cur == "A" && next_elem == "B") || (cur == "B" && next_elem %in% c("B", "C"))
}
foo <- function(x) {
n <- length(x)
out <- integer(n)
i <- 1L
p <- 1L
while (i <= (n-2L)) {
if (x[i] == "A") {
p <- i
while (p <= (n-1L) && is_next_legal(x[p], x[p+1])) {
p <- p + 1L
}
if (x[p] == "C") out[i:p] <- 1L
}
i <- max(i, p) + 1L
}
out
}
lapply(s.list, foo)
[[1]]
[1] 1 1 1
[[2]]
[1] 0 1 1 1 1
[[3]]
[1] 0 0 0
[[4]]
[1] 0 0 0
[[5]]
[1] 1 1 1 1 0
2