I am trying to identify the largest overlaps of features among samples. I have a data.frame with n samples, and every sample has the same i features. The column state indicates whether a feature is available (1) or not (0) for a given sample:
set.seed(1234)
n = 20 # samples
i = 20 # features
dat <- data.frame(
  sample = rep(x = paste0("s", 1:n), each = i),
  feature = rep(x = paste0("f", 1:i), times = n),
  state = sample(x = c(0, 1), size = n * i, replace = TRUE)
)
My goal is to determine the number of samples (and their IDs) that share the same set of 5 to 10 features. I've tried something like:
# Count, for each single feature, the samples in which it is available
dat |>
  dplyr::filter(state != 0) |>
  dplyr::group_by(feature) |>
  dplyr::summarize(n = dplyr::n(), .groups = "drop")
However, this only counts samples per individual feature, so I do not get the counts for all possible feature combinations. I was therefore thinking of an analysis comparable to an UpSet plot. How could I implement this?
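To make the goal concrete, here is a minimal brute-force sketch in base R of the kind of output I am after, for one combination size k at a time. The helper name shared_feature_sets and the min_samples cutoff are hypothetical, introduced only for illustration, and the sketch assumes state is coded 0/1 with exactly one row per sample/feature pair:

```r
# Hypothetical helper (not from any package): for every combination of k
# features, list the samples in which all k features have state == 1.
shared_feature_sets <- function(dat, k, min_samples = 2) {
  # Logical sample x feature matrix; assumes one row per sample/feature pair
  m <- unclass(xtabs(state ~ sample + feature, data = dat)) > 0
  combos <- combn(colnames(m), k, simplify = FALSE)
  # For each combination, the ids of the samples that have all k features
  ids <- lapply(combos, function(f) {
    rownames(m)[rowSums(m[, f, drop = FALSE]) == k]
  })
  res <- data.frame(
    features  = vapply(combos, paste, character(1), collapse = ","),
    n_samples = lengths(ids),
    samples   = vapply(ids, paste, character(1), collapse = ",")
  )
  # Keep combinations shared by at least min_samples samples, largest first
  res <- res[res$n_samples >= min_samples, , drop = FALSE]
  res[order(-res$n_samples), , drop = FALSE]
}

# Small toy input to illustrate the output shape
toy <- data.frame(
  sample  = rep(c("a", "b", "c"), each = 3),
  feature = rep(c("f1", "f2", "f3"), times = 3),
  state   = c(1, 1, 0,  1, 1, 1,  0, 1, 1)
)
res <- shared_feature_sets(toy, k = 2)
print(res)

# On the data above this could be run per size, e.g.
# lapply(5:10, function(k) shared_feature_sets(dat, k))
```

With 20 features there are 15,504 combinations for k = 5 and 184,756 for k = 10, so enumerating everything like this is probably too slow to be a real solution. It is only meant to show the result I would like to obtain.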