I am trying to count the number of mismatches in values between two rows (two samples), going column by column in R. I want to only count mismatches that don’t include an NA. Here is some example data:
Sample1 0 1 1 2 NA 2 1 1
Sample1b 1 1 1 2 1 1 1 1
Sample2 2 2 1 0 0 NA 0 0
Sample2b 2 2 1 1 0 NA 0 0
I'd like to do this in groups of two rows (comparing Sample1 to Sample1b, then Sample2 to Sample2b, and so forth). In this example, Sample1 vs Sample1b would have 2/7 mismatches and Sample2 vs Sample2b would have just 1/7 (the NA comparisons are removed from the denominator).
I’ve tried using an apply over columns, comparing each row, but haven’t gotten very far.
You can try
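rev(
  stack(
    lapply(
      # group rows by name with trailing non-digits removed ("Sample1b" -> "Sample1")
      split(df, sub("\\D+$", "", row.names(df))),
      # drop columns with an NA in either row, then take the share of unequal pairs
      \(x) mean(Reduce(`!=`, as.data.frame(na.omit(t(x)))))
    )
  )
)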
which gives
ind values
1 Sample1 0.2857143
2 Sample2 0.1428571
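To see where the 0.2857143 comes from, here is a small sketch that unrolls the same steps for the first pair only. The names `pairs`, `mismatch`, `n_mismatch` and `n_compared` are just illustrative, and the data frame is rebuilt here with only the two Sample1 rows so the snippet runs on its own.

```r
# Only the Sample1 / Sample1b rows from the example data
df <- data.frame(
  V1 = c(0L, 1L), V2 = c(1L, 1L), V3 = c(1L, 1L), V4 = c(2L, 2L),
  V5 = c(NA, 1L), V6 = c(2L, 1L), V7 = c(1L, 1L), V8 = c(1L, 1L),
  row.names = c("Sample1", "Sample1b")
)

# Transpose so each original column becomes a row, then drop rows
# (original columns) where either sample is NA
pairs <- na.omit(t(df))

# Compare the two samples position by position
mismatch <- pairs[, 1] != pairs[, 2]

n_mismatch <- sum(mismatch)     # number of mismatching positions
n_compared <- length(mismatch)  # positions left after removing NAs

mean(mismatch)                  # same as n_mismatch / n_compared
```

The mean of a logical vector is the share of TRUE values, so `mean(mismatch)` is the mismatch count divided by the number of NA-free comparisons.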
data
> dput(df)
structure(list(V1 = c(0L, 1L, 2L, 2L), V2 = c(1L, 1L, 2L, 2L),
V3 = c(1L, 1L, 1L, 1L), V4 = c(2L, 2L, 0L, 1L), V5 = c(NA,
1L, 0L, 0L), V6 = c(2L, 1L, NA, NA), V7 = c(1L, 1L, 0L, 0L
), V8 = c(1L, 1L, 0L, 0L)), class = "data.frame", row.names = c("Sample1",
"Sample1b", "Sample2", "Sample2b"))
> df
V1 V2 V3 V4 V5 V6 V7 V8
Sample1 0 1 1 2 NA 2 1 1
Sample1b 1 1 1 2 1 1 1 1
Sample2 2 2 1 0 0 NA 0 0
Sample2b 2 2 1 1 0 NA 0 0
An approach using pivot_longer, then creating the group and filtering by NA, and finally using summarize on the result and paste for the output.
library(dplyr)
library(tidyr)
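df %>%
  pivot_longer(-V1) %>%
  # group by sample name up to its last digit ("Sample1b" -> "Sample1") and by column
  group_by(grp = sub("(.*\\d+).*", "\\1", V1), name) %>%
  # drop columns where either sample of the pair is NA
  filter(!any(is.na(value))) %>%
  summarize(mismatch = value[1] != value[2]) %>%
  ungroup() %>%
  summarize(mismatch = paste0(sum(mismatch), "/", n()), .by = grp)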
output
# A tibble: 2 × 2
grp mismatch
<chr> <chr>
1 Sample1 2/7
2 Sample2 1/7
Data
df <- structure(list(V1 = c("Sample1", "Sample1b", "Sample2", "Sample2b"
), V2 = c(0L, 1L, 2L, 2L), V3 = c(1L, 1L, 2L, 2L), V4 = c(1L,
1L, 1L, 1L), V5 = c(2L, 2L, 0L, 1L), V6 = c(NA, 1L, 0L, 0L),
V7 = c(2L, 1L, NA, NA), V8 = c(1L, 1L, 0L, 0L), V9 = c(1L,
1L, 0L, 0L)), class = "data.frame", row.names = c(NA, -4L
))
I would do the obvious
vapply(split(df0, sub("\\D+$", "", row.names(df0))),
       \(x) mean(x[1L, ] != x[2L, ], na.rm = TRUE),
       numeric(1L)) |> MASS::fractions()
which, obviously, assumes that there are always exactly two rows per sample group (1,2,3,…,n, n in natural numbers).
giving
Sample1 Sample2
2/7 1/7
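Since the code relies on exactly two rows per group, it may be worth checking that before running it. A minimal sketch, using the same `df0` and the same grouping regex; the names `grp` and `sizes` are just illustrative:

```r
df0 <- structure(list(V1 = c(0L, 1L, 2L, 2L), V2 = c(1L, 1L, 2L, 2L),
                      V3 = c(1L, 1L, 1L, 1L), V4 = c(2L, 2L, 0L, 1L),
                      V5 = c(NA, 1L, 0L, 0L), V6 = c(2L, 1L, NA, NA),
                      V7 = c(1L, 1L, 0L, 0L), V8 = c(1L, 1L, 0L, 0L)),
                 class = "data.frame",
                 row.names = c("Sample1", "Sample1b", "Sample2", "Sample2b"))

# Group label: row name with trailing non-digits removed
grp <- sub("\\D+$", "", row.names(df0))

# Number of rows per group
sizes <- table(grp)

# Stop early if any group is not a pair
stopifnot(all(sizes == 2L))
```

If a group had one or three rows, `stopifnot()` would raise an error instead of letting the comparison silently use only rows 1 and 2.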
Note
Data used
df0 = structure(list(V1 = c(0L, 1L, 2L, 2L), V2 = c(1L, 1L, 2L, 2L),
V3 = c(1L, 1L, 1L, 1L), V4 = c(2L, 2L, 0L, 1L),
V5 = c(NA,1L, 0L, 0L), V6 = c(2L, 1L, NA, NA),
V7 = c(1L, 1L, 0L, 0L), V8 = c(1L, 1L, 0L, 0L)),
class = "data.frame", row.names = c("Sample1", "Sample1b", "Sample2", "Sample2b"))
Another solution using rowSums
library(dplyr)
dat |>
  mutate(group = gsub("\\D+$", "", V1)) |>
  # copy the first row of each group into helper columns V2. to V9.
  mutate(.by = group, across(V2:V9, \(x) first(x), .names = "{.col}.")) |>
  mutate(mismatch = rowSums(across(V2:V9) != across(V2.:V9.), na.rm = TRUE),
         denom = rowSums(across(V2.:V9., \(x) !is.na(x)), na.rm = TRUE),
         result = as.character(MASS::fractions(mismatch / denom))) |>
  # the first row of each group compares with itself (0 mismatches), so drop it
  filter(result != 0) |>
  select(group, result)
# output
group result
1 Sample1 2/7
2 Sample2 1/7
data
dat <- read.table(text="Sample1 0 1 1 2 NA 2 1 1
Sample1b 1 1 1 2 1 1 1 1
Sample2 2 2 1 0 0 NA 0 0
Sample2b 2 2 1 1 0 NA 0 0")