I’m trying to find a solution in base
R
that can calculate the median by groups based on two columns in addition to groups by the occurrence of multiple ids.
For example:
I have a data.frame
df = data.frame(id = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,17,18,18,19,19,20,20),
group = c("A","A","A","A","A","A","A","A","B","B","B","B","C","C","C","C","A","B","A","B","A","B","A","B"),
num = c(0.1,0.1,0.1,0.1,0.2,0.2,0.2,0.2,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1),
value = c(sample(10:90, 16, replace = TRUE), rep(c(21,67,80,69), times=1, each=2)))
> df
id group num value
1 1 A 0.1 59
2 2 A 0.1 72
3 3 A 0.1 82
4 4 A 0.1 17
5 5 A 0.2 39
6 6 A 0.2 46
7 7 A 0.2 39
8 8 A 0.2 56
9 9 B 0.1 31
10 10 B 0.1 46
11 11 B 0.1 63
12 12 B 0.1 15
13 13 C 0.1 51
14 14 C 0.1 68
15 15 C 0.1 48
16 16 C 0.1 28
17 17 A 0.1 21
18 17 B 0.1 21
19 18 A 0.1 67
20 18 B 0.1 67
21 19 A 0.1 80
22 19 B 0.1 80
23 20 A 0.1 69
24 20 B 0.1 69
and I’m trying to replace the values in the column value
by the median based on groups from the columns group
and num
, meaning I want the median
for the group A 0.1
, A 0.2
, B 0.1
, C 0.1
and now to the tricky part, in which additionally values are grouped based on multiple occurrences of the id, in that case id
17, 18, 19 and 20 belong to the same group AB 0.1
.
This should lead to a result with the following df
:
> df
id group num value
1 1 A 0.1 65.5
2 2 A 0.1 65.5
3 3 A 0.1 65.5
4 4 A 0.1 65.5
5 5 A 0.2 42.5
6 6 A 0.2 42.5
7 7 A 0.2 42.5
8 8 A 0.2 42.5
9 9 B 0.1 38.5
10 10 B 0.1 38.5
11 11 B 0.1 38.5
12 12 B 0.1 38.5
13 13 C 0.1 49.5
14 14 C 0.1 49.5
15 15 C 0.1 49.5
16 16 C 0.1 49.5
17 17 A 0.1 68.0
18 17 B 0.1 68.0
19 18 A 0.1 68.0
20 18 B 0.1 68.0
21 19 A 0.1 68.0
22 19 B 0.1 68.0
23 20 A 0.1 68.0
24 20 B 0.1 68.0
How can this be accomplished with base
R
without writing a function? Could this even be done in an one-liner
?
First create an additional group by pasting group by id then split
twice by the new group and num.
grp <- unlist(sapply(split(df, df$id), (x)
cbind(x, grp = paste(x$group, collapse=""))$grp))
data.frame(do.call(rbind,
lapply(split(df, grp), (x)
do.call(rbind, lapply(split(x, x$num), (y)
cbind(y[,-4], value = median(y$value)))))),
row.names=NULL)
id group num value
1 1 A 0.1 45.5
2 2 A 0.1 45.5
3 3 A 0.1 45.5
4 4 A 0.1 45.5
5 5 A 0.2 56.0
6 6 A 0.2 56.0
7 7 A 0.2 56.0
8 8 A 0.2 56.0
9 17 A 0.1 68.0
10 17 B 0.1 68.0
11 18 A 0.1 68.0
12 18 B 0.1 68.0
13 19 A 0.1 68.0
14 19 B 0.1 68.0
15 20 A 0.1 68.0
16 20 B 0.1 68.0
17 9 B 0.1 46.5
18 10 B 0.1 46.5
19 11 B 0.1 46.5
20 12 B 0.1 46.5
21 13 C 0.1 64.0
22 14 C 0.1 64.0
23 15 C 0.1 64.0
24 16 C 0.1 64.0
Data
set.seed(1)
df <- data.frame(
id = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,17,18,18,19,19,20,20),
group = c("A","A","A","A","A","A","A","A","B","B","B","B","C","C","C","C","A","B","A","B","A","B","A","B"),
num = c(0.1,0.1,0.1,0.1,0.2,0.2,0.2,0.2,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1),
value = c(sample(10:90, 16, replace = TRUE), rep(c(21,67,80,69), times=1, each=2)))
2
Here is one way to achieve this in base R.
We create a new grouping variable (new_group
) according to our conditions. Using duplicated
twice we identify the id
‘s that repeat and assign them a random grouping number (here I have assigned 999) for the rest we create a sequential grouping number based on their occurrence using match
and unique
.
After this grouping variable is created we can use ave
to calculate median
per group.
val <- paste(df$group, df$num, sep = "-")
df |>
transform(new_group = ifelse(duplicated(id) | duplicated(id, fromLast = TRUE),
999, match(val, unique(val)))) |>
transform(value = ave(value, new_group, FUN = median))
# id group num value new_group
#1 1 A 0.1 65.5 1
#2 2 A 0.1 65.5 1
#3 3 A 0.1 65.5 1
#4 4 A 0.1 65.5 1
#5 5 A 0.2 42.5 2
#6 6 A 0.2 42.5 2
#7 7 A 0.2 42.5 2
#8 8 A 0.2 42.5 2
#9 9 B 0.1 38.5 3
#10 10 B 0.1 38.5 3
#11 11 B 0.1 38.5 3
#12 12 B 0.1 38.5 3
#13 13 C 0.1 49.5 4
#14 14 C 0.1 49.5 4
#15 15 C 0.1 49.5 4
#16 16 C 0.1 49.5 4
#17 17 A 0.1 68.0 999
#18 17 B 0.1 68.0 999
#19 18 A 0.1 68.0 999
#20 18 B 0.1 68.0 999
#21 19 A 0.1 68.0 999
#22 19 B 0.1 68.0 999
#23 20 A 0.1 68.0 999
#24 20 B 0.1 68.0 999
I have purposefully kept the new_group
column so as to understand how the groups were created. If the column is not needed you may add |> subset(select = -new_group)
to the pipe.
A one-liner (if you don’t consider adding a new column part of the code)
> df$newid=ave(df$id,list(df$group,df$num),FUN=min)
> df[ave(df$id,df$id,FUN=length)>1,"newid"]=999
> df$value=ave(df$value,df$newid,FUN=median)
id group num value newid
1 1 A 0.1 71.0 1
2 2 A 0.1 71.0 1
3 3 A 0.1 71.0 1
4 4 A 0.1 71.0 1
5 5 A 0.2 32.0 5
6 6 A 0.2 32.0 5
7 7 A 0.2 32.0 5
8 8 A 0.2 32.0 5
9 9 B 0.1 32.5 9
10 10 B 0.1 32.5 9
11 11 B 0.1 32.5 9
12 12 B 0.1 32.5 9
13 13 C 0.1 58.5 13
14 14 C 0.1 58.5 13
15 15 C 0.1 58.5 13
16 16 C 0.1 58.5 13
17 17 A 0.1 68.0 999
18 17 B 0.1 68.0 999
19 18 A 0.1 68.0 999
20 18 B 0.1 68.0 999
21 19 A 0.1 68.0 999
22 19 B 0.1 68.0 999
23 20 A 0.1 68.0 999
24 20 B 0.1 68.0 999
Note that my results are different than yours since you generate the data randomly without a seed.
3