I’m exploring the across() function introduced in recent versions of dplyr, and I’m trying to understand how to use it to apply a custom function that returns multiple columns. Specifically, I want to apply a function that calculates both the mean and standard deviation for selected numeric columns in my data frame and returns these as separate columns.
For example, given the following data frame:
library(dplyr)
df <- data.frame(
  Group = rep(letters[1:3], each = 4),
  Value1 = rnorm(12, mean = 10, sd = 2),
  Value2 = rnorm(12, mean = 5, sd = 1)
)
I want to create a new data frame that includes the mean and standard deviation for each value column, something like this:
  Group Mean_Value1 SD_Value1 Mean_Value2 SD_Value2
1     a       9.812     2.034       4.955     1.085
2     b      10.231     1.987       5.023     0.923
3     c      10.032     2.121       4.998     1.098
I’ve tried the following approach, but I’m not sure how to make it work properly with across():
df_summary <- df %>%
  group_by(Group) %>%
  summarise(across(starts_with("Value"), ~ c(mean = mean(.), sd = sd(.))))
This throws an error because across() doesn’t seem to naturally handle functions that return multiple columns.
My specific questions are:
- How can I modify this approach to properly use across() for functions that return multiple values?
- Is there a better way to achieve this using dplyr or another package in R?
- What are the limitations of across() when dealing with custom functions like this?
Any guidance on how to accomplish this would be greatly appreciated!
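Your question is actually listed as an example on the documentation page of across().
You should pass a named list of functions to across(); each function is applied to every selected column, and the list names become suffixes of the output column names.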
library(dplyr)
df %>%
  group_by(Group) %>%
  summarise(across(starts_with("Value"), list(mean = mean, sd = sd)))
# A tibble: 3 × 5
  Group Value1_mean Value1_sd Value2_mean Value2_sd
  <chr>       <dbl>     <dbl>       <dbl>     <dbl>
1 a            8.61     0.837        5.57     0.581
2 b            8.90     2.08         5.22     0.479
3 c           10.3      1.98         4.36     0.465
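If the exact names from the question (Mean_Value1, SD_Value1, ...) matter, the .names argument of across() may help. The sketch below also shows a variant with a single function that returns a one-row tibble, unpacked with .unpack = TRUE; this assumes dplyr 1.1.0 or later. The seed and the helper name mean_sd are assumptions made for illustration and are not part of the original question.

```r
library(dplyr)

set.seed(42)  # assumed seed, only so the example is reproducible
df <- data.frame(
  Group  = rep(letters[1:3], each = 4),
  Value1 = rnorm(12, mean = 10, sd = 2),
  Value2 = rnorm(12, mean = 5, sd = 1)
)

# Option 1: named list of functions; .names puts the function name first
by_list <- df %>%
  group_by(Group) %>%
  summarise(across(starts_with("Value"),
                   list(Mean = mean, SD = sd),
                   .names = "{.fn}_{.col}"))

# Option 2: one function returning a one-row tibble per group.
# mean_sd is a hypothetical helper; .unpack = TRUE (dplyr >= 1.1.0)
# expands the resulting data-frame column into "{outer}_{inner}" columns.
mean_sd <- function(x) tibble(Mean = mean(x), SD = sd(x))

by_unpack <- df %>%
  group_by(Group) %>%
  summarise(across(starts_with("Value"), mean_sd, .unpack = TRUE))

names(by_list)    # Group, Mean_Value1, SD_Value1, Mean_Value2, SD_Value2
names(by_unpack)  # Group, Value1_Mean, Value1_SD, Value2_Mean, Value2_SD
```

Option 1 is usually enough when each statistic is a separate function; Option 2 is closer to the original attempt, where one function computes several values at once.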
To address "Is there a better way to achieve this using dplyr or another package in R?":
There are a couple of packages providing such grouping functions. If we define “better” as without the use of external packages, we can do:
aggregate(df[grepl("Value", names(df))], df["Group"], \(x) c(Mean = mean(x), SD = sd(x)))
giving
  Group Value1.Mean Value1.SD Value2.Mean Value2.SD
1     a   10.901248  2.365063   4.5826417 0.8582879
2     b    9.358671  2.549811   4.9142623 1.0512226
3     c   11.040255  1.491652   5.2339545 1.0130163
This might be an alternative if the way aggregate() displays column names does not bother you.
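One thing to be aware of: when the function returns more than one value, aggregate() stores each result as a matrix column, so the printed output shows five columns while the data frame actually has three. A minimal sketch of flattening it with do.call(data.frame, ...), using an assumed seed for reproducibility and the illustrative names res and flat:

```r
set.seed(42)  # assumed seed, only so the example is reproducible
df <- data.frame(
  Group  = rep(letters[1:3], each = 4),
  Value1 = rnorm(12, mean = 10, sd = 2),
  Value2 = rnorm(12, mean = 5, sd = 1)
)

res <- aggregate(df[grepl("Value", names(df))], df["Group"],
                 function(x) c(Mean = mean(x), SD = sd(x)))

ncol(res)              # 3: Group plus two matrix columns
is.matrix(res$Value1)  # TRUE

# Expand the matrix columns into ordinary columns
flat <- do.call(data.frame, res)
names(flat)  # Group, Value1.Mean, Value1.SD, Value2.Mean, Value2.SD
```

After flattening, columns such as flat$Value1.Mean can be accessed directly, which is not possible on the raw aggregate() result.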