I want to create a function named ratio_function that does the same as the following code:
data = data %>%
  group_by(ID) %>%
  summarise(sum_ratio = sum(surface[category == "A"], na.rm = TRUE) /
              sum(total_area[category == "A"], na.rm = TRUE) *
              mean(`MEAN`[category == "A"], na.rm = TRUE))
but called inside summarise, like this:
data = data %>%
  group_by(ID) %>%
  summarise(sum_ratio = ratio_function("A"))
The problem is that surface, total_area and category aren’t recognized as variable names in summarise once they are referenced inside the function.
A function can only use the objects it receives as arguments or finds in the environment where it was defined. summarise makes the columns available to the expression you write inside it, but not to the body of a function called from that expression, so surface, total_area and category cannot be found inside ratio_function. The simplest fix is to pass the columns as arguments, like this:
ratio_function <- function(surface, total_area, MEAN, category, selected_category = "A") {
  # use selected_category in all three terms, not a hard-coded "A"
  sum(surface[category == selected_category], na.rm = TRUE) /
    sum(total_area[category == selected_category], na.rm = TRUE) *
    mean(MEAN[category == selected_category], na.rm = TRUE)
}
data %>%
  group_by(ID) %>%
  summarise(sum_ratio = ratio_function(surface, total_area, MEAN, category, "A"))
In this case, I added the variable names as arguments for the function, so when you call it you can supply a different column for each part of the calculation, for example swapping surface for another column. Because the argument names then no longer match what is passed in, this can become confusing later, and you may want to rename the arguments so that they describe their role in the calculation instead of simply repeating the column names from your data.
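As a rough sketch of that renaming, the same calculation could look like the following. The function name weighted_ratio and the argument names numerator, denominator, weight, group and level are made up for illustration, and the small data frame is example data, not the asker's.

```r
library(dplyr)

# Hypothetical names: each argument describes its role, not a column
weighted_ratio <- function(numerator, denominator, weight, group, level = "A") {
  keep <- group == level  # rows belonging to the requested category
  sum(numerator[keep], na.rm = TRUE) /
    sum(denominator[keep], na.rm = TRUE) *
    mean(weight[keep], na.rm = TRUE)
}

# Example data (assumed for illustration)
data <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2),
  surface = c(50, 30, 20, 70, 60, 80),
  total_area = c(200, 150, 100, 300, 250, 350),
  category = c("A", "A", "B", "A", "B", "A"),
  MEAN = c(1.5, 1.2, 0.8, 2.0, 1.0, 1.8)
)

result <- data %>%
  group_by(ID) %>%
  summarise(sum_ratio = weighted_ratio(surface, total_area, MEAN, category, "A"))
```

The call site still names the actual columns, so it stays readable which column plays which role.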
If it’s about the result rather than the method, what about:
library(dplyr)
## some play data:
data <-
data.frame(category = gl(3, 5, labels = LETTERS[1:3]),
surface = runif(15, 0, 10),
total_area = runif(15, 0, 30),
MEAN = runif(15, 15, 30)
)
## > head(data)
## category surface total_area MEAN
## 1 A 8.665776 3.560259 16.88902
## 2 A 9.116400 7.484434 20.31923
## 3 A 8.628712 28.325483 25.01351
standard {dplyr} procedure:
data |>
  summarise(sum_ratio = sum(surface, na.rm = TRUE) /
              sum(total_area, na.rm = TRUE) *
              mean(MEAN, na.rm = TRUE),
            .by = category) |>
  filter(category == 'A')
## category sum_ratio
## 1 A 12.02382
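The play data above has no ID column, so the result is one row per category. If the per-ID result from the question is wanted, one possible variant is to filter first and then group by ID. This is only a sketch on made-up example data, and it assumes dplyr 1.1.0 or later for the .by argument:

```r
library(dplyr)

# Example data with an ID column (assumed for illustration)
data <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2),
  surface = c(50, 30, 20, 70, 60, 80),
  total_area = c(200, 150, 100, 300, 250, 350),
  category = c("A", "A", "B", "A", "B", "A"),
  MEAN = c(1.5, 1.2, 0.8, 2.0, 1.0, 1.8)
)

result <- data |>
  filter(category == "A") |>          # keep only the category of interest
  summarise(
    sum_ratio = sum(surface, na.rm = TRUE) /
      sum(total_area, na.rm = TRUE) *
      mean(MEAN, na.rm = TRUE),
    .by = ID                          # one row per ID
  )
```

One difference from the code in the question: an ID with no "A" rows is dropped by the filter here, whereas subsetting inside summarise keeps that ID with a NaN result.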
If you do not want to pass the names of the other relevant columns one by one to the function, you would have to pass the entire dataframe to work on:
library(tidyverse)
# generate data
data <- tribble(
~ID, ~surface, ~total_area, ~category, ~MEAN,
1,50,200,"A",1.5,
1,30,150,"A",1.2,
1,20,100,"B",0.8,
2,70,300,"A",2.0,
2,60,250,"B",1.0,
2,80,350,"A",1.8,
3,40,180,"A",1.4,
3,20,90,"A",1.1,
3,30,130,"B",0.9,
4,55,220,"A",1.6,
4,45,180,"A",1.3,
4,25,90,"B",0.7
)
# old approach
data |>
group_by(ID) |>
summarise(sum_ratio = sum(surface[category == "A"], na.rm = T) / sum(total_area[category == "A"], na.rm = T) *
mean(`MEAN`[category == "A"], na.rm = T))
#> # A tibble: 4 × 2
#> ID sum_ratio
#> <dbl> <dbl>
#> 1 1 0.309
#> 2 2 0.438
#> 3 3 0.278
#> 4 4 0.363
# define function
ratio_function <- function(df, selected_category) {
  # rows of the current group that belong to the requested category
  keep <- df$category == selected_category
  sum(df$surface[keep], na.rm = TRUE) / sum(df$total_area[keep], na.rm = TRUE) *
    mean(df$MEAN[keep], na.rm = TRUE)
}
# new approach
data |>
group_by(ID) |>
summarize(new = ratio_function(pick(everything()), "A"))
#> # A tibble: 4 × 2
#> ID new
#> <dbl> <dbl>
#> 1 1 0.309
#> 2 2 0.438
#> 3 3 0.278
#> 4 4 0.363
Created on 2024-07-19 with reprex v2.1.1
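A possible refinement, sketched here on a shortened copy of the example data: pick() also accepts column names, so the function can be handed only the four columns it reads rather than everything(). The selected_category argument name is an assumption of this sketch.

```r
library(dplyr)

# Shortened example data (first two IDs only)
data <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2),
  surface = c(50, 30, 20, 70, 60, 80),
  total_area = c(200, 150, 100, 300, 250, 350),
  category = c("A", "A", "B", "A", "B", "A"),
  MEAN = c(1.5, 1.2, 0.8, 2.0, 1.0, 1.8)
)

# df is the per-group data frame supplied by pick()
ratio_function <- function(df, selected_category) {
  keep <- df$category == selected_category
  sum(df$surface[keep], na.rm = TRUE) /
    sum(df$total_area[keep], na.rm = TRUE) *
    mean(df$MEAN[keep], na.rm = TRUE)
}

result <- data |>
  group_by(ID) |>
  summarise(
    ratio_a = ratio_function(pick(surface, total_area, MEAN, category), "A"),
    ratio_b = ratio_function(pick(surface, total_area, MEAN, category), "B")
  )
```

Naming the columns in pick() makes it visible at the call site which columns the function depends on, at the cost of listing them again.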