I have a data frame with these courses. And I’m trying to map in a weighted way the families that consume more than one course. However, families that consume only one course end up being mentioned as well. I want to remove this from the table and have it be weighted and in percentage. And I want a column with the frequency without weight next to it. I made this script, it was the furthest I got
library(dplyr)
library(tidyr)
library(writexl)
# Combine columns to create ID_FAMILIA
POF_DESPESA_INDIVIDUAL <- POF_DESPESA_INDIVIDUAL %>%
unite("ID_FAMILY", COD_UPA:NUM_UC, sep = "-", remove = FALSE)
# Filter courses and map to categories
CURSOS_TESTE <- POF_DESPESA_INDIVIDUAL %>%
filter(V9001 %in% c("4901701", "4901702", "4901801", "4901802", "4906201",
"4906001", "4902801", "4904101", "4902802", "4906501",
"4906601", "4906701", "4906801", "4906901", "4907001",
"4907101", "4907201", "4907301", "4907401", "4902501",
"4902601", "4902701", "4909701", "4909801", "4909901",
"4910001", "4910101", "4909401", "4902201", "4900701",
"4901501", "4901401")) %>%
mutate(Categories = case_when(
V9001 == "4901401" ~ "Computer Science",
V9001 == "4901501" ~ "Languages",
V9001 %in% c("4902201", "4909401") ~ "Private Lessons",
V9001 %in% c("4906501", "4906601", "4906701", "4906801", "4906901",
"4907001", "4907101", "4907201", "4907301", "4907401",
"4902501", "4902601", "4902701", "4909701", "4909801",
"4909901", "4910001", "4910101") ~ "Sports",
V9001 %in% c("4901701", "4901702", "4901801", "4901802", "4906201",
"4906001", "4902801", "4904101", "4902802") ~ "Arts",
V9001 == "4900701" ~ "Pre-Vocational",
TRUE ~ "Other"
))
# Aggregate data by family and combine categories
results <- CURSOS_TESTE %>%
group_by(ID_FAMILY) %>%
summarise(Categories = paste(unique(Categories), collapse = ", "),
Total_Weight = sum(WEIGHT_FINAL),
.groups = 'drop') %>%
mutate(Categories = strsplit(Categories, ", ")) %>%
unnest(Categories) %>%
group_by(ID_FAMILY) %>%
summarise(Categories = list(sort(unique(Categories))),
Total_Weight = unique(Total_Weight),
.groups = 'drop') %>%
mutate(Comb = sapply(Categories, function(x) paste(x, collapse = ", ")))
# Calculate total frequencies without weights
frequency_no_weight <- CURSOS_TESTE %>%
group_by(ID_FAMILY, Categories) %>%
summarise(.groups = 'drop') %>%
group_by(Categories) %>%
summarise(Frequency_No_Weight = n_distinct(ID_FAMILY), # Number of distinct families for each category
.groups = 'drop')
# Calculate total weight
total_weight <- sum(results$Total_Weight)
# Aggregate weights by category combination
comb_count <- results %>%
group_by(Comb) %>%
summarise(Frequency_Count = n_distinct(ID_FAMILY),
Total_Weight = sum(Total_Weight),
.groups = 'drop') %>%
left_join(frequency_no_weight, by = c("Comb" = "Categories")) %>%
mutate(Weighted_Percentage = (Total_Weight / total_weight) * 100)
# Print the result
print(comb_count)
# Write the result to an Excel file
write_xlsx(comb_count, "family.xlsx")
the frequency counted is different from the frequency without weight. Which comes in NA, and the percentage is coming with the values of families that take only one course. And it ends up changing the percentage.