I’m trying to create a plot in R using ggplot2 where I have two lines representing percentages for two different groups (Male and Female) across different education levels. I want to shade the area between these two lines based on which group has the higher percentage.
However, I’m having trouble getting the shaded area to display correctly. Here’s the code I’m using:
library(tidyverse)
desired_levels <- c("Ninguno", "Educación preescolar",
                    "Educación básica primaria (1 a 5 grado)",
                    "Educación básica primaria (6 a 9 grado)",
                    "Educación Media (10 a 11 grado)",
                    "Tecnológico sin graduar",
                    "Técnico graduado",
                    "Universitario sin graduar",
                    "Universitario graduado",
                    "Tecnológico graduado",
                    "Técnico sin graduar",
                    "Postgrado")
graph1 <- function(data, barrio) {
  # Filtering data
  data <- data %>%
    filter(`¿Cuál es el nivel educativo más alto que esta cursando o ha cursado?` != "No sabe, no informa")
  table <- data %>%
    group_by(`¿Cuál es el nivel educativo más alto que esta cursando o ha cursado?`, Género) %>%
    summarise(n = sum(EXPANSOR), .groups = 'drop') %>%
    group_by(`¿Cuál es el nivel educativo más alto que esta cursando o ha cursado?`) %>%
    mutate(percentage = n / sum(n) * 100) %>%
    ungroup() %>%
    rename(NivelEducativo = `¿Cuál es el nivel educativo más alto que esta cursando o ha cursado?`)
  table$NivelEducativo <- factor(
    table$NivelEducativo,
    levels = desired_levels
  )
  # Ensuring all education levels are represented
  table <- table %>%
    complete(NivelEducativo, Género, fill = list(n = 0, percentage = 0))
  # Creating bounds table for shaded area
  bounds <- table %>%
    pivot_wider(names_from = Género, values_from = percentage) %>%
    mutate(
      ymax = pmax(Masculino, Femenino, na.rm = TRUE),
      ymin = pmin(Masculino, Femenino, na.rm = TRUE),
      fill = Femenino > Masculino
    ) %>%
    replace_na(list(Masculino = 0, Femenino = 0, ymax = 0, ymin = 0))
  # Creating the plot
  plot <- ggplot() +
    geom_ribbon(data = bounds,
                aes(x = NivelEducativo, ymin = ymin, ymax = ymax, fill = fill),
                alpha = 0.4) +
    geom_line(data = table,
              aes(x = NivelEducativo, y = percentage, group = Género, color = Género),
              size = 1) +
    geom_point(data = table,
               aes(x = NivelEducativo, y = percentage, shape = Género, color = Género),
               size = 4) +
    geom_text(data = table,
              aes(x = NivelEducativo, y = percentage,
                  label = paste0(round(n), " (", round(percentage), "%)")),
              position = position_dodge(width = 0.9), size = 4, color = "black", vjust = -0.5) +
    scale_fill_manual(values = c("TRUE" = "#e31a1c", "FALSE" = "#1f78b4"), guide = "none") +
    scale_color_manual(values = c("Masculino" = "#1f78b4", "Femenino" = "#e31a1c")) +
    scale_shape_manual(values = c("Masculino" = 21, "Femenino" = 23)) +
    theme_minimal() +
    labs(title = "Nivel educativo más alto cursado por género en CB",
         subtitle = "Distribución porcentual por género",
         x = "Nivel educativo", y = "Porcentaje (%)",
         color = "Género",
         shape = "Género") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 12),
          axis.text.y = element_text(size = 12),
          legend.text = element_text(size = 12),
          legend.title = element_text(size = 12),
          plot.title = element_text(face = "bold", size = 16),
          plot.subtitle = element_text(size = 14),
          axis.title.x = element_text(size = 14),
          axis.title.y = element_text(size = 14),
          legend.position = "right",
          legend.background = element_rect(color = "gray", size = 0.1),
          legend.key = element_blank(),
          panel.grid.major = element_line(color = "gray", size = 0.2),
          panel.grid.minor = element_blank(),
          axis.line = element_line(color = "black", size = 0.5))
  return(plot)
}
# Example call to the function (ensure 'data_cb' is defined)
graph1(data_cb, "CB")
However, the shaded area does not appear between the two lines. The bounds seem to be computed incorrectly: the bounds table contains many NA values, and the resulting ymin and ymax do not match the two lines.
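Since data_cb is not included here, the following is a minimal reproducible sketch of the reshaping step alone. The `toy` table and its values are made up for illustration (they are not my real data); it only mimics the shape of `table` after the complete() step, and the pivot_wider() call is the same as in graph1().

```r
library(dplyr)
library(tidyr)

# Made-up stand-in for `table` after complete(): two education levels,
# two genders, with a count (n) and a percentage per row.
toy <- tibble(
  NivelEducativo = factor(rep(c("Ninguno", "Postgrado"), each = 2),
                          levels = c("Ninguno", "Postgrado")),
  Género = rep(c("Masculino", "Femenino"), times = 2),
  n = c(40, 60, 70, 30),
  percentage = c(40, 60, 70, 30)
)

# Same reshaping call as in graph1()
bounds_toy <- toy %>%
  pivot_wider(names_from = Género, values_from = percentage)

print(bounds_toy)
# The result has one row per (NivelEducativo, n) combination, so 4 rows
# instead of the 2 I expected, and every row has NA in either
# Masculino or Femenino.
```

With this toy input I would expect one row per education level, with both Masculino and Femenino filled in, so the extra rows and NAs look like the same problem I see with the real data.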
What am I doing wrong? How can I correctly display the shaded area between the two lines?
[Figure: actual output]
[Figure: desired output (concept)]