Suppose there is a hospital clinic that has a list of how many patients visited the hospital every day. I have data over 10 years – but patients don’t visit the clinic everyday. To give an example, the data looks like this (in R):
library(dplyr)
set.seed(123)
start_date <- as.Date("2010-01-01")
end_date <- as.Date("2019-12-31")
all_dates <- seq.Date(start_date, end_date, by="day")
num_visits <- sample(1:length(all_dates), size = 3000, replace = FALSE)
visit_dates <- all_dates[num_visits]
num_patients <- sample(1:100, size = length(visit_dates), replace = TRUE)
clinic_data <- data.frame(date = visit_dates, num_patients = num_patients)
hospital_data <- clinic_data %>% arrange(date)
date num_patients
2010-01-01 90
2010-01-02 96
2010-01-04 65
2010-01-05 80
2010-01-06 15
2010-01-07 87
I want to try and answer the following question: On average – for any given month, what percent of all patients that month will have visited the clinic by day $y$? For example, suppose in some month I know that 900 people visited the hospital – I want to know that by the 19th, what percent (cumulatively) of these 900 had likely visited the hospital by then based on previous trends?
I tried to do this by manually identifying different logical steps:
library(ggplot2)
hospital_data$year <- as.numeric(format(as.Date(hospital_data$date), "%Y"))
hospital_data$month <- as.numeric(format(as.Date(hospital_data$date), "%m"))
hospital_data$day <- as.numeric(format(as.Date(hospital_data$date), "%d"))
hospital_data <- hospital_data[order(hospital_data$date), ]
yearly_totals <- aggregate(num_patients ~ year, data = hospital_data, FUN = sum)
names(yearly_totals)[2] <- "yearly_total"
hospital_data <- merge(hospital_data, yearly_totals, by = "year")
results <- by(hospital_data, hospital_data$year, function(df) {
df$cumulative_patients <- cumsum(df$num_patients)
df$cumulative_percentage <- df$cumulative_patients / df$yearly_total * 100
return(df)
})
results <- do.call(rbind, results)
avg_results <- aggregate(cumulative_percentage ~ day, data = results, FUN = mean, na.rm = TRUE)
avg_results <- avg_results[order(avg_results$day), ]
ggplot(avg_results, aes(x = day, y = cumulative_percentage)) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = seq(1, 31, by = 5)) +
scale_y_continuous(limits = c(0, 100)) +
labs(title = "Average Cumulative Percentage of Yearly Patients by Day",
x = "Day of Month",
y = "Average Cumulative Percentage of Patients") +
theme_minimal() +
theme(panel.grid.minor = element_blank())
But my graph is not displaying this cumulative percent:
Does someone have ideas where I am messing this up?
farrow90 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.