I am new to R. I am working with a dataset in R where each patient is identified by a unique subjectid. Each patient has a baseline_date and multiple follow-up dates (fu_date). I want to filter follow-up records for each subjectid that fall within ±120 days of exactly 1 year (365 days) from the baseline_date.
After filtering, I need to calculate the total exacerbations (fu_exacerbations) for each patient during this period, making this a new column. How can I implement this in R?
This is the test_data.
Thanks a lot for your time!
I dont know how to go about it.
Salman Aslam Chaudhary is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
2
First of let’s look at your data:
The 31.6.2022 must be a mistake because June only has 30 days. It should be the 30.6.2022, right? I don’t know where this data is coming from, but it should be carefully checked if it’s health-related! We can do a basic sanity check on the dates of this column.
We should fill in the missing baseline_dates using the fill
function. Only then we can calculate the days from baseline like this days_from_baseline = as.numeric(difftime(fu_date, baseline_date, units = "days")))
and filter the abs-days-number for <= 120 days. Afterward, we summarize fu_exacerbations
and finally right join it with the underlying data frame:
library(tidyverse)
library(dplyr)
library(tidyr)
library(lubridate)
library(stringr)
# read data from csv
setwd(dirname(rstudioapi::getSourceEditorContext()$path))
# read csv in file folder
df <- read.csv("test_data.csv")
# To correct dates like "31.6.22" to "30.6.22":
clean_date <- function(date) {
# Ensure the date is character to avoid errors
date <- as.character(date)
# Extract components of the date
month <- as.numeric(stringr::str_extract(date, "^\d{1,2}"))
day <- as.numeric(stringr::str_extract(date, "(?<=/)\d{1,2}(?=/)"))
year <- as.numeric(stringr::str_extract(date, "\d{4}"))
# Correct invalid days based on the month
day <- dplyr::case_when(
month == 4 & day > 30 ~ 30, # April
month == 6 & day > 30 ~ 30, # June
month == 9 & day > 30 ~ 30, # September
month == 11 & day > 30 ~ 30, # November
month == 2 & day > 28 ~ 28, # February (not handling leap years)
TRUE ~ day
)
return(sprintf("%02d/%02d/%04d", month, day, year))
}
# use the function and then fill
result <- df %>% mutate(
baseline_date = if_else(baseline_date != "", clean_date(baseline_date), NA),
# sanity check dates with clean_date
fu_date = if_else(fu_date != "", clean_date(fu_date), NA)
) %>%
fill(baseline_date) %>%
# Group by patient
group_by(subjectid) %>%
# Calculate days from baseline for each follow-up date
mutate(days_from_baseline = as.numeric(difftime(
as.Date(fu_date, format = "%m/%d/%Y"),
as.Date(baseline_date, format = "%m/%d/%Y"),
units = "days"
))) %>%
# Filter records within ±120 days of 1 year
filter(abs(days_from_baseline - 365) <= 120) %>%
# Sum exacerbations for each patient
summarize(total_exacerbations = sum(fu_exacerbations, na.rm = TRUE)) %>%
# Join back with original data to keep other baseline information if needed
right_join(df %>% select(subjectid, baseline_date) %>% distinct(subjectid),
by = "subjectid")
print(result)
It will look like this:
# A tibble: 2 × 3
subjectid total_exacerbations baseline_date
<int> <int> <date>
1 2 4 2022-06-30
2 1 NA 2022-11-14
1