I have a long table format and try to construct a wide format that contains additional variables based on the rows.
My data has rows based on patients, their treatments, and their visits. Each row represents a single visit as the lowest level in the data structure hierarchy. Below I have tried to illustrate the structure, although the real data has many more variables that I wish to keep in the output. Here I just illustrated additional variables as var_x and var_y.
I wish to get a table that:
- keeps all variables from the original dataset
- only include the first visit row per patient, and
- includes the derived bio_drug_stop_date in that same row when there is no info about the stop date, which is often the case. The derived stop date is based on the date of the next treatment (first case of bio_drug_series==2) minus one day. Derived date is only added if there is not already info about stop date.
- Some patients may have only one treatment, and if so, the stop date is kept empty if there is no prior info about stop date.
As you can see in the example, the patient may have several visits before a new treatment starts. In the ‘want’ table I have tried to illustrate that patient 1 has bio_drug_stop_date based on existing info from original table and patient 2 and 3 have bio_drug_stop_date derived from first case of bio_drug_series==2.
have:
df1 <- data.frame(
patient_id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
bio_drug_series_number = c(1,1,2,1,2,3,1,1,2,3),
bio_drug_start_date = c("12-01-2015", "12-01-2015", "20-05-2008", "01-09-2020", "03-09-2020", "04-02-2021", "05-07-2004", "05-07-2004", "17-09-2011", "24-05-2019"),
bio_drug_stop_date = c("18-05-2008", "18-05-2008", NA, NA, NA, NA, NA, NA, NA, NA),
var_x = c("c", "a", "f", "g", "n", "o", "p", "q", "t", "u"),
var_y = c("d", "b", "e", "m", "l", "k", "h", "r", "s", "v")
)
want:
df222 <- data.frame(
patient_id = c(1,2,3),
bio_drug_series_number = c(1,1,1),
bio_drug_start_date = c("12-01-2015", "01-09-2020", "05-07-2004"),
bio_drug_stop_date = c("18-05-2008", "02-09-2020", "16-09-2011"),
var_x = c("c", "g", "p"),
var_y = c("d", "m", "h")
)
I would have thought I could find a similar case on StackOverflow, but wasn’t able to.