I am using summarize()
to transform one dataframe (cid_term
) to another one (row_1dyadep
). I want to receive one row for each value of a grouping variable (new_dyadep_id
). For many of the variables, I want to use the last non-NA value in the summarized output (because they are either NA or all the same anyway). My problem is that one of these variables is a date column (ependdate
), and I have not been able to preserve it in the correct form. For example, the value in the first data-frame “1991-10-10” becomes “7952” after summarization.
This is the code I used:
row_1dyadep <- cid_term %>%
group_by(new_dyadep_id) %>%
summarize(
new_dyadep_id = last(new_dyadep_id),
dyad_id = ifelse(all(is.na(dyad_id)), NA, last(dyad_id[!is.na(dyad_id)], default = NA)),
location = ifelse(all(is.na(location)), NA, last(location[!is.na(location)], default = NA)),
first_year_active = ifelse(all(is.na(first_year_active)), NA, last(first_year_active[!is.na(first_year_active)], default = NA)),
last_year_active = ifelse(all(is.na(last_year_active)), NA, last(last_year_active[!is.na(last_year_active)], default = NA)),
ependdate = ifelse(all(is.na(ependdate)), NA, last(ependdate[!is.na(ependdate)], default = NA)),
incompatibility = ifelse(all(is.na(incompatibility)), NA, last(incompatibility[!is.na(incompatibility)], default = NA)),
conflict_id = ifelse(all(is.na(conflict_id)), NA, last(conflict_id[!is.na(conflict_id)], default = NA)),
side_a = ifelse(all(is.na(side_a)), NA, last(side_a[!is.na(side_a)], default = NA)),
side_b = ifelse(all(is.na(side_b)), NA, last(side_b[!is.na(side_b)], default = NA)),
conflict = ifelse(all(is.na(conflict)), NA, last(conflict[!is.na(conflict)], default = NA)),
dyadepisode = ifelse(all(is.na(dyadepisode)), NA, last(dyadepisode[!is.na(dyadepisode)], default = NA)),
outcome = ifelse(all(is.na(outcome)), NA, last(outcome[!is.na(outcome)], default = NA)),
territory_name = ifelse(all(is.na(territory_name)), NA, last(territory_name[!is.na(territory_name)], default = NA)),
region = ifelse(all(is.na(region)), NA, last(region[!is.na(region)], default = NA)),
last_dyadcount = ifelse(all(is.na(dyadcount)), NA, last(dyadcount[!is.na(dyadcount)], default = NA)),
max_dyadcount = max(dyadcount, na.rm = TRUE),
last_intensity_level = ifelse(all(is.na(intensity_level)), NA, last(intensity_level[!is.na(intensity_level) & active_year == 1], default = NA)),
max_intensity_level = max(intensity_level, na.rm = TRUE),
last_type_of_conflict = ifelse(all(is.na(type_of_conflict)), NA, last(type_of_conflict[!is.na(type_of_conflict)], default = NA)),
max_type_of_conflict = max(type_of_conflict, na.rm = TRUE),
avg_issue_count_per_year = mean(issue_count_per_year, na.rm = TRUE),
obs_years_per_episode = n(), # nrows per episode
across(all_of(issue_columns), ~ sum(.x, na.rm = TRUE), .names = "freq_{col}"),
across(starts_with("ethnicity"), ~ ifelse(all(is.na(.x)), NA, last(.x[!is.na(.x)], default = NA)), .names = "{col}"),
across(starts_with("geography"), ~ ifelse(all(is.na(.x)), NA, last(.x[!is.na(.x)], default = NA)), .names = "{col}"),
across(starts_with("ideology"), ~ ifelse(all(is.na(.x)), NA, last(.x[!is.na(.x)], default = NA)), .names = "{col}"),
across(starts_with("religion"), ~ ifelse(all(is.na(.x)), NA, last(.x[!is.na(.x)], default = NA)), .names = "{col}")
)
The relevant line (I suppose) is the one starting with ependdate
. Note that using a combination of last()
and na-omit()
(as ChatGPT has been proposing) does not work, because then I end up with less rows in row_1dyadep
than I have unique values of new_dyadep_id
. I think because with that code, groups of new_dyadep_id
that show only NAs in ependdate
would be dropped entirely which I want to avoid.