I have a data frame of county executives and the first and last years they served.
I am running a panel study with county-year as the unit of analysis. The date range is 2000 to 2009.
I will like to expand the df such that it lists the period served by each executive in long format. Some executives serve until December 31st but others are inaugurated after the beginning of the calendar year.
The catch? If two or more executives served terms during the same calendar year, I want the data frame to attribute the county-year to the executive that was in office the longest.
Also note: caretaker executives that only served a few weeks should not be included in the data frame at all (ie. Tollson and Edwards).
My starting data looks like this:
df.a1 <- data.frame(executive.name= rep(c("Johnson", "Alleghany", "Clarke", "Roland", "Tollson", "Richards", "Peters", "Harrison", "Burr", "Diamond", "Edwards", "Gorman"),each=2),
date= rep(c("from 01-Jan-2000", "to 31-Dec-2002", "from 01-Jan-2003", "to 03-Mar-2004", "from 04-Mar-2004", "to 05-Nov-2005", "from 06-Nov-2005", "to 31-Dec-2007", "from 01-Jan-2008", "to 03-Mar-2008", "from 04-Mar-2008", "to 30-Nov-2009"), times=2),
district= c(rep(1001:1002, each=12)))
> df.a1
executive.name date district
1 Johnson from 01-Jan-2000 1001
2 Johnson to 31-Dec-2002 1001
3 Alleghany from 01-Jan-2003 1001
4 Alleghany to 03-Mar-2004 1001
5 Clarke from 04-Mar-2004 1001
6 Clarke to 05-Nov-2005 1001
7 Roland from 06-Nov-2005 1001
8 Roland to 31-Dec-2007 1001
9 Tollson from 01-Jan-2008 1001
10 Tollson to 03-Mar-2008 1001
11 Richards from 04-Mar-2008 1001
12 Richards to 30-Nov-2009 1001
13 Peters from 01-Jan-2000 1002
14 Peters to 31-Dec-2002 1002
15 Harrison from 01-Jan-2003 1002
16 Harrison to 03-Mar-2004 1002
17 Burr from 04-Mar-2004 1002
18 Burr to 05-Nov-2005 1002
19 Diamond from 06-Nov-2005 1002
20 Diamond to 31-Dec-2007 1002
21 Edwards from 01-Jan-2008 1002
22 Edwards to 03-Mar-2008 1002
23 Gorman from 04-Mar-2008 1002
24 Gorman to 30-Nov-2009 1002
I will like it to look like this:
df.a1.neat <- data.frame(executive.name= c("Johnson", "Johnson", "Alleghany", "Clarke", "Clarke", "Roland", "Roland", "Richards", "Richards", "Peters", "Peters", "Harrison", "Burr", "Burr", "Diamond", "Diamond", "Gorman", "Gorman"),
date= rep(c(2000, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009), times=2),
district= c(rep(1001:1002, each=9)))
executive.name date district
1 Johnson 2000 1001
2 Johnson 2002 1001
3 Alleghany 2003 1001
4 Clarke 2004 1001
5 Clarke 2005 1001
6 Roland 2006 1001
7 Roland 2007 1001
8 Richards 2008 1001
9 Richards 2009 1001
10 Peters 2000 1002
11 Peters 2002 1002
12 Harrison 2003 1002
13 Burr 2004 1002
14 Burr 2005 1002
15 Diamond 2006 1002
16 Diamond 2007 1002
17 Gorman 2008 1002
18 Gorman 2009 1002
4
This is including all years, because the logic is easier to handle programmatically – if you have a reason that you wish some years removed, I’ll let you handle that aspect.
library(dplyr)
library(tidyr)
library(lubridate)
library(stringr)
df.a1 |>
# Set up date_from and date_to vars for period calculation
mutate(date_from = if_else(str_detect(date, "from"), date, lag(date)),
date_to = if_else(str_detect(date, "to"), date, lead(date))) |>
select(-date) |>
# Remove duplicates
distinct() |>
# Further preparing for period calculation
mutate(date_from = str_remove(date_from, "from "),
date_to = str_remove(date_to, "to "),
across(date_from:date_to, dmy),
year_from = year(date_from),
year_to = year(date_to)) |>
# Setting data to long format in order to vectorize period calculation
pivot_longer(year_from:year_to, names_to = "start_end", values_to = "year") |>
# Filling in all years and then resorting the rows back to original sort
group_by(executive.name, district) |>
complete(year = full_seq(min(year):max(year), 1)) |>
ungroup() |>
arrange(district, year) |>
# Calculating the period for each year, district, and name
mutate(period = case_when(year(date_from) == year(date_to) ~ date_to - date_from,
start_end == "year_from" ~ as.Date(paste0(year, "-12-31")) - date_from,
start_end == "year_to" ~ date_to - as.Date(paste0(year, "-01-01")),
TRUE ~ as.Date(paste0(year, "-12-31")) - as.Date(paste0(year, "-01-01"))), .by = year) |>
# Remove unwanted names for a given district and year
filter(period == max(period), .by = c(district, year)) |>
# Set data frame to desired output
select(executive.name, year, district)
# A tibble: 20 × 3
executive.name year district
<chr> <dbl> <int>
1 Johnson 2000 1001
2 Johnson 2001 1001
3 Johnson 2002 1001
4 Alleghany 2003 1001
5 Clarke 2004 1001
6 Clarke 2005 1001
7 Roland 2006 1001
8 Roland 2007 1001
9 Richards 2008 1001
10 Richards 2009 1001
11 Peters 2000 1002
12 Peters 2001 1002
13 Peters 2002 1002
14 Harrison 2003 1002
15 Burr 2004 1002
16 Burr 2005 1002
17 Diamond 2006 1002
18 Diamond 2007 1002
19 Gorman 2008 1002
20 Gorman 2009 1002
df.a1 |>
mutate(dt = dmy(date), year = year(dt), day = yday(dt),
years = list(full_seq(year, 1)),
days = ifelse(str_detect(date, fixed("to")), day, 365 - day + 1),
days = ifelse(n_distinct(year) == 1, max(dt) - min(dt), days),
.by = executive.name) |>
filter(max(dt) - min(dt) > 100, .by = executive.name) |>
slice_max(days, n = 1, by = c(year, district)) |>
mutate(years = map2(row_number(), list(pick(everything())), (rn, d) {
setdiff(d$years[[rn]], d$year[d$executive.name != d$executive.name[[rn]]])
}), .by = district) |>
select(executive.name, years, district) |>
distinct() |>
unnest(years)
# # A tibble: 20 × 3
# executive.name years district
# <chr> <dbl> <int>
# 1 Johnson 2000 1001
# 2 Johnson 2001 1001
# 3 Johnson 2002 1001
# 4 Alleghany 2003 1001
# 5 Clarke 2004 1001
# 6 Clarke 2005 1001
# 7 Roland 2006 1001
# 8 Roland 2007 1001
# 9 Richards 2008 1001
# 10 Richards 2009 1001
# 11 Peters 2000 1002
# 12 Peters 2001 1002
# 13 Peters 2002 1002
# 14 Harrison 2003 1002
# 15 Burr 2004 1002
# 16 Burr 2005 1002
# 17 Diamond 2006 1002
# 18 Diamond 2007 1002
# 19 Gorman 2008 1002
# 20 Gorman 2009 1002
Some information on how this works:
mutate
: this creates a date columndt
and extracts date components for other computations (e.g. year, day of the year). A list-columnyears
is created that contains all the years anexecutive.name
served terms. As you mentioned some years have overlap, this is take care of later. Also,days
column is created. This contains the number of days anexecutive.name
served within a year. This is used as a tie breaker for those that served part of a year. The year is assigned to theexecutive.name
that served the largest portion of the year in a later step. A small correction is made todays
to ensure the right output if they served less than a year.filter
: this keeps rows that served more than 100 days. This excludes “Tollson”. You can set this filter to be different based on your needs.slice_max
: this removes rows within a county and year when there was more than oneexecutive.name
. By taking the maxdays
, we keep the row of theexecutive.name
that served a higher proportion of the year within a district.mutate
: this step is a bit tricky, butyears
is for every year anexecutive.name
served a term. However, in split years you want to assign that district and year to one individual that served the most days. This step essentially removes those years from a term span that were not assigned to that individual. For example, Roland started in 2005 in district 1001. Since Clarke served a larger portion of that year, that year is assigned to Clarke, so we remove it from Roland’syears
span.- The last three steps (
select
,distinct
, andunnest
) simply format the data into a long format.