data frame breakdown by year, rounding up by duration of each period

I have a data frame of county executives and the first and last years they served.

I am running a panel study with county-year as the unit of analysis. The date range is 2000 to 2009.

I will like to expand the df such that it lists the period served by each executive in long format. Some executives serve until December 31st but others are inaugurated after the beginning of the calendar year.

The catch? If two or more executives served terms during the same calendar year, I want the data frame to attribute the county-year to the executive that was in office the longest.

Also note: caretaker executives that only served a few weeks should not be included in the data frame at all (ie. Tollson and Edwards).

My starting data looks like this:

df.a1 <- data.frame(executive.name= rep(c("Johnson", "Alleghany", "Clarke",  "Roland", "Tollson", "Richards", "Peters", "Harrison", "Burr", "Diamond", "Edwards", "Gorman"),each=2),
                 date= rep(c("from 01-Jan-2000", "to 31-Dec-2002", "from 01-Jan-2003", "to 03-Mar-2004", "from 04-Mar-2004", "to 05-Nov-2005", "from 06-Nov-2005", "to 31-Dec-2007", "from 01-Jan-2008", "to 03-Mar-2008", "from 04-Mar-2008", "to 30-Nov-2009"), times=2),
                  district= c(rep(1001:1002, each=12)))

> df.a1
   executive.name             date district
1         Johnson from 01-Jan-2000     1001
2         Johnson   to 31-Dec-2002     1001
3       Alleghany from 01-Jan-2003     1001
4       Alleghany   to 03-Mar-2004     1001
5          Clarke from 04-Mar-2004     1001
6          Clarke   to 05-Nov-2005     1001
7          Roland from 06-Nov-2005     1001
8          Roland   to 31-Dec-2007     1001
9         Tollson from 01-Jan-2008     1001
10        Tollson   to 03-Mar-2008     1001
11       Richards from 04-Mar-2008     1001
12       Richards   to 30-Nov-2009     1001
13         Peters from 01-Jan-2000     1002
14         Peters   to 31-Dec-2002     1002
15       Harrison from 01-Jan-2003     1002
16       Harrison   to 03-Mar-2004     1002
17           Burr from 04-Mar-2004     1002
18           Burr   to 05-Nov-2005     1002
19        Diamond from 06-Nov-2005     1002
20        Diamond   to 31-Dec-2007     1002
21        Edwards from 01-Jan-2008     1002
22        Edwards   to 03-Mar-2008     1002
23         Gorman from 04-Mar-2008     1002
24         Gorman   to 30-Nov-2009     1002

I will like it to look like this:

df.a1.neat <- data.frame(executive.name= c("Johnson", "Johnson", "Alleghany", "Clarke", "Clarke", "Roland", "Roland", "Richards", "Richards", "Peters", "Peters", "Harrison", "Burr", "Burr", "Diamond", "Diamond", "Gorman", "Gorman"),
                 date= rep(c(2000, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009), times=2),
                  district= c(rep(1001:1002, each=9)))

   executive.name date district
1         Johnson 2000     1001
2         Johnson 2002     1001
3       Alleghany 2003     1001
4          Clarke 2004     1001
5          Clarke 2005     1001
6          Roland 2006     1001
7          Roland 2007     1001
8        Richards 2008     1001
9        Richards 2009     1001
10         Peters 2000     1002
11         Peters 2002     1002
12       Harrison 2003     1002
13           Burr 2004     1002
14           Burr 2005     1002
15        Diamond 2006     1002
16        Diamond 2007     1002
17         Gorman 2008     1002
18         Gorman 2009     1002

This is including all years, because the logic is easier to handle programmatically – if you have a reason that you wish some years removed, I’ll let you handle that aspect.

library(dplyr)
library(tidyr)
library(lubridate)
library(stringr)

df.a1 |> 
  # Set up date_from and date_to vars for period calculation
  mutate(date_from = if_else(str_detect(date, "from"), date, lag(date)),
         date_to = if_else(str_detect(date, "to"), date, lead(date))) |> 
  select(-date) |> 
  # Remove duplicates
  distinct() |> 
  # Further preparing for period calculation
  mutate(date_from = str_remove(date_from, "from "),
         date_to = str_remove(date_to, "to "),
         across(date_from:date_to, dmy),
         year_from = year(date_from),
         year_to = year(date_to)) |>
  # Setting data to long format in order to vectorize period calculation
  pivot_longer(year_from:year_to, names_to = "start_end", values_to = "year") |> 
  # Filling in all years and then resorting the rows back to original sort
  group_by(executive.name, district) |> 
  complete(year = full_seq(min(year):max(year), 1)) |> 
  ungroup() |> 
  arrange(district, year) |> 
  # Calculating the period for each year, district, and name
  mutate(period = case_when(year(date_from) == year(date_to) ~ date_to - date_from, 
                            start_end == "year_from" ~ as.Date(paste0(year, "-12-31")) - date_from,
                            start_end == "year_to" ~ date_to - as.Date(paste0(year, "-01-01")),
                            TRUE ~ as.Date(paste0(year, "-12-31")) - as.Date(paste0(year, "-01-01"))), .by = year) |> 
  # Remove unwanted names for a given district and year
  filter(period == max(period), .by = c(district, year)) |> 
  # Set data frame to desired output
  select(executive.name, year, district)

# A tibble: 20 × 3
   executive.name  year district
   <chr>          <dbl>    <int>
 1 Johnson         2000     1001
 2 Johnson         2001     1001
 3 Johnson         2002     1001
 4 Alleghany       2003     1001
 5 Clarke          2004     1001
 6 Clarke          2005     1001
 7 Roland          2006     1001
 8 Roland          2007     1001
 9 Richards        2008     1001
10 Richards        2009     1001
11 Peters          2000     1002
12 Peters          2001     1002
13 Peters          2002     1002
14 Harrison        2003     1002
15 Burr            2004     1002
16 Burr            2005     1002
17 Diamond         2006     1002
18 Diamond         2007     1002
19 Gorman          2008     1002
20 Gorman          2009     1002

df.a1 |>
  mutate(dt = dmy(date), year = year(dt), day = yday(dt),
         years = list(full_seq(year, 1)), 
         days = ifelse(str_detect(date, fixed("to")), day, 365 - day + 1),
         days = ifelse(n_distinct(year) == 1, max(dt) - min(dt), days),
         .by = executive.name) |>
  filter(max(dt) - min(dt) > 100, .by = executive.name) |>
  slice_max(days, n = 1, by = c(year, district)) |>
  mutate(years = map2(row_number(), list(pick(everything())), (rn, d) {
    setdiff(d$years[[rn]], d$year[d$executive.name != d$executive.name[[rn]]])
  }), .by = district) |>
  select(executive.name, years, district) |>
  distinct() |>
  unnest(years)
# # A tibble: 20 × 3
#    executive.name years district
#    <chr>          <dbl>    <int>
#  1 Johnson         2000     1001
#  2 Johnson         2001     1001
#  3 Johnson         2002     1001
#  4 Alleghany       2003     1001
#  5 Clarke          2004     1001
#  6 Clarke          2005     1001
#  7 Roland          2006     1001
#  8 Roland          2007     1001
#  9 Richards        2008     1001
# 10 Richards        2009     1001
# 11 Peters          2000     1002
# 12 Peters          2001     1002
# 13 Peters          2002     1002
# 14 Harrison        2003     1002
# 15 Burr            2004     1002
# 16 Burr            2005     1002
# 17 Diamond         2006     1002
# 18 Diamond         2007     1002
# 19 Gorman          2008     1002
# 20 Gorman          2009     1002

Some information on how this works:

mutate: this creates a date column dt and extracts date components for other computations (e.g. year, day of the year). A list-column years is created that contains all the years an executive.name served terms. As you mentioned some years have overlap, this is take care of later. Also, days column is created. This contains the number of days an executive.name served within a year. This is used as a tie breaker for those that served part of a year. The year is assigned to the executive.name that served the largest portion of the year in a later step. A small correction is made to days to ensure the right output if they served less than a year.
filter: this keeps rows that served more than 100 days. This excludes “Tollson”. You can set this filter to be different based on your needs.
slice_max: this removes rows within a county and year when there was more than one executive.name. By taking the max days, we keep the row of the executive.name that served a higher proportion of the year within a district.
mutate: this step is a bit tricky, but years is for every year an executive.name served a term. However, in split years you want to assign that district and year to one individual that served the most days. This step essentially removes those years from a term span that were not assigned to that individual. For example, Roland started in 2005 in district 1001. Since Clarke served a larger portion of that year, that year is assigned to Clarke, so we remove it from Roland’s years span.
The last three steps (select, distinct, and unnest) simply format the data into a long format.

Trang chủ Giới thiệu Sinh nhật bé trai Sinh nhật bé gái Tổ chức sự kiện Biểu diễn giải trí Dịch vụ khác Trang trí tiệc cưới Tổ chức khai trương Tư vấn dịch vụ Thư viện ảnh Tin tức - sự kiện Liên hệ Chú hề sinh nhật Trang trí YEAR END PARTY công ty Trang trí tất niên cuối năm Trang trí tất niên xu hướng mới nhất Trang trí sinh nhật bé trai Hải Đăng Trang trí sinh nhật bé Khánh Vân Trang trí sinh nhật Bích Ngân Trang trí sinh nhật bé Thanh Trang Thuê ông già Noel phát quà Biểu diễn xiếc khỉ Xiếc quay đĩa Dịch vụ tổ chức sự kiện 5 sao Thông tin về chúng tôi Dịch vụ sinh nhật bé trai Dịch vụ sinh nhật bé gái Sự kiện trọn gói Các tiết mục giải trí Dịch vụ bổ trợ Tiệc cưới sang trọng Dịch vụ khai trương Tư vấn tổ chức sự kiện Hình ảnh sự kiện Cập nhật tin tức Liên hệ ngay Thuê chú hề chuyên nghiệp Tiệc tất niên cho công ty Trang trí tiệc cuối năm Tiệc tất niên độc đáo Sinh nhật bé Hải Đăng Sinh nhật đáng yêu bé Khánh Vân Sinh nhật sang trọng Bích Ngân Tiệc sinh nhật bé Thanh Trang Dịch vụ ông già Noel Xiếc thú vui nhộn Biểu diễn xiếc quay đĩa Dịch vụ tổ chức tiệc uy tín Khám phá dịch vụ của chúng tôi Tiệc sinh nhật cho bé trai Trang trí tiệc cho bé gái Gói sự kiện chuyên nghiệp Chương trình giải trí hấp dẫn Dịch vụ hỗ trợ sự kiện Trang trí tiệc cưới đẹp Khởi đầu thành công với khai trương Chuyên gia tư vấn sự kiện Xem ảnh các sự kiện đẹp Tin mới về sự kiện Kết nối với đội ngũ chuyên gia Chú hề vui nhộn cho tiệc sinh nhật Ý tưởng tiệc cuối năm Tất niên độc đáo Trang trí tiệc hiện đại Tổ chức sinh nhật cho Hải Đăng Sinh nhật độc quyền Khánh Vân Phong cách tiệc Bích Ngân Trang trí tiệc bé Thanh Trang Thuê dịch vụ ông già Noel chuyên nghiệp Xem xiếc khỉ đặc sắc Xiếc quay đĩa thú vị

Filed under: Kiến thức lập trình - @ 06:28

Thẻ: rdataframedatedplyrtime

Thiết kế website giá rẻ

Danh mục

data frame breakdown by year, rounding up by duration of each period