I am working with a longitudinal dataset of districts. blackout
indicates whether the district experienced at least one blackout during that year.
df <- data.frame(district= rep(c(1000:1003), each=4),
year= rep(c(2000:2003), rep=4),
blackout= c(0,0,1,1,0,0,0,0,1,1,1,1,0,1,0,1))
I want to calculate how many years it took for each district to experience their FIRST blackout.
The new df should look like this
df.1 <- data.frame(district= c(1000:1003),
time= c(3,5,1,2))
Note how blackouts can be intermittent but I only care for the count of the FIRST year.
Districts that survived the whole period without a blackout should be listed as 5.
Thank you
1
This method will not have trouble with missing or unsorted years.
library(tidyverse)
first_blackout <- df |>
filter(blackout == 1) |>
summarize(.by = district, first_blackout = min(year))
df |>
distinct(district) |>
left_join(first_blackout, by = join_by(district)) |>
mutate(time = first_blackout - min(df$year) + 1,
time = replace_na(time, 5))
summarise(df, time=case_when(
all(blackout==0)~5,
all(blackout==1)~1,
.default=1 + min(year[blackout==1]) - min(year[blackout==0])), .by=district)
district time
1 1000 3
2 1001 5
3 1002 1
4 1003 2
0
You can summarize
it by finding out the min
position of row in blackout
that equals 1 per district
. If every blackout
is zero, set it to 5 using ifelse
.
library(dplyr)
df |>
summarize(time = ifelse(all(blackout == 0), 5, min(which(blackout == 1))),
.by = district)
district time
1 1000 3
2 1001 5
3 1002 1
4 1003 2
1
With by
:
by(df, ~district, (d) with(d, {
j = suppressWarnings(min(which(blackout==1L)))
if(is.finite(j)) year[j]-year[1L]+1L else length(year)+1L })) |>
array2DF()
or
tapply(df$blackout, df$district, (x) max(cumsum(x==0L)*(cummax(x)==0L))+1L))
which borrows the logic given by user @M– Save the Data Dump in the comment below your question. Gives
district Value
1 1000 3
2 1001 5
3 1002 1
4 1003 2