I have two data frames I’m trying to combine. One holds a sequence of dates the other holds a variable with its first observed and last observed dates.
Reprex for sequence of dates:
date.seq<-seq.Date(as.Date("01/01/2024" , '%m/%d/%Y'), by= 'day', length.out = 20)
date.df<-as.data.frame(date.seq)
> date.df
date.seq
1 2024-01-01
2 2024-01-02
3 2024-01-03
4 2024-01-04
5 2024-01-05
6 2024-01-06
7 2024-01-07
8 2024-01-08
9 2024-01-09
10 2024-01-10
11 2024-01-11
12 2024-01-12
13 2024-01-13
14 2024-01-14
15 2024-01-15
16 2024-01-16
17 2024-01-17
18 2024-01-18
19 2024-01-19
20 2024-01-20
Reprex For the start and end data frame:
x.start <- as.Date("01/01/2024" , '%d/%m/%Y')
y.start <- as.Date("10/01/2024" , '%d/%m/%Y')
z.start <-as.Date("17/01/2024" , '%d/%m/%Y')
x.end <- as.Date("10/01/2024" , '%d/%m/%Y')
y.end <- as.Date("17/01/2024" , '%d/%m/%Y')
z.end <- as.Date("20/01/2024" , '%d/%m/%Y')
df<- data.frame( var = c('x','y','z'),
start = c(x.start, y.start, z.start),
end = c(x.end, y.end, z.end))
> df
var start end
1 x 2024-01-01 2024-01-10
2 y 2024-01-10 2024-01-17
3 z 2024-01-17 2024-01-20
I’d like to see something that looks similar to this
date.seq Var
1 2024-01-01 X
2 2024-01-02 X
3 2024-01-03 X
4 2024-01-04 X
5 2024-01-05 X
6 2024-01-06 X
7 2024-01-07 X
8 2024-01-08 X
9 2024-01-09 X
10 2024-01-10 X
11 2024-01-11 Y
12 2024-01-12 Y
13 2024-01-13 Y
14 2024-01-14 Y
15 2024-01-15 Y
16 2024-01-16 Y
17 2024-01-17 Y
18 2024-01-18 Z
19 2024-01-19 Z
20 2024-01-20 Z
You can use an overlap join which will allow you to control what happens to rows that have no or multiple matches, e.g.
library(dplyr)
# Keep all rows in date.df and return all matches
left_join(date.df, df, join_by(between(date.seq, start, end)))
# Keep all rows in date.df and return first match
left_join(date.df, df, join_by(between(date.seq, start, end)), multiple = "first")
# Drop rows in date.df that don't match range in df
inner_join(date.df, df, join_by(between(date.seq, start, end)))
Take a look at the multiple
and relationship
arguments to control what should happen when there are multiple matches between keys, as it’s not completely clear from your post.
Playing around in base, we can approach this as, how long should each x, y, z be:
as.numeric(gsub('(\d+)', '\1', difftime(z.end, z.start)))
[1] 3
# could also be
# abs(as.numeric(gsub('(\d+)', '\1', difftime(z.start, z.end))))
c(abs(as.numeric(gsub('(\d+)', '\1', difftime(x.start, x.end)))), abs(as.numeric(gsub('(\d+)', '\1', difftime(y.start, y.end)))), abs(as.numeric(gsub('(\d+)', '\1', difftime(z.start, z.end)))))
[1] 9 7 3
# make the above `n_days <- c(abs...
# then make `Var`, Var <- c(rep('x'...
c(rep('x', n_days[1]), rep('y', n_days[2]), rep('z', n_days[3]))
use seq.Date
with c(
to make sequence of dates from your start ends, but first your have some thinking, fiddling around to do to decide which date is what Var (x,y,z) is date.seq
sum(n_days)
[1] 19
# whereas
length(c(seq.Date(x.start, x.end, by = 'days'), seq.Date(y.start, y.end, by = 'days'), seq.Date(z.start, z.end, by = 'days')))
[1] 22 # diff 3 - like the groups 3
# which might be
length(c(seq.Date(x.start, x.end -1, by = 'days'), seq.Date(y.start, y.end -1, by = 'days'), seq.Date(z.start, z.end -1, by = 'days')))
[1] 19
c(seq.Date(x.start, x.end -1, by = 'days'), seq.Date(y.start, y.end -1, by = 'days'), seq.Date(z.start, z.end -1, by = 'days'))
date.seq <- c(seq.Date
then
df <- data.frame(date.seq = date.seq, Var = Var)
There are plenty of ways to do this, but trying it this way, then using packages meant for this informs what the packages are doing.
df
date.seq Var
1 2024-01-01 x
2 2024-01-02 x
3 2024-01-03 x
4 2024-01-04 x
5 2024-01-05 x
6 2024-01-06 x
7 2024-01-07 x
8 2024-01-08 x
9 2024-01-09 x
10 2024-01-10 y
11 2024-01-11 y
12 2024-01-12 y
13 2024-01-13 y
14 2024-01-14 y
15 2024-01-15 y
16 2024-01-16 y
17 2024-01-17 z
18 2024-01-18 z
19 2024-01-19 z
2
Haven’t used cut()
or split()
for this. Another base option:
> n = mapply((x, y) sum(date.df$date.seq %in% seq.Date(x, y-1L, 1L)), df$start, df$end)
> n[1L] = n[1L]+1L
> date.df$var = rep(df$var, n)
> date.df
date.seq var
1 2024-01-01 x
2 2024-01-02 x
3 2024-01-03 x
4 2024-01-04 x
5 2024-01-05 x
6 2024-01-06 x
7 2024-01-07 x
8 2024-01-08 x
9 2024-01-09 x
10 2024-01-10 x
11 2024-01-11 y
12 2024-01-12 y
13 2024-01-13 y
14 2024-01-14 y
15 2024-01-15 y
16 2024-01-16 y
17 2024-01-17 y
18 2024-01-18 z
19 2024-01-19 z
20 2024-01-20 z