I have a data da
that has some date
frequency distribution (see below). I have another data db
, in which each id may have one or multiple records. Is there a feasible way to get one record (no more and no less) for each group, so that the sampled distribution of date in db
is as close as possible to that of da
?
library(data.table)
library(dplyr)
library(lubridate)
# da is the data to be emulated
da = data.table(id = paste0('a', 1:7),
date = ymd(c('2021-1-10',
rep('2021-1-11', 2),
rep('2021-1-12', 3),
'2021-1-13')))
da
da[,.N,(date)][,.(date, N, perc = N/sum(N))]
# date N perc
# 1: 2021-01-10 1 0.1428571
# 2: 2021-01-11 2 0.2857143
# 3: 2021-01-12 3 0.4285714
# 4: 2021-01-13 1 0.1428571
# need to get only one (no more and no less) sample for each id
# to emulate the distribution of date in da
set.seed(123)
db = structure(list(id = c(1L, 2L, 3L, 3L, 3L, 4L, 5L, 6L, 6L, 8L),
date = structure(c(18638, 18639, 18639, 18640, 18640, 18637,
18640, 18637, 18638, 18639), class = "Date")),
class = c("data.table", "data.frame"))
> db
id date
1: 1 2021-01-11
2: 2 2021-01-12
3: 3 2021-01-12
4: 3 2021-01-13
5: 3 2021-01-13
6: 4 2021-01-10
7: 5 2021-01-13
8: 6 2021-01-10
9: 6 2021-01-11
10: 8 2021-01-12
For a probabilistic approach (works well if the data is fairly large), you could use the strata
function from the sampling package.
Here is the distribution to be emulated, which we save to an R object.
da.N <- da[,.N,(date)][,.(date, N, perc = N/sum(N))]; da.N
date N perc
<Date> <int> <num>
1: 2021-01-10 1 0.1428571
2: 2021-01-11 2 0.2857143
3: 2021-01-12 3 0.4285714
4: 2021-01-13 1 0.1428571
Merge the two tables on date
so that the percent column is in the data we want to sample from.
db <- da.N[db, on='date']
Then perform systematic unequal probability sampling on db. This will randomly select one row (size=1) from each id
(strata) with inclusion probabilities (“pik”) given by the distribution of the dates in da (column perc
).
library(sampling)
set.seed(12) # for reproducibility (omit in reality)
s <- strata(db,
stratanames="id",
size=rep(1, length(unique(db$id))),
pik=db$perc, # must be db$perc
method="systematic")
Get the data and check the frequency distribution of the dates.
db2 <- getdata(db, s)
setDT(db2)[,.N,(date)][,.(date, N, perc = N/sum(N))]
date N perc
<Date> <int> <num>
1: 2021-01-10 2 0.2857143
2: 2021-01-11 1 0.1428571
3: 2021-01-12 2 0.2857143
4: 2021-01-13 2 0.2857143
Hard to say if it worked with such a small sample. Let’s try it for a larger dataset.
set.seed(123)
db <- data.table(id = sample(1:400, 10000, TRUE),
date = ymd(rep('2021-1-10', 100),
rep('2021-1-11', 100),
rep('2021-1-12', 100),
rep('2021-1-13', 100)))
db <- da.N[db, on='date']
set.seed(123)
s <- strata(db,
stratanames="id",
size=rep(1, length(unique(db$id))),
pik=db$perc,
method="systematic")
db2 <- getdata(db, s)
setDT(db2)[,.N,(date)][,.(date, N, perc = N/sum(N))][order(date)]
date N perc
<Date> <int> <num>
1: 2021-01-10 71 0.1775 # 0.1428571
2: 2021-01-11 109 0.2725 # 0.2857143
3: 2021-01-12 166 0.4150 # 0.4285714
4: 2021-01-13 54 0.1350 # 0.1428571
This is fairly close to the desired distribution (numbers after the #), and the ids should be unique.
db2[, .N, (id)]
id N
<int> <int>
1: 179 1
2: 14 1
3: 195 1
4: 306 1
5: 118 1
---
396: 233 1
397: 192 1
398: 15 1
399: 62 1
400: 97 1