I want to conduct a study with a Difference-in-Differences (DiD) design. I have two groups that I want to keep balanced on certain variables: sex, age_group, education, income, ideology, and identity. Matching would be the ideal method, but I cannot afford to lose observations since I already have few. Instead, I thought of weighting respondents by these same variables. My idea is to examine the distribution of these variables in my whole sample and then ensure that their distribution is “equal” in both groups.
Here is an overview of my data:
ESTU PROV date year_month sex age_group sector_eu sector_eu_without sector_eu_fact
1 3202 8 2017-12-28 2017-12 2 75-79 Import-oriented sectors Import-oriented sectors Import-oriented sectors
2 3202 8 2017-12-28 2017-12 1 45-49 Import-oriented sectors Import-oriented sectors Import-oriented sectors
3 3202 8 2017-12-28 2017-12 2 45-49 Export-oriented sectors Export-oriented sectors Export-oriented sectors
4 3202 8 2017-12-28 2017-12 2 55-59 Import-oriented sectors Import-oriented sectors Import-oriented sectors
5 3202 8 2017-12-28 2017-12 1 35-39 Import-oriented sectors Import-oriented sectors Import-oriented sectors
6 3202 8 2017-12-28 2017-12 2 25-29 Non-tradable sectors <NA> Non-tradable sectors
7 3202 8 2017-12-28 2017-12 1 80+ Non-tradable sectors <NA> Non-tradable sectors
8 3202 8 2017-12-28 2017-12 2 40-44 Non-tradable sectors <NA> Non-tradable sectors
9 3202 8 2017-12-28 2017-12 2 60-64 Non-tradable sectors <NA> Non-tradable sectors
10 3202 8 2017-12-28 2017-12 2 65-69 Non-tradable sectors <NA> Non-tradable sectors
11 3202 8 2017-12-28 2017-12 1 35-39 Import-oriented sectors Import-oriented sectors Import-oriented sectors
12 3202 8 2017-12-28 2017-12 1 60-64 Non-tradable sectors <NA> Non-tradable sectors
13 3202 8 2017-12-28 2017-12 2 50-54 Non-tradable sectors <NA> Non-tradable sectors
14 3202 8 2017-12-28 2017-12 2 18-24 Non-tradable sectors <NA> Non-tradable sectors
15 3202 8 2017-12-28 2017-12 1 18-24 Non-tradable sectors <NA> Non-tradable sectors
16 3202 8 2017-12-28 2017-12 1 18-24 Export-oriented sectors Export-oriented sectors Export-oriented sectors
education indep ideology identity income work_sit_fact class media media_intns PESO
1 1 NA 2 4 7 Retired 3 Spanish media 1 1.8325
2 2 1 1 4 8 Retired 3 Catalan media 1 1.8325
3 2 NA 5 3 <NA> Unemployed 3 Spanish media 1 1.8325
4 1 0 5 3 5 <NA> 4 Spanish media 1 1.8325
5 5 1 3 5 10 Currently working 3 Catalan media 1 1.8325
6 4 0 5 3 <NA> Currently working 3 Spanish media 1 1.8325
7 <NA> 0 3 3 6 Retired 3 Spanish media 1 1.8325
8 5 1 2 4 10 Currently working 2 Spanish media 1 1.8325
9 1 0 4 3 5 Unemployed 3 Spanish media 2 1.8325
10 4 0 8 3 7 Retired 3 Spanish media 2 1.8325
11 5 1 1 5 7 Currently working 3 Catalan media 4 1.8325
12 5 1 4 4 8 Retired 3 Spanish media 4 1.8325
13 5 0 3 3 9 Currently working 3 Spanish media 1 1.8325
14 4 1 2 4 6 Currently working 3 Catalan media 2 1.8325
15 4 0 2 4 6 Currently working 3 Catalan media 1 1.8325
16 4 0 3 3 8 <NA> 3 Catalan media 1 1.8325
I would calculate the weights like this:
# Marginal proportions of each balancing variable in the full sample
# (indexing the data frame directly avoids attach()/detach())
balance_vars <- c("sex", "education", "ideology", "identity", "income", "age_group")
proportions <- lapply(DATA17_slct[balance_vars], function(x) prop.table(table(x)))
proportions
age_group_proportions <- proportions$age_group
age_group_proportions
Now, I have tried two different alternatives to apply the weights.
- With the crunch package. However, this requires an API key, so further elaboration on this point is probably unnecessary here.
- With the survey package:
So far I have only done this for one variable (sex). Is there a way to run this code for several variables at once? When I try, I get an error message saying that the variables are not of the same length, so it seems I cannot do this for multiple variables at once.
library(survey)  # svydesign(), rake()
library(tibble)  # tibble()
DATA17_slct_unweighted <- svydesign(ids = ~1,
data = DATA17_slct,
weights = NULL)
sex_dist <- tibble(sex = c("1", "2"), Freq = nrow(DATA17_slct)*c(0.4853061, 0.5146939))
dummy_sex_rake <- rake(design = DATA17_slct_unweighted,
sample.margins = list(~sex),
population.margins = list(sex_dist))
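For reference, this is a minimal sketch of what I imagine raking on two variables at once would look like. It assumes each variable gets its own one-sided formula in `sample.margins` and its own target table in `population.margins`, in the same order. The data frame `toy` and the constant weight column `w0` are made up for illustration and only stand in for `DATA17_slct`:

```r
library(survey)

# Toy data standing in for DATA17_slct (hypothetical values)
set.seed(1)
toy <- data.frame(
  sex       = factor(sample(c("1", "2"), 200, replace = TRUE)),
  education = factor(sample(as.character(1:5), 200, replace = TRUE)),
  w0        = 1  # equal starting weights
)

toy_design <- svydesign(ids = ~1, data = toy, weights = ~w0)

# One target table per variable: a column named like the variable plus Freq
sex_dist <- data.frame(
  sex  = c("1", "2"),
  Freq = nrow(toy) * c(0.4853061, 0.5146939)
)
education_dist <- data.frame(
  education = as.character(1:5),
  Freq      = nrow(toy) * c(0.1928934, 0.2284264, 0.1404399, 0.2229272, 0.2153130)
)

# Each margin is a separate list element; the two lists must be in the same order
toy_raked <- rake(design             = toy_design,
                  sample.margins     = list(~sex, ~education),
                  population.margins = list(sex_dist, education_dist))

# Weighted margins after raking, to compare against the targets
svytable(~sex, toy_raked)
svytable(~education, toy_raked)
```

The point of the sketch is that the target tables stay separate, so they can have different numbers of rows. I have not confirmed that this is what causes my "not of the same length" error.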
Also, when a variable contains at least one missing value, I get an error (see the example with my education variable below). However, I do not want to drop the observations that have missing values on these variables. How can I work around this problem?
DATA17_slct_unweighted <- svydesign(ids = ~1,
data = DATA17_slct,
weights = NULL)
education_dist <- tibble(education = c("1", "2", "3", "4", "5"), Freq = nrow(DATA17_slct)*c(0.1928934, 0.2284264, 0.1404399, 0.2229272, 0.2153130))
dummy_education_rake <- rake(design = DATA17_slct_unweighted,
sample.margins = list(~education),
population.margins = list(education_dist))
Error in na.fail.default(list(education = c(1L, 2L, 2L, 1L, 5L, 4L, NA, :
missing values in object
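One workaround I have considered, sketched here on toy data, is to recode `NA` into an explicit "Missing" category so that the margin no longer contains missing values. This rests on two assumptions of mine: that treating missingness as its own category is acceptable, and that its target share should equal its observed share in the sample. The names `toy`, `w0`, `education_f`, and `p_missing` are hypothetical:

```r
library(survey)

# Toy data with missing education values (hypothetical)
set.seed(2)
toy <- data.frame(
  education = sample(c(as.character(1:5), NA), 300, replace = TRUE),
  w0        = 1
)

# Recode NA into an explicit level so the raking margin has no missing values
edu <- toy$education
edu[is.na(edu)] <- "Missing"
toy$education_f <- factor(edu)

# Assumption: the "Missing" category keeps its observed share, and the
# target proportions for levels 1-5 are rescaled to the remaining share
p_missing    <- mean(toy$education_f == "Missing")
target_known <- c(0.1928934, 0.2284264, 0.1404399, 0.2229272, 0.2153130)
education_dist <- data.frame(
  education_f = c(as.character(1:5), "Missing"),
  Freq        = nrow(toy) * c(target_known * (1 - p_missing), p_missing)
)

toy_design <- svydesign(ids = ~1, data = toy, weights = ~w0)
toy_raked  <- rake(design             = toy_design,
                   sample.margins     = list(~education_f),
                   population.margins = list(education_dist))

# Every respondent, including those with missing education, keeps a weight
summary(weights(toy_raked))
```

Whether this is statistically defensible for my design is part of what I am asking.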
Now, returning to my principal question: when I run the analyses (OLS regressions), how do I apply the weights? I also have another weight variable (called PESO) that ensures my sample is representative of the population I want to study. In sum, I end up with two weight variables that I need to combine.
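To make the question concrete, here is a sketch of one way I could imagine combining them: use PESO as the starting weight of the design, rake on top of it, and then fit the regression with the resulting weights. The toy data and the variables `treated`, `post`, and `y` are hypothetical placeholders for my DiD setup, and I am not certain this is the right approach:

```r
library(survey)

# Toy data (hypothetical): PESO is the existing representativeness weight
set.seed(3)
n <- 200
toy <- data.frame(
  sex     = factor(sample(c("1", "2"), n, replace = TRUE)),
  treated = rbinom(n, 1, 0.5),
  post    = rbinom(n, 1, 0.5),
  PESO    = runif(n, 0.5, 2)
)
toy$y <- 1 + 0.5 * toy$treated * toy$post + rnorm(n)

# Start from PESO instead of equal weights
design_peso <- svydesign(ids = ~1, data = toy, weights = ~PESO)

# Rake on top of PESO; targets are scaled to the PESO-weighted total
sex_dist <- data.frame(
  sex  = c("1", "2"),
  Freq = sum(toy$PESO) * c(0.4853061, 0.5146939)
)
design_raked <- rake(design             = design_peso,
                     sample.margins     = list(~sex),
                     population.margins = list(sex_dist))

# The raked weights already incorporate PESO
toy$w_combined <- weights(design_raked)

# Weighted linear regression with design-based standard errors
fit_svy <- svyglm(y ~ treated * post, design = design_raked)

# Same point estimates via lm(); its standard errors ignore the design
fit_lm <- lm(y ~ treated * post, data = toy, weights = w_combined)

coef(fit_svy)
coef(fit_lm)
```

In this sketch the two fits should agree on the coefficients and differ only in their standard errors. What I do not know is whether raking on top of PESO is preferable to multiplying two separately computed weights.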
Again, my questions are:
- Are the survey package and the code provided above adequate for my problem? If so, how can I do this for several variables at a time?
- How can I combine the two weight variables to include them in my OLS regression?