I want to calculate estimated group mean scores in a 2×2 Gaussian regression after obtaining the regression coefficients. Here is some toy data: 200 observations, 100 in each region (a and b) and 100 of each sex (m and f). I have designed the scores so that there is a 5-point difference on average between regions a and b but no difference between m and f.
set.seed(1234)
d <- data.frame(region = factor(rep(letters[1:2],each=100)),
sex = factor(rep(c("m", "f"),times=100)),
score = round(x = c(rnorm(100, mean = 5, sd = 1),
rnorm(100, mean = 10, sd = 1)),
digits = 1))
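A quick sanity check of that design, as a minimal sketch. It assumes the data frame d is built exactly as above; the names cellCounts and regionMeans are just labels I made up for illustration.

```r
# Rebuild the toy data so this snippet runs on its own
set.seed(1234)
d <- data.frame(region = factor(rep(letters[1:2], each = 100)),
                sex    = factor(rep(c("m", "f"), times = 100)),
                score  = round(c(rnorm(100, mean = 5, sd = 1),
                                 rnorm(100, mean = 10, sd = 1)), digits = 1))

# Observations per region-by-sex cell (rows: region, columns: sex)
cellCounts <- with(d, table(region, sex))
cellCounts

# Raw mean score per region; the two should differ by roughly 5 points
regionMeans <- tapply(d$score, d$region, mean)
regionMeans
```

Each of the four cells should hold the same number of observations, since sex alternates row by row within each region.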
Now I will use the model.matrix() function to obtain contrast coefficients for each observation, based on its group membership. I will use treatment coding, that is [0,1], with region a and sex f as the reference levels for each (R sorts factor levels alphabetically, so f comes before m).
model.matrix(object = score ~ region*sex,
data = d,
contrasts.arg = list(region = contr.treatment(nlevels(d$region)),
sex = contr.treatment(nlevels(d$sex)))) -> cmTreat
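To see what the model matrix actually holds, here is a small sketch that prints its distinct rows and the contrast matrices stored on it. It assumes the same toy data as above; cells is a name I introduce here only for illustration.

```r
# Rebuild the toy data and model matrix so this snippet runs on its own
set.seed(1234)
d <- data.frame(region = factor(rep(letters[1:2], each = 100)),
                sex    = factor(rep(c("m", "f"), times = 100)),
                score  = round(c(rnorm(100, mean = 5, sd = 1),
                                 rnorm(100, mean = 10, sd = 1)), digits = 1))
cmTreat <- model.matrix(score ~ region * sex, data = d,
                        contrasts.arg = list(region = contr.treatment(nlevels(d$region)),
                                             sex    = contr.treatment(nlevels(d$sex))))

levels(d$sex)               # level order decides which level is the reference

cells <- unique(cmTreat)    # one distinct row of 0/1 codes per region-by-sex cell
cells

attr(cmTreat, "contrasts")  # the contrast matrices stored on the model matrix
```

The rows of cells appear in order of first occurrence in the data, not in factor-level order.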
Now we can use the model matrix directly in the regression using the lm() function. We specify 0 + terms because the model matrix already contains an intercept column.
(lm(d$score ~ 0 + cmTreat) -> lmTreat)
# output
# Call:
# lm(formula = d$score ~ 0 + cmTreat)
#
# Coefficients:
# cmTreat(Intercept) cmTreatregion2 cmTreatsex2 cmTreatregion2:sex2
# 4.814 5.132 0.056 0.140
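As a hedged check on the 0 + point, the sketch below compares the model-matrix fit with an ordinary formula fit and shows what happens when the 0 + is left out. It assumes the same toy data and R's default treatment contrasts for the formula fit; lmFormula and lmNoZero are names I made up for illustration.

```r
# Rebuild the toy data and model matrix so this snippet runs on its own
set.seed(1234)
d <- data.frame(region = factor(rep(letters[1:2], each = 100)),
                sex    = factor(rep(c("m", "f"), times = 100)),
                score  = round(c(rnorm(100, mean = 5, sd = 1),
                                 rnorm(100, mean = 10, sd = 1)), digits = 1))
cmTreat <- model.matrix(score ~ region * sex, data = d,
                        contrasts.arg = list(region = contr.treatment(nlevels(d$region)),
                                             sex    = contr.treatment(nlevels(d$sex))))

lmTreat   <- lm(d$score ~ 0 + cmTreat)           # fit on the ready-made model matrix
lmFormula <- lm(score ~ region * sex, data = d)  # let lm() build the matrix itself

# Same coefficients apart from their names
all.equal(unname(coef(lmTreat)), unname(coef(lmFormula)))

# Without "0 +", lm() adds its own intercept, so the matrix's intercept
# column is redundant and its coefficient is reported as NA
lmNoZero <- lm(d$score ~ cmTreat)
coef(lmNoZero)
```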
The regression has recovered the roughly 5-point region effect, with the sex and interaction coefficients close to zero. But what if we want to get estimated marginal means, specifically the estimated mean in each ‘cell’ of the 2×2: region a – female, region a – male, region b – female, region b – male?
We can do this manually via the attributes of the model matrix.
treatCoefs <- coef(lmTreat) # assign the vector of coefficients a name
# mean in region a female: intercept[1] + region[0] + sex[0] + region[0]*sex[0]
regionA_f <- treatCoefs[1] + treatCoefs[2]*attr(cmTreat, which = "contrasts")$region[,1][1] + treatCoefs[3]*attr(cmTreat, which = "contrasts")$sex[,1][1] + treatCoefs[4]*attr(cmTreat, which = "contrasts")$region[,1][1]*attr(cmTreat, which = "contrasts")$sex[,1][1]
# mean in region a male: intercept[1] + region[0] + sex[1] + region[0]*sex[1]
regionA_m <- treatCoefs[1] + treatCoefs[2]*attr(cmTreat, which = "contrasts")$region[,1][1] + treatCoefs[3]*attr(cmTreat, which = "contrasts")$sex[,1][2] + treatCoefs[4]*attr(cmTreat, which = "contrasts")$region[,1][1]*attr(cmTreat, which = "contrasts")$sex[,1][2]
# mean in region b female: intercept[1] + region[1] + sex[0] + region[1]*sex[0]
regionB_f <- treatCoefs[1] + treatCoefs[2]*attr(cmTreat, which = "contrasts")$region[,1][2] + treatCoefs[3]*attr(cmTreat, which = "contrasts")$sex[,1][1] + treatCoefs[4]*attr(cmTreat, which = "contrasts")$region[,1][2]*attr(cmTreat, which = "contrasts")$sex[,1][1]
# mean in region b male: intercept[1] + region[1] + sex[1] + region[1]*sex[1]
regionB_m <- treatCoefs[1] + treatCoefs[2]*attr(cmTreat, which = "contrasts")$region[,1][2] + treatCoefs[3]*attr(cmTreat, which = "contrasts")$sex[,1][2] + treatCoefs[4]*attr(cmTreat, which = "contrasts")$region[,1][2]*attr(cmTreat, which = "contrasts")$sex[,1][2]
Now if we compare the actual group means to the estimated means (apologies to non-dplyr people)…
library(dplyr)
library(tibble) # add_column() comes from tibble, not dplyr
d %>%
group_by(region, sex) %>%
summarise(actualMean = mean(score)) %>%
add_column(estMeans = c(regionA_f, regionA_m, regionB_f, regionB_m))
# # A tibble: 4 × 4
# # Groups: region [2]
# region sex actualMean estMeans
# <fct> <fct> <dbl> <dbl>
# 1 a f 4.81 4.81
# 2 a m 4.87 4.87
# 3 b f 9.95 9.95
# 4 b m 10.1 10.1
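For anyone without dplyr, here is a base-R sketch of the same comparison. It assumes the same toy data and fit as above, and it writes the four estimates as plain sums of coefficients, which is what the 0/1 codes in the comments above reduce to; est and actual are names I introduce only for illustration.

```r
# Rebuild the toy data, model matrix and fit so this snippet runs on its own
set.seed(1234)
d <- data.frame(region = factor(rep(letters[1:2], each = 100)),
                sex    = factor(rep(c("m", "f"), times = 100)),
                score  = round(c(rnorm(100, mean = 5, sd = 1),
                                 rnorm(100, mean = 10, sd = 1)), digits = 1))
cmTreat <- model.matrix(score ~ region * sex, data = d,
                        contrasts.arg = list(region = contr.treatment(nlevels(d$region)),
                                             sex    = contr.treatment(nlevels(d$sex))))
lmTreat <- lm(d$score ~ 0 + cmTreat)

# Coefficients in order: intercept, region b, sex m, region b : sex m
b <- unname(coef(lmTreat))

# Multiplying by a 0 drops a term, multiplying by a 1 keeps it
est <- c(a_f = b[1],
         a_m = b[1] + b[3],
         b_f = b[1] + b[2],
         b_m = b[1] + b[2] + b[3] + b[4])

# Observed cell means as a 2 x 2 table (rows: region, columns: sex)
actual <- tapply(d$score, list(d$region, d$sex), mean)

# Transpose so the values read a_f, a_m, b_f, b_m, matching est
all.equal(unname(est), as.vector(t(actual)))
```

Because the model has one coefficient per cell, the estimated and observed cell means should agree up to floating-point error.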
So this works great. “What is the problem?” I hear you ask. Well, you saw how much code was required to get the estimated means for each group. I can do it, but I was wondering: is there an easier way to do this manually?
I know I can use Russ Lenth’s excellent emmeans package, and I do use it a lot, but I wanted to learn how to do it manually in a more elegant way. I know nothing of matrix algebra and not a lot about contrast matrices. I just can’t help feeling that there is a better way (one that might adapt better across different designs and numbers of levels).
P.S. This question may have been better suited to Cross Validated, but I thought I would try here first as it is just R-specific enough to warrant posting on SO.