I have a hypothetical dataset with 4 mutually exclusive treatments, a continuous predictor, and a response variable. Each treatment is coded as a logical dummy variable that is either FALSE or TRUE.
I am trying to fit a linear regression model with an intercept fixed at zero (y ~ 0 + treat1 + treat2 + treat3 + treat4 + contvar) using the lm() function in R. I want to get coefficient estimates for all four treatments, as well as the continuous variable.
Here’s a reproducible example of my issue:
# Sample size
n <- 100
# Generate predictors, treatments 1 - 4 that are mutually exclusive, and a continuous variable
data <- data.frame(
  treat1 = c(rep(TRUE, 25), rep(FALSE, 75)),
  treat2 = c(rep(FALSE, 25), rep(TRUE, 25), rep(FALSE, 50)),
  treat3 = c(rep(FALSE, 50), rep(TRUE, 25), rep(FALSE, 25)),
  treat4 = c(rep(FALSE, 75), rep(TRUE, 25)),
  contvar = sample(0:100, n) / 100
)
# Draw per-observation effects around the four treatment means
mean1 <- rnorm(n, -0.5, 0.1)
mean2 <- rnorm(n, -0.2, 0.1)
mean3 <- rnorm(n,  0.2, 0.1)
mean4 <- rnorm(n,  0.5, 0.1)
# Generate response variable y based on the treatment means and the value of the continuous variable
data$y <- mean1 * data$treat1 + mean2 * data$treat2 + mean3 * data$treat3 + mean4 * data$treat4
data$y <- data$y * data$contvar
# Fit a no-intercept model
model0 <- lm(y ~ 0 + treat1 + treat2 + treat3 + treat4 + contvar, data = data)
# Summarize the no-intercept model
summary(model0)
The resulting output reads:
Call:
lm(formula = y ~ 0 + treat1 + treat2 + treat3 + treat4 + contvar,
data = data)
Residuals:
Min 1Q Median 3Q Max
-0.258377 -0.035568 0.007104 0.037690 0.203017
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
treat1FALSE 0.01859 0.01803 1.031 0.305
treat1TRUE -0.02145 0.01881 -1.140 0.257
treat2TRUE -0.09835 0.02006 -4.904 3.88e-06 ***
treat3TRUE 0.09957 0.01985 5.017 2.44e-06 ***
treat4TRUE NA NA NA NA
contvar -0.04053 0.02468 -1.642 0.104
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.07014 on 95 degrees of freedom
Multiple R-squared: 0.5521, Adjusted R-squared: 0.5285
F-statistic: 23.42 on 5 and 95 DF, p-value: 2.917e-15
There is a coefficient for treat1FALSE (unexpected) and for treat1TRUE, but none for treat4TRUE (also unexpected). I understand that estimating coefficients for all four treatments would not work in a model that includes an intercept, but I expected it to work in a zero-intercept model.
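To see where the singularity comes from, here is a small diagnostic sketch on the data frame built above, inspecting the design matrix that lm() constructs from this formula:
# Diagnostic sketch: look at how lm() expands the logical predictors
X <- model.matrix(~ 0 + treat1 + treat2 + treat3 + treat4 + contvar, data = data)
colnames(X)  # "treat1FALSE" "treat1TRUE" "treat2TRUE" "treat3TRUE" "treat4TRUE" "contvar"
ncol(X)      # 6 columns
qr(X)$rank   # should be 5: one column is a linear combination of the others
# Because the treatments are mutually exclusive and exhaustive,
# treat1FALSE == treat2TRUE + treat3TRUE + treat4TRUE, so one column is dropped (NA).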
Just to illustrate, this is what happens if I fit the default model with an intercept:
Call:
lm(formula = y ~ treat1 + treat2 + treat3 + treat4 + contvar,
data = data)
Residuals:
Min 1Q Median 3Q Max
-0.258377 -0.035568 0.007104 0.037690 0.203017
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.01859 0.01803 1.031 0.3049
treat1TRUE -0.04004 0.01987 -2.015 0.0468 *
treat2TRUE -0.09835 0.02006 -4.904 3.88e-06 ***
treat3TRUE 0.09957 0.01985 5.017 2.44e-06 ***
treat4TRUE NA NA NA NA
contvar -0.04053 0.02468 -1.642 0.1039
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.07014 on 95 degrees of freedom
Multiple R-squared: 0.5466, Adjusted R-squared: 0.5275
F-statistic: 28.63 on 4 and 95 DF, p-value: 1.296e-15
The treat1FALSE coefficient in the no-intercept model equals the intercept estimate in the model with an intercept. How should I interpret this?
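The equality can be checked directly (a sketch; model1 is just the name I use here for the with-intercept fit summarised above):
# Sketch: compare the two parameterisations
model1 <- lm(y ~ treat1 + treat2 + treat3 + treat4 + contvar, data = data)
coef(model0)[["treat1FALSE"]]   # 0.01859 in my run
coef(model1)[["(Intercept)"]]   # the same value
all.equal(fitted(model0), fitted(model1))  # TRUE: both fits give identical fitted values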
Finally, if I fit the model manually, by writing the no-intercept linear predictor as a function and minimising the residual sum of squares with optim(), I do get coefficients for all four treatments and for the continuous variable:
lm_manual <- function(b){
  b1 <- b[1]; b2 <- b[2]; b3 <- b[3]; b4 <- b[4]; bcont <- b[5]
  # No-intercept linear predictor
  yhat <- 0 + b1*data$treat1 + b2*data$treat2 + b3*data$treat3 + b4*data$treat4 + bcont*data$contvar
  # Return the residual sum of squares
  sum((yhat - data$y)^2)
}
out <- optim(par = rep(0, 5), fn = lm_manual)
out$par  # coefficients for treatments 1-4 and the continuous variable
This returns: -0.39267220 -0.08900373 0.11117684 0.36312218 -0.02455234
which is what I was expecting to get from lm(y ~ 0 + treat1 + treat2 + treat3 + treat4 + contvar, data = data) as well.
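As a cross-check (a sketch using lm.fit() on a hand-built numeric 0/1 design matrix, so that lm()'s factor expansion of the logicals is bypassed), ordinary least squares gives essentially the same numbers as optim():
# Cross-check sketch: least squares on a numeric design matrix built by hand
X <- cbind(treat1  = as.numeric(data$treat1),
           treat2  = as.numeric(data$treat2),
           treat3  = as.numeric(data$treat3),
           treat4  = as.numeric(data$treat4),
           contvar = data$contvar)
fit <- lm.fit(X, data$y)
fit$coefficients  # should match the optim() estimates up to the optimiser's tolerance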
How can I get such coefficient estimates using lm?