I have the following model that I ran using statsmodels.formula.api:

result = smf.ols('post_sls ~ test + pre_sls', data=df).fit().summary()
post_sls and pre_sls are floats; test is an integer taking a value of 0 or 1.
I am getting the following output:
[Statsmodels output]
The intercept and test coefficients are wrong: I keep getting different results using lm in R, LinearRegression in sklearn, and by calculating the coefficients manually.
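For anyone trying to reproduce this: my actual data isn't shown, but on a synthetic frame with the same column names and dtypes (data generation made up below), statsmodels, sklearn, and the normal equations all agree, which makes the discrepancy stranger:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for df (the real data is not shown here)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    'test': rng.integers(0, 2, n),        # binary indicator, stored as int
    'pre_sls': rng.normal(10.0, 2.0, n),  # float predictor
})
df['post_sls'] = 4.0 + 0.2 * df['test'] + 0.1 * df['pre_sls'] + rng.normal(0.0, 1.0, n)

# statsmodels: read numeric coefficients from .params rather than the summary table
sm_params = smf.ols('post_sls ~ test + pre_sls', data=df).fit().params

# sklearn: intercept and coefficients in the same order (Intercept, test, pre_sls)
lr = LinearRegression().fit(df[['test', 'pre_sls']], df['post_sls'])
sk_params = np.r_[lr.intercept_, lr.coef_]

# normal equations with an explicit intercept column
X = np.column_stack([np.ones(n), df['test'], df['pre_sls']])
beta = np.linalg.solve(X.T @ X, X.T @ df['post_sls'].to_numpy())

print(np.allclose(sm_params.to_numpy(), beta))  # True on this synthetic data
print(np.allclose(sk_params, beta))             # True on this synthetic data
```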
**R code and output:**
model <- lm(post_sls ~ test + pre_sls, data = df)
result <- summary(model)
[R output]
**Sklearn code and output:**
from sklearn.linear_model import LinearRegression
X = df[['test', 'pre_sls']]
y = df['post_sls']
model = LinearRegression().fit(X, y)
print(f'Intercept: {model.intercept_}')
print(f'Coefficients: {model.coef_}')
print(f'R^2: {model.score(X, y)}')
Intercept: 4.128324040176458
Coefficients: [0.18193744 0.0978311 ]
R^2: 0.06522976694251192
**Manual calculation code and output:**
import numpy as np
import patsy

formula = 'post_sls ~ test + pre_sls'
y, X = patsy.dmatrices(formula, df)
# normal equations: beta = (X'X)^-1 X'y
beta = np.linalg.inv(X.T @ X) @ (X.T @ y)
print(beta)
[[4.12832404]
[0.18193744]
[0.0978311 ]]
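(Side note: forming the explicit inverse of X'X is numerically fragile; np.linalg.lstsq solves the same least-squares problem via a more stable SVD-based route. A sketch on made-up data with the same column layout:)

```python
import numpy as np
import pandas as pd

# Made-up stand-in for df (real data not shown)
rng = np.random.default_rng(1)
n = 100
df = pd.DataFrame({
    'test': rng.integers(0, 2, n),
    'pre_sls': rng.normal(size=n),
})
df['post_sls'] = 4.0 + 0.2 * df['test'] + 0.1 * df['pre_sls'] + rng.normal(size=n)

# Same design matrix patsy builds: intercept column, then test, then pre_sls
X = np.column_stack([np.ones(n), df['test'], df['pre_sls']])
y = df['post_sls'].to_numpy()

beta_inv = np.linalg.inv(X.T @ X) @ (X.T @ y)       # normal equations
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # stable least-squares solver

print(np.allclose(beta_inv, beta_lstsq))  # True: same solution either way
```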
As you can see, all of the methods deliver the same coefficients except statsmodels. Interestingly, the sum of the intercept coefficient and the test coefficient is the same across models, but statsmodels somehow overestimates the intercept and underestimates the test coefficient.
Does anyone have an idea of what is going wrong here?
Thanks,
Alona