I’m trying to fit a Tweedie regression in statsmodels. The regression has three categorical predictors, each with four levels. To illustrate, here is an example:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
data = pd.DataFrame({
    'V1': pd.Categorical(['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D']),
    'V2': pd.Categorical(['W', 'X', 'Y', 'Z', 'W', 'X', 'Y', 'Z']),
    'V3': pd.Categorical(['K', 'L', 'M', 'N', 'K', 'L', 'M', 'N']),
    'y': [5.1, 7.3, 6.9, 8.0, 5.4, 7.1, 6.8, 8.2]
})
formula = "y ~ C(V1, Treatment('A')) + C(V2, Treatment('W')):C(V3, Treatment('K'))"
model = smf.glm(formula, data=data, family=sm.families.Tweedie())
result = model.fit()
print(result.summary())
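If it helps, the design-matrix columns can also be inspected directly with patsy (which, as far as I understand, is the formula engine statsmodels uses here), e.g.:

import patsy
# Print the column names patsy builds for this formula; the interaction block
# is where the terms involving the reference levels ('W', 'K') appear.
design = patsy.dmatrix(
    "C(V1, Treatment('A')) + C(V2, Treatment('W')):C(V3, Treatment('K'))",
    data
)
print(design.design_info.column_names)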
I used to do this kind of regression in SAS, and SAS does not return interaction terms involving the reference levels. For example, in this case, SAS does not include any interaction term involving V3 = 'K'. Here is the analogous code in SAS:
data example;
input V1 $ V2 $ V3 $ y;
datalines;
A W K 5.1
B X L 7.3
C Y M 6.9
D Z N 8.0
A W K 5.4
B X L 7.1
C Y M 6.8
D Z N 8.2
;
run;
proc hpgenselect data=example;
class V1 (ref='A') V2 (ref='W') V3 (ref='K');
model y = V1 V2*V3 / dist=tweedie link=log;
run;
However, in statsmodels, these interaction terms are included. Does anyone know why this happens, and how to get something similar to SAS (i.e., without the interaction terms involving the reference levels)?
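For what it's worth, the only workaround I can think of is to build the reference-coded dummies and the V2*V3 interaction columns by hand (dropping everything involving the reference levels 'W' and 'K') and then fit the GLM on that design matrix. A rough sketch of what I mean (the column names are just illustrative):

# Manual reference coding: drop the reference-level dummy from each factor.
v1 = pd.get_dummies(data['V1'], prefix='V1').drop(columns='V1_A')
v2 = pd.get_dummies(data['V2'], prefix='V2').drop(columns='V2_W')
v3 = pd.get_dummies(data['V3'], prefix='V3').drop(columns='V3_K')

# Interaction columns as products of the non-reference dummies only.
interaction = pd.DataFrame(
    {f'{a}:{b}': v2[a] * v3[b] for a in v2.columns for b in v3.columns},
    index=data.index
)

X = sm.add_constant(pd.concat([v1, interaction], axis=1)).astype(float)
manual_fit = sm.GLM(data['y'], X, family=sm.families.Tweedie()).fit()
print(manual_fit.summary())

But I'd prefer a formula-based solution if statsmodels/patsy supports one.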