So I’m not an expert in programming or in statistics. I have a basic understanding, and I am running into a problem that makes no sense to me. I have been trying to fix it for 3 hours straight and don’t see how anyone is supposed to solve it. I have a dataset of loans that either defaulted or did not default. I want to predict probabilities of default (PD) for loans using three variables: EmploymentStatus, IsBorrowerHomeowner and CurrentCreditLines.
Here is the thing: if I convert the EmploymentStatus variable to a factor, it has 8 levels. I have checked the data and this works:
data$EmploymentStatus = as.factor(data$EmploymentStatus)
When I then fit a logit model so that I can predict log-odds, everything seems to work:
logit_model = glm(LoanStatus ~ EmploymentStatus + CurrentCreditLines + IsBorrowerHomeowner, data=train_data, family=binomial)
summary(logit_model)
The problem is when I then want to predict log-odds. I get an error:
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor 'EmploymentStatus' has new levels Not available
(My R session is in German, so I have translated the message. The level it complains about is “Not available”.)
Note that this happens in-sample, so there should be no new levels. I was baffled, and neither ChatGPT nor Google helped. After hours of scratching my head I noticed that the glm model does not list all 8 levels. My guess is that this is because one level is used as the baseline. When the model then tries to predict, it seems to be unaware that it dropped one level as the baseline and just returns this error.
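One way to test this guess is to compare the levels the fitted model stored (the `xlevels` component of a glm object, which does include the baseline) with the levels in the data passed to `predict()`. The sketch below uses made-up toy data, not the real loan data: the values, the level names other than "Not available", and the idea that rows with that level are dropped because of a missing predictor are all assumptions for illustration.

```r
# Toy data, NOT the real loan data: column names mirror the post, values are made up.
set.seed(1)
n <- 150
toy <- data.frame(
  LoanStatus         = rbinom(n, 1, 0.3),
  EmploymentStatus   = factor(rep(c("Employed", "Self-employed", "Not available"),
                                  length.out = n)),
  CurrentCreditLines = rpois(n, 8)
)

# Assumption for illustration: every "Not available" row lacks a predictor value,
# so glm() drops those rows (default na.action) and never sees that level.
toy$CurrentCreditLines[toy$EmploymentStatus == "Not available"] <- NA

fit <- glm(LoanStatus ~ EmploymentStatus + CurrentCreditLines,
           data = toy, family = binomial)

# Levels the model actually saw while fitting (baseline level included)
model_levels <- fit$xlevels$EmploymentStatus
# Levels present in the data handed to predict()
data_levels <- levels(toy$EmploymentStatus)
# Levels that predict() would treat as "new"
missing_levels <- setdiff(data_levels, model_levels)
print(model_levels)
print(missing_levels)

# Predicting on the full toy data triggers the "new levels" error;
# tryCatch() captures it so the script keeps running.
pred <- tryCatch(predict(fit, newdata = toy, type = "link"),
                 error = function(e) e)
print(inherits(pred, "error"))
```

If the same comparison on the real `logit_model` and the prediction data shows the baseline inside `xlevels` and "Not available" outside it, the baseline is probably not the culprit. In that case the level most likely never reached the fit, for example because `train_data` contains no such rows or because those rows have missing values in another variable.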
I have absolutely no idea how to handle this properly. Any help is greatly appreciated.