I’m trying to understand logistic regression and gradient descent. How hard can it be, right? Well, I used the admissions example data from the UCLA IDRE stats site and fit the reference model with glm:
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
logistic <- glm(admit ~ gre + gpa + factor(rank), data = mydata, family = "binomial")
betahat <- logistic$coefficients
Okay, so betahat is what the gradient descent should converge to. I brought the data into the classic design-matrix format and implemented the gradient of the logit regression (or did I?):
# data in classic format
X <- cbind(1, mydata$gre, mydata$gpa, mydata$rank == 2, mydata$rank == 3, mydata$rank == 4)
y <- mydata$admit

# logit gradient
logistic_cost_gradient <- function(X, y, b) {
  yhat <- 1 / (1 + exp(-X %*% b))
  grad <- as.vector(crossprod(X, (y - yhat))) / nrow(X)
  return(grad)
}
# sanity check: gradient at optimum close to zero
logistic_cost_gradient(X,y,betahat)
# sanity check passed
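To be concrete, “passed” here means that the largest absolute component is numerically negligible:
# collapse the check to a single number: max absolute gradient component at the glm optimum
max(abs(logistic_cost_gradient(X, y, betahat)))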
So let’s try some gradient descent:
b <- rep(0, ncol(X))  # starting value
epoch <- 1            # gradient descent steps
while (epoch < 1000) {
  grad <- logistic_cost_gradient(X, y, b)
  b <- b - grad / epoch  # decreasing step size
  epoch <- epoch + 1
  # print(b)
}
I know a fixed number of iterations is a crude stopping criterion, but it doesn’t seem to matter (a sketch of a less crude rule follows). The resulting values are garbage. What is going wrong?
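This is roughly what I mean by a less crude rule; the 1e-6 tolerance is an arbitrary choice on my part, and swapping it in still ends in garbage:
# same descent, but stop once the gradient norm drops below a tolerance
b <- rep(0, ncol(X))
for (epoch in 1:10000) {
  grad <- logistic_cost_gradient(X, y, b)
  if (sqrt(sum(grad^2)) < 1e-6) break  # gradient ~ 0, treat as converged
  b <- b - grad / epoch                # same decreasing step size as above
}
b  # still nothing like betahat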
My only guess is that it has to do with the scale of the second column of X (gre), which is much larger than all the others. I would expect that if I divide a column by 100, the coefficient for that column has to be multiplied by 100. This does seem to be true:
mydata$gre <- mydata$gre / 100
logistic_rescale <- glm(admit ~ gre + gpa + factor(rank), data = mydata, family = "binomial")
betahat_rescale <- logistic_rescale$coefficients
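Comparing the two fits confirms that expectation (up to glm’s numerical tolerance):
# the gre coefficient scales by the factor 100, everything else is unchanged
betahat_rescale["gre"] / betahat["gre"]  # ~100
betahat_rescale[-2] - betahat[-2]        # ~0 for the remaining coefficients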
Perhaps the gradient descent has issues with the scale of X? No matter what scaling of X I tried, I kept getting garbage results from gradient descent (one of the attempts is sketched below). Any hints? I’m losing my head over this.
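For completeness, one of the rescalings I tried: the same loop, but with gre/100 in the design matrix, so the target is betahat_rescale.
X2 <- X
X2[, 2] <- X2[, 2] / 100  # gre on the rescaled (gre/100) scale
b <- rep(0, ncol(X2))
epoch <- 1
while (epoch < 1000) {
  b <- b - logistic_cost_gradient(X2, y, b) / epoch
  epoch <- epoch + 1
}
b  # nothing like betahat_rescale either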