I created a neural network with 10 outputs, each finally activated by the leaky ReLU function. The cost function I used for this network was the mean squared error multiplied by 0.5 to make the derivative simple: 0.5 * (expected output – network output)^2, whose derivative with respect to the network output is (network output – expected output). Please note the network I created trains on one-hot encoded label values. So the 10 outputs represent 10 classes, and each output should be a 1 if that class is identified and a 0 if it is not.
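Just to make the setup concrete, here is a minimal NumPy sketch of what I mean (the leaky ReLU slope of 0.01, the random pre-activations and the chosen target class are my own illustrative values, not anything special about my network):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # Leaky ReLU applied to the 10 output units (alpha is just an illustrative choice)
    return np.where(z > 0, z, alpha * z)

def half_mse(output, target):
    # 0.5 * (expected output - network output)^2, summed over the 10 classes
    return 0.5 * np.sum((target - output) ** 2)

def half_mse_grad(output, target):
    # Derivative of the above with respect to the network output
    return output - target

# One-hot target for class 3 out of 10, and some made-up pre-activations
target = np.zeros(10)
target[3] = 1.0
z = np.random.randn(10)
output = leaky_relu(z)

print(half_mse(output, target))       # single number describing performance
print(half_mse_grad(output, target))  # error signal sent back through the network
```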
Now I've been reading up on and trying to understand the cross-entropy loss function, and the information on this is a bit confusing, because the derivative of the cross-entropy loss (at least when the output goes through a softmax, as in the examples I've read) works out to essentially the same thing as the derivative of the 0.5 * mean squared error function: (network output – expected output). This eventually led me to understand, maybe incorrectly, that the cross-entropy loss function is perhaps the theoretical derivation showing, from information theory, that the best way to send error information back through the network is to use (network output – expected output). Almost like a proof of where this comes from? Other than that, I don't see where the actual value calculated by the cross-entropy loss function is used, other than giving a single number indicating how the network is performing.
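This is the comparison I mean, written out as a small sketch (the softmax output layer is my assumption here, since that's what the cross-entropy examples I've read use, and the finite-difference check is just to convince myself the gradient really is the same form):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def cross_entropy(z, target):
    # -sum(target * log(softmax(z))) for a one-hot target
    return -np.sum(target * np.log(softmax(z)))

# One-hot target and arbitrary logits
target = np.zeros(10)
target[3] = 1.0
z = np.random.randn(10)

# Analytic gradient of cross-entropy with respect to the logits: softmax(z) - target,
# i.e. the same (network output - expected output) form as the half-MSE derivative
analytic = softmax(z) - target

# Numerical check by central finite differences
eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * np.eye(10)[i], target)
     - cross_entropy(z - eps * np.eye(10)[i], target)) / (2 * eps)
    for i in range(10)
])

print(np.allclose(analytic, numeric, atol=1e-4))  # True
```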
I have asked ChatGPT about this, but it is convinced the cross-entropy value itself is used to inform how the weights should change, yet it fails to show me where and how.