I tried to implement a DQN model myself. From what I read in the papers, the target the network should be trained toward is identical to its current output, except for the Q-value of the action that was taken. I implemented it like this:
public override Mat CalculateDesiredOutput(Experience et)
{
    Mat qValuesCurrent, qValuesNext, qTarget;

    // Feed-forward the current state through the policy network to get its Q-values
    qValuesCurrent = FeedForward(et.state);

    // Initialize the target with the current Q-values, so the derivative of the MSE
    // is zero for every action except the one that was taken
    qTarget = new Mat(qValuesCurrent);

    // Check whether the episode ended on this transition
    if (et.isTerminal)
    {
        // If so, the target is just the reward:
        // Q(s, a) = r
        qTarget[et.action, 0] = et.reward;
    }
    else
    {
        // Otherwise, update the Q-value with the Bellman equation:
        // Q(s, a) = r + gamma * max(Q(s', a'))
        // The next state's Q-values come from the target network
        qValuesNext = this.targetNetwork.FeedForward(et.nextState);
        double maxQNext = qValuesNext.Max();
        double updatedQ = et.reward + gamma * Math.Min(maxQNext, 100.0);
        qTarget[et.action, 0] = updatedQ;
    }

    // Advance the step counter
    currentStep++;

    // Once the update rate is reached, copy the policy network into the target network
    if (currentStep == this.updateRate)
    {
        this.targetNetwork = this.Clone();
        this.currentStep = 0;
    }

    return qTarget;
}
But the gradients computed this way tend to infinity, because all but one of them are 0. How do I fix this?
I don't exactly understand the loss function of DQN.
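As I understand it, the loss should just be the mean squared error between the predicted Q-values and this target, so the error (and the output gradient) for every action other than the one taken should come out as exactly zero. Here is a minimal sketch of how I would compute that gradient, using plain double arrays instead of my Mat class (MseGradient is just an illustrative name, not part of my code):

static double[] MseGradient(double[] qPredicted, double[] qTarget)
{
    int n = qPredicted.Length;
    var grad = new double[n];
    for (int i = 0; i < n; i++)
    {
        // Derivative of the mean squared error w.r.t. each output:
        // dL/dq_i = 2 * (q_i - t_i) / n
        // Because qTarget was copied from qPredicted for every action except
        // the one that was taken, this is exactly 0 for all other actions.
        grad[i] = 2.0 * (qPredicted[i] - qTarget[i]) / n;
    }
    return grad;
}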