Conditions:
- Assume you have a DRL SAC implementation.
- You train it as usual from a replay buffer with uniform sampling (with replacement).
Assume you change the target of the Q-network from:
T = r + gamma * Q(s', *)
To:
T = (r + gamma * Q(s', *)) / 2
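For concreteness, a minimal sketch of what this change would look like in a typical PyTorch-style SAC critic update (the tensor names, entropy term, and done-flag handling are assumptions on my part, not from any particular implementation):

```python
import torch

# Placeholder batch tensors standing in for replay-buffer samples and
# target-network outputs; all names here are hypothetical.
batch = 256
rewards = torch.randn(batch)        # r
dones = torch.zeros(batch)          # episode-termination flags
next_q = torch.randn(batch)         # min_i Q_target_i(s', a'), a' ~ pi(.|s')
next_log_prob = torch.randn(batch)  # log pi(a' | s')
gamma, alpha = 0.99, 0.2

# Usual SAC target: T = r + gamma * (Q(s', a') - alpha * log pi(a'|s'))
soft_next_value = next_q - alpha * next_log_prob
target_standard = rewards + gamma * (1.0 - dones) * soft_next_value

# Modified target from the question: the whole backup is halved.
target_halved = (rewards + gamma * (1.0 - dones) * soft_next_value) / 2.0
```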
Then:
- To which values will the Q-values now converge? Will they converge to an average reward, as in average-reward RL, or will they diverge?
- I have run some tests and the Q-values do converge to some value, but I don't know which value they are converging to.
- Is there a simple proof that the target T = (r + gamma * Q(s', *)) / 2 does not converge to the average reward?
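In case it helps to pin down the limit numerically, here is a minimal tabular sketch (a toy deterministic MDP of my own, an assumption rather than the DRL setup above) that iterates the halved backup directly. Since (r + gamma * Q) / 2 = r/2 + (gamma/2) * Q, the sketch also compares the result against ordinary value iteration with reward r/2 and discount gamma/2, and against the standard Q-values:

```python
import numpy as np

# Toy deterministic MDP (hypothetical, for illustration only).
n_states, n_actions, gamma = 5, 2, 0.99
rng = np.random.default_rng(0)
P = rng.integers(0, n_states, size=(n_states, n_actions))  # next state for (s, a)
R = rng.random((n_states, n_actions))                      # rewards in [0, 1)

def fixed_point(reward, discount, iters=10_000):
    """Value iteration for Q(s,a) = reward(s,a) + discount * max_a' Q(s',a')."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        Q = reward + discount * Q[P].max(axis=-1)
    return Q

Q_half = fixed_point(R / 2.0, gamma / 2.0)  # backup with reward r/2, discount gamma/2
Q_std = fixed_point(R, gamma)               # standard Q-values, for reference

# Direct iteration of the halved backup from the question.
Q = np.zeros((n_states, n_actions))
for _ in range(10_000):
    Q = (R + gamma * Q[P].max(axis=-1)) / 2.0

print(np.abs(Q - Q_half).max())  # ~0: matches the reward r/2, discount gamma/2 backup
print(np.abs(Q - Q_std).max())   # typically large: not the standard Q-values
```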