I am currently working on an assignment about LSTMs and want readers to understand why we use them at all. I can explain why the vanishing/exploding gradient problem occurs with plain RNNs. But after working through the math of BPTT for LSTMs, I am left with a term that multiplies the values of the forget gate many times over.
Our argument for the V/EGP in RNNs is that multiplying the derivative of the activation function by itself many times gives something very small (VGP), while repeated multiplication by the weight matrix can blow up very quickly (EGP).
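To make that concrete (using the usual notation, which I am assuming here): for a vanilla RNN with hidden state $h_t = \tanh(W h_{t-1} + U x_t)$, the BPTT factor between distant time steps is roughly

$$\frac{\partial h_T}{\partial h_t} \;=\; \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}} \;=\; \prod_{k=t+1}^{T} \operatorname{diag}\!\big(\tanh'(a_k)\big)\, W, \qquad a_k = W h_{k-1} + U x_k,$$

so the product contains $T - t$ copies of $\tanh'(\cdot) \le 1$ (vanishing) and $T - t$ copies of $W$ (potentially exploding).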
LSTMs, however, multiply the forget gate many times, and the forget gate can only take values between 0 and 1.
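The term I end up with (if I ignore the dependence of the gates themselves on $c_{k-1}$, which I am not sure is legitimate) looks like

$$\frac{\partial c_T}{\partial c_t} \;\approx\; \prod_{k=t+1}^{T} \operatorname{diag}(f_k), \qquad c_k = f_k \odot c_{k-1} + i_k \odot \tilde{c}_k,$$

with every $f_k \in (0, 1)$, i.e. again a long product of factors that are at most 1.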
Why are LSTMs still immune to VGP?