Masked Query Gradient Flow to Keys and Values
I was wondering why the gradient in this scaled dot-product attention example does not flow to the keys and values. What am I doing wrong? How can I use padded batches with different target sequence lengths? Can I pad the keys/values and the queries together?
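
For reference, here is a minimal sketch of the kind of setup I mean, assuming PyTorch. The shapes, the lengths in `q_lens` and `kv_lens`, and all variable names are made up for illustration and are not from any particular codebase. It pads queries and keys/values in the same batch, masks only the padded keys inside the softmax, and drops the padded query rows from the loss instead of masking them in the attention scores. One thing I am fairly sure of: if a query row has every key masked, the softmax over a row of all `-inf` returns NaN, and that NaN then propagates into the gradients.

```python
import math
import torch

torch.manual_seed(0)

# Hypothetical sizes: batch, max query length, max key/value length, embed dim.
B, L, S, E = 2, 4, 5, 8
q_lens = torch.tensor([4, 2])    # assumed true query lengths per batch item
kv_lens = torch.tensor([5, 3])   # assumed true key/value lengths per batch item

q = torch.randn(B, L, E, requires_grad=True)
k = torch.randn(B, S, E, requires_grad=True)
v = torch.randn(B, S, E, requires_grad=True)

# True marks a real token, False marks padding.
q_valid = torch.arange(L)[None, :] < q_lens[:, None]     # (B, L)
kv_valid = torch.arange(S)[None, :] < kv_lens[:, None]   # (B, S)

# Mask padded keys only. The (B, 1, S) shape broadcasts over the query axis,
# so every query row, including padded ones, still attends to at least one
# real key and the softmax never sees a row of all -inf.
attn_mask = kv_valid[:, None, :]                         # (B, 1, S)

scores = (q @ k.transpose(-2, -1)) / math.sqrt(E)        # (B, L, S)
scores = scores.masked_fill(~attn_mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)                  # padded keys get weight 0
out = weights @ v                                        # (B, L, E)

# Handle padded queries in the loss: zero their outputs so they contribute
# nothing to the gradient.
loss = (out * q_valid[..., None]).sum()
loss.backward()

print("k.grad norm:", k.grad.norm().item())
print("v.grad norm:", v.grad.norm().item())
```

In this sketch `k.grad` and `v.grad` are populated and finite, while the padded key/value positions and the padded query positions receive zero gradient. So padding both sides together seems fine as long as the key padding goes into the attention mask and the query padding is handled at the loss.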