I have a question about when to use positional embeddings in Transformers. I want to build a cross-attention layer with m queries and n keys. More specifically, I reduce the image to a 512x7x7 feature map with a ResNet and use the 49 spatial vectors as keys. In this case, where should I add positional embeddings: to the queries, the keys, both, or neither?
I first tried training without any positional embeddings, and then tried adding positional embeddings only to the keys. Due to capacity constraints I only ran about 20 epochs on 1000 images, and I found no significant difference between the two. A sketch of the second setup is shown below.
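For reference, here is a minimal sketch of the key-only variant (PyTorch assumed; names like `CrossAttention` and the use of `nn.MultiheadAttention` with a learned positional embedding are illustrative, not my exact code):

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_keys=49):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # One learned positional embedding per spatial location (7 * 7 = 49)
        self.key_pos = nn.Parameter(torch.randn(1, n_keys, d_model) * 0.02)

    def forward(self, queries, feat_map):
        # queries:  (B, m, 512)    task/query vectors
        # feat_map: (B, 512, 7, 7) ResNet output
        keys = feat_map.flatten(2).transpose(1, 2)  # (B, 49, 512)
        keys = keys + self.key_pos                  # positional info added only to keys
        out, _ = self.attn(query=queries, key=keys, value=keys)
        return out

# Example usage
layer = CrossAttention()
q = torch.randn(2, 10, 512)        # m = 10 queries
f = torch.randn(2, 512, 7, 7)      # ResNet feature map
print(layer(q, f).shape)           # torch.Size([2, 10, 512])
```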