In the Transformer architecture introduced in the "Attention Is All You Need" paper published by Google Research, the Positional Encoding section uses a combination of sine and cosine functions for the positional embeddings.
My question is: wouldn't using only the sine function (or only the cosine function) be sufficient, since each dimension already has a different frequency? What exactly does adding the cosine function contribute to the formula?
$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right)$$
$$PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$
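For concreteness, here is a minimal sketch of how such an encoding matrix could be computed (assuming `d` is even and using NumPy; the function and variable names are just for illustration, not from the paper):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d):
    """Build a (max_len, d) matrix of sinusoidal positional encodings.

    Even columns hold sin(pos / 10000^(2i/d)),
    odd columns hold cos(pos / 10000^(2i/d)).
    """
    positions = np.arange(max_len)[:, np.newaxis]        # shape (max_len, 1)
    dims = np.arange(0, d, 2)[np.newaxis, :]             # even indices 2i
    angle_rates = 1.0 / np.power(10000.0, dims / d)      # 1 / 10000^(2i/d)
    angles = positions * angle_rates                      # shape (max_len, d/2)

    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)   # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)   # PE(pos, 2i+1)
    return pe

# Example: encodings for 50 positions in a 128-dimensional model
pe = sinusoidal_positional_encoding(50, 128)
print(pe.shape)  # (50, 128)
```

Note how each frequency gets a sine/cosine pair occupying two adjacent dimensions; my question is about why that pairing is needed at all.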
You can refer to this article:
Transformer Architecture: The Positional Encoding