In Transformer models, variable-length sequences are typically padded to the longest sequence in a batch. However, my sequence lengths vary significantly, so a batch can contain a substantial amount of padding (potentially over 50% of the tokens).
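To make the setup concrete, here is a minimal sketch of what I am doing (the model size, sequence lengths, and number of layers are just placeholders):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

# Toy batch with very different sequence lengths (sizes are placeholders).
d_model = 64
seqs = [torch.randn(length, d_model) for length in (5, 12, 50)]
lengths = torch.tensor([s.size(0) for s in seqs])

# Pad every sequence to the longest one in the batch.
src = pad_sequence(seqs, batch_first=True)                # (batch, max_len, d_model)

# True marks padding positions that attention should ignore.
pad_mask = torch.arange(src.size(1))[None, :] >= lengths[:, None]  # (batch, max_len)

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

out = encoder(src, src_key_padding_mask=pad_mask)         # (batch, max_len, d_model)
```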
I am curious about the following:
When PyTorch runs the Transformer forward pass, do the padding tokens slow down the computation?
Does passing the attention mask let the model effectively skip the padding tokens, so the performance impact stays minimal?
Overall, how much compute does the attention mask actually save? If I have a sparse attention mask with only 10% non-zero entries, does the cost drop to roughly 10% of the dense computation?
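For concreteness, this is the kind of sparse mask I have in mind (the shapes and the 10% density are arbitrary, and this sketch uses `F.scaled_dot_product_attention` directly rather than the `nn.Transformer` modules):

```python
import torch
import torch.nn.functional as F

# Hypothetical example of a mask where only ~10% of entries allow attention.
B, H, L, d = 2, 8, 128, 64
q = torch.randn(B, H, L, d)
k = torch.randn(B, H, L, d)
v = torch.randn(B, H, L, d)

# Boolean mask: True = this query/key pair may attend (roughly 10% non-zero).
attn_mask = torch.rand(B, 1, L, L) < 0.10
attn_mask |= torch.eye(L, dtype=torch.bool)  # ensure every query attends to at least itself

out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)  # (B, H, L, d)
```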
Thank you for your insights!