I would like to use Einstein notation to compute batched dot-product multi-head attention. I have matrices Q, K and V that all have the dimensions (batch, position, n_heads, dim_head). How can I compute the attention scores QKᵀ, and how do I then multiply them with V to get the final attention output?
I would be thankful for any pointers!
I tried to compute the attention scores with the following numpy einsum expression:
a = np.einsum('bqnd,bknd->bnqk', q, k)
where:
- b is the batch,
- q is the position in the q matrix (and k, likewise, the position in the k matrix),
- n is the number of heads,
- d is the dimension of the heads.
From here, I am a bit at a loss on how to compute the weighted sums of V and then how to combine all the heads.
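For context, here is a minimal, runnable sketch of what I think the full computation could look like. The 1/sqrt(dim_head) scaling, the softmax over the key axis, and the final reshape that concatenates the heads are my assumptions based on standard scaled dot-product attention, and I am not sure the last einsum and the reshape are the right way to express it:

```python
import numpy as np

# hypothetical shapes, just for illustration
batch, seq_len, n_heads, dim_head = 2, 5, 4, 8
rng = np.random.default_rng(0)
q = rng.standard_normal((batch, seq_len, n_heads, dim_head))
k = rng.standard_normal((batch, seq_len, n_heads, dim_head))
v = rng.standard_normal((batch, seq_len, n_heads, dim_head))

# attention scores, shape (batch, n_heads, q_pos, k_pos),
# scaled by 1/sqrt(dim_head) as in scaled dot-product attention (my assumption)
scores = np.einsum('bqnd,bknd->bnqk', q, k) / np.sqrt(dim_head)

# softmax over the key positions (last axis)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# weighted sum of V, back to shape (batch, q_pos, n_heads, dim_head)
out = np.einsum('bnqk,bknd->bqnd', weights, v)

# concatenate the heads: (batch, q_pos, n_heads * dim_head)
out = out.reshape(batch, seq_len, n_heads * dim_head)
print(out.shape)  # (2, 5, 32)
```

Is this the idiomatic way to express the weighted sum and the head concatenation with einsum, or is there a cleaner formulation?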