What kind of attention is used in the diffusion model, specifically in the U-Net?
I’m reviewing the Grad-TTS code released with the paper, and I’m a bit confused about the type of attention used in its U-Net. Could someone help me identify what kind of attention this is and, if possible, point me to some references for it?