Suppose we have an image of size torch.Size([1, 3, 336, 336]) and encode it with CLIP, obtaining a feature map of size torch.Size([1, 577, 1024]). How can we recover the original image from this latent feature map?
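For reference, here is roughly how I obtain that feature map. This is a minimal sketch assuming the openai/clip-vit-large-patch14-336 checkpoint (336x336 input, patch size 14, so 24*24 = 576 patch tokens plus 1 CLS token = 577, hidden dim 1024); "example.jpg" is just a placeholder input:

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

name = "openai/clip-vit-large-patch14-336"  # assumed encoder matching [1, 577, 1024]
processor = CLIPImageProcessor.from_pretrained(name)
model = CLIPVisionModel.from_pretrained(name).eval()

image = Image.open("example.jpg")  # placeholder image path
inputs = processor(images=image, return_tensors="pt")  # pixel_values: [1, 3, 336, 336]

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # torch.Size([1, 577, 1024]) - CLS + 576 patch tokens
print(outputs.pooler_output.shape)      # torch.Size([1, 1024]) - derived from the CLS token only
```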
I've tried StabilityAI/stable-diffusion-2-1-unclip, which is fine-tuned from SD2 to accept image embeddings. However, I found that it only takes the CLS token and ignores the others. Is there any way to fully utilize the whole embedding, or does SD2 need to be fine-tuned for that?
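This is what I tried, roughly. If I read the pipeline correctly, its bundled image encoder (a CLIPVisionModelWithProjection, CLIP ViT-H/14 rather than the ViT-L/14-336 above) produces a single projected pooled (CLS) vector per image, so the 576 patch tokens never reach the UNet:

```python
import torch
from PIL import Image
from diffusers import StableUnCLIPImg2ImgPipeline

pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16
).to("cuda")

image = Image.open("example.jpg")  # placeholder image path
# Internally the pipeline encodes `image` to one pooled embedding per image
# and conditions the UNet on that single vector; patch tokens are discarded.
reconstructed = pipe(image).images[0]
reconstructed.save("reconstructed.jpg")
```

So the reconstruction only reflects whatever global information survives in the pooled vector, not the full 577x1024 feature map.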