Replacing the Stable Diffusion v2.1 text encoder with an image encoder
I’m trying to replace the text encoder of Stable Diffusion with the corresponding image encoder, so that I can condition the model on images instead of text. The Stable Diffusion Hugging Face documentation says that v2.1 uses the pretrained text encoder from the OpenCLIP ViT-H model. Since the text and image encoders of CLIP are trained to share the same latent space, my assumption is that I can simply swap the text encoder for the image encoder and the model should work without any further training. Is that correct, or am I missing something?
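One wrinkle worth checking before attempting the swap: the UNet in Stable Diffusion cross-attends over the text encoder's *per-token hidden states*, not its pooled embedding, whereas CLIP's image tower produces a single projected vector per image. The sketch below (plain PyTorch, with dimensions assumed to match OpenCLIP ViT-H as used by SD 2.1: 77 tokens, hidden size 1024) illustrates the shape mismatch a naive substitution would have to bridge, e.g. by treating the pooled image embedding as a length-1 token sequence.

```python
import torch

# Assumed dimensions for OpenCLIP ViT-H as used by SD 2.1 (hypothetical
# illustration, not taken from the model checkpoints themselves):
batch = 2
seq_len, dim = 77, 1024

# What the UNet's cross-attention layers consume: per-token hidden states.
text_hidden_states = torch.randn(batch, seq_len, dim)

# What CLIP's image encoder yields after projection: one vector per image.
image_embedding = torch.randn(batch, dim)

# A naive substitution must at least match the expected rank, e.g. by
# treating the pooled image embedding as a sequence of one "token".
image_hidden_states = image_embedding.unsqueeze(1)  # shape: (batch, 1, dim)

print(text_hidden_states.shape)
print(image_hidden_states.shape)
```

Even with the shapes reconciled, note that the shared CLIP latent space is the *projected, pooled* space used for the contrastive loss; the token-level states the UNet was trained against live in a different space, so some adaptation or fine-tuning may still be needed.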