I’m trying to replace the text encoder of Stable Diffusion with the corresponding image encoder, so that I can condition on images instead of text. The Stable Diffusion documentation on Hugging Face says the model uses the pretrained text encoder from OpenCLIP ViT-H/14. Since the text encoder and image encoder of CLIP share the same latent space, my assumption is that I can simply replace the text encoder with the image encoder and the model should work fine without any further training.
However, the text embeddings I get from the Stable Diffusion text encoder and from the OpenCLIP ViT-H/14 text encoder are different.
I get the embeddings below from the Stable Diffusion text encoder:

```python
import torch
from diffusers import StableDiffusionPipeline

device = 'cuda'
prompt = 'dress, long sleeve'
model_key = "./models--stabilityai--stable-diffusion-2-1-base/"
pipe = StableDiffusionPipeline.from_pretrained(model_key, torch_dtype=torch.float16).to(device)
tokenizer = pipe.tokenizer
text_encoder = pipe.text_encoder

# Tokenize and encode the prompt the same way the pipeline does internally
inputs = tokenizer(prompt, padding='max_length', max_length=tokenizer.model_max_length, return_tensors='pt')
embeddings = text_encoder(inputs.input_ids.to(device))[0]  # last_hidden_state, shape (1, 77, 1024)
```
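For context, the reason I care about this exact tensor is that recent diffusers versions let you pass it back into the pipeline directly via the `prompt_embeds` argument, bypassing the prompt string; that is the slot I ultimately want to fill with image-derived embeddings. A sketch only, reusing `pipe` and `embeddings` from above (the output filename is just a placeholder):

```python
# Feed the precomputed (1, 77, 1024) text embeddings back into the pipeline
# instead of a prompt string; this is where I would like to plug in an
# image embedding instead.
image_out = pipe(prompt_embeds=embeddings).images[0]
image_out.save('from_text_embeddings.png')  # placeholder output path
```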
I get the text embeddings below using the OpenCLIP text encoder:

```python
import open_clip

# Load the OpenCLIP ViT-H/14 model and its tokenizer
model, _, preprocess = open_clip.create_model_and_transforms('ViT-H-14', pretrained='laion2b_s32b_b79k')
model.eval()
tokenizer = open_clip.get_tokenizer('ViT-H-14')

text = tokenizer([prompt])
with torch.no_grad():
    text_features = model.encode_text(text)  # pooled, projected feature, shape (1, 1024)
```
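The shared latent space I was referring to is the one these pooled features live in, i.e. the space used for the usual CLIP image-text similarity. A minimal sketch of that comparison, reusing `model`, `preprocess`, and `text_features` from the snippet above (`dress.jpg` is just a placeholder image path):

```python
from PIL import Image
import torch.nn.functional as F

# Encode an image with the matching OpenCLIP image tower; the pooled image and
# text features live in the same 1024-d space ('dress.jpg' is a placeholder path).
image = preprocess(Image.open('dress.jpg')).unsqueeze(0)
with torch.no_grad():
    image_features = model.encode_image(image)  # shape (1, 1024)
print(F.cosine_similarity(image_features, text_features))
```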
A main difference is that `embeddings` from the Stable Diffusion text encoder has shape (1, 77, 1024), whereas `text_features` from the OpenCLIP text encoder has shape (1, 1024).
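For what it's worth, the Hugging Face text encoder (a `transformers` `CLIPTextModel`) also returns a pooled (1, 1024) output alongside the per-token hidden states; I assume that pooled output is the EOS-token hidden state before CLIP's text projection, so it probably still won't match `text_features`, but it shows where the shape difference comes from (reusing `text_encoder` and `inputs` from above):

```python
# The SD text encoder returns per-token hidden states plus a pooled output;
# the pooled output is (I believe) the EOS-token state before CLIP's text
# projection, so it has the same shape as text_features but not the same space.
out = text_encoder(inputs.input_ids.to(device))
print(out.last_hidden_state.shape)  # torch.Size([1, 77, 1024]) - what the pipeline feeds to the UNet
print(out.pooler_output.shape)      # torch.Size([1, 1024])
```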
I have two questions:
- Which text encoder from OpenCLIP should I use to get the same embeddings as the Stable Diffusion text encoder?
- Which image encoder corresponds to the text encoder in Stable Diffusion, i.e., which image encoder shares the same latent space as that text encoder?