I’m hoping to use CLIP to get a single embedding for rows of multimodal (image and text) data.
Say I have the following model:
<code>import torch
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def convert_image_data_to_tensor(image_data):
    return torch.tensor(image_data)

# df is a pandas DataFrame with raw image data and text per row
dataset = df[['image_data', 'text_data']].to_dict('records')

embeddings = []
for data in dataset:
    image_tensor = convert_image_data_to_tensor(data['image_data'])
    text = data['text_data']
    # return_tensors takes a framework string ("pt" for PyTorch), not a bool
    inputs = processor(text=text, images=image_tensor, return_tensors="pt", padding=True)
    with torch.no_grad():
        output = model(**inputs)
</code>
I want to get the embeddings calculated in <code>output</code>. I know that <code>output</code> has the attributes <code>text_embeds</code> and <code>image_embeds</code>, but I’m not sure how they interact later on. If I want to get a single embedding for each record, should I just be concatenating these attributes together (see the sketch below the attribute list), or is there another attribute that combines the two in some other way?
These are the attributes stored in <code>output</code>:
<code>print(dir(output))
['__annotations__', '__class__', '__contains__', '__dataclass_fields__', '__dataclass_params__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__post_init__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'clear', 'copy', 'fromkeys', 'get', 'image_embeds', 'items', 'keys', 'logits_per_image', 'logits_per_text', 'loss', 'move_to_end', 'pop', 'popitem', 'setdefault', 'text_embeds', 'text_model_output', 'to_tuple', 'update', 'values', 'vision_model_output']
</code>
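For concreteness, here is the kind of combination I have in mind (a minimal sketch, not a claim that either option is correct; whether concatenating, averaging, or something else entirely is sensible is exactly what I’m asking). If I understand correctly, both attributes are shape (batch, 512) for this checkpoint:

<code># both tensors come from the model output above
text_emb = output.text_embeds    # (batch, 512)
image_emb = output.image_embeds  # (batch, 512)

# option 1: concatenate along the feature dimension -> (batch, 1024)
combined_cat = torch.cat([text_emb, image_emb], dim=-1)

# option 2: average the two, since they live in the same joint space -> (batch, 512)
combined_mean = (text_emb + image_emb) / 2
</code>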
Also, is there a way to specify the size of the embedding that CLIP outputs? Similar to how you can specify the embedding size in BERT configs?
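For reference, this is the kind of config change I mean (a sketch only; I’m assuming <code>projection_dim</code> in <code>CLIPConfig</code> is the relevant knob, and I realize a model built this way would be randomly initialized rather than loaded from the pretrained checkpoint):

<code>from transformers import CLIPConfig, CLIPModel

# randomly initialized CLIP whose joint embedding space is 256-dim
config = CLIPConfig(projection_dim=256)
model = CLIPModel(config)
</code>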
Thanks in advance for any help, and feel free to correct me if I’m misunderstanding anything critical here.