I’m hoping to use CLIP to get a single embedding for rows of multimodal (image and text) data.
Say I have the following model:
<code>import torch
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def convert_image_data_to_tensor(image_data):
    return torch.tensor(image_data)

# df is a pandas DataFrame with raw image data and text per row
dataset = df[['image_data', 'text_data']].to_dict('records')

embeddings = []
for data in dataset:
    image_tensor = convert_image_data_to_tensor(data['image_data'])
    text = data['text_data']
    # return_tensors takes a framework string ("pt" for PyTorch), not a bool
    inputs = processor(text=text, images=image_tensor, return_tensors="pt", padding=True)
    with torch.no_grad():
        output = model(**inputs)
</code>
I want to get the embeddings calculated in <code>output</code>. I know that <code>output</code> has the attributes <code>text_embeds</code> and <code>image_embeds</code>, but I’m not sure how they interact later on. If I want to get a single embedding for each record, should I just be concatenating these attributes together (see the sketch below the attribute list), or is there another attribute that combines the two in some other way?
These are the attributes stored in <code>output</code>:
<code>print(dir(output))
['__annotations__', '__class__', '__contains__', '__dataclass_fields__', '__dataclass_params__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__post_init__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'clear', 'copy', 'fromkeys', 'get', 'image_embeds', 'items', 'keys', 'logits_per_image', 'logits_per_text', 'loss', 'move_to_end', 'pop', 'popitem', 'setdefault', 'text_embeds', 'text_model_output', 'to_tuple', 'update', 'values', 'vision_model_output']
</code>
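For concreteness, here is the kind of combination I have in mind (a minimal sketch, not a claim that either option is correct; whether concatenating, averaging, or something else entirely is sensible is exactly what I’m asking). If I understand correctly, both attributes are shape (batch, 512) for this checkpoint:

<code># both tensors come from the model output above
text_emb = output.text_embeds    # (batch, 512)
image_emb = output.image_embeds  # (batch, 512)

# option 1: concatenate along the feature dimension -> (batch, 1024)
combined_cat = torch.cat([text_emb, image_emb], dim=-1)

# option 2: average the two, since they live in the same joint space -> (batch, 512)
combined_mean = (text_emb + image_emb) / 2
</code>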
Also, is there a way to specify the size of the embedding that CLIP outputs? Similar to how you can specify the embedding size in BERT configs?
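For reference, this is the kind of config change I mean (a sketch only; I’m assuming <code>projection_dim</code> in <code>CLIPConfig</code> is the relevant knob, and I realize a model built this way would be randomly initialized rather than loaded from the pretrained checkpoint):

<code>from transformers import CLIPConfig, CLIPModel

# randomly initialized CLIP whose joint embedding space is 256-dim
config = CLIPConfig(projection_dim=256)
model = CLIPModel(config)
</code>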
Thanks in advance for any help, and feel free to correct me if I’m misunderstanding anything critical here.