How does CLIP handle the image features and text features for comparison?
I am learning about CLIP and have read the following resources:

- arxiv
- openai/clip-vit-large-patch14
- huggingface-clip-api
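What I am unsure about is the comparison step itself: once the image encoder and text encoder have produced their features, how does CLIP turn them into a similarity score? Below is a minimal sketch of my current understanding, written against the public Hugging Face `CLIPModel` API (the image URL and the two captions are just placeholders I picked for testing). As far as I can tell, both towers project their outputs into a shared embedding space, the embeddings are L2-normalized, and the score is a dot product scaled by the learned `logit_scale` temperature:

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
model.eval()

# Placeholder inputs: a sample COCO image and two made-up captions
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    # Projected (but not yet normalized) features from each tower
    image_features = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_features = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

    # L2-normalize so the dot product becomes cosine similarity
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    # Scaled cosine similarity: learned temperature * (image . text)
    logits_per_image = model.logit_scale.exp() * image_features @ text_features.t()

    # Compare against what the full forward pass reports
    outputs = model(**inputs)
    print(torch.allclose(logits_per_image, outputs.logits_per_image, atol=1e-4))

print(logits_per_image.softmax(dim=-1))  # probabilities over the two captions
```

Is this understanding of the comparison (shared projection space, L2 normalization, temperature-scaled cosine similarity) correct, or does CLIP do something more when comparing the two feature types?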