How to get multimodal embeddings from a CLIP model? I’m hoping to use CLIP to get a single embedding for each row of multimodal (image and text) data.
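
For context, CLIP encodes images and text with separate encoders that project into a shared space, so as far as I can tell there is no single call that returns one joint vector for an image-text pair. A minimal sketch of one possible workaround, assuming the per-modality vectors have already been computed (for example with `get_image_features` and `get_text_features` on Hugging Face's `CLIPModel`): L2-normalize each vector, then either average or concatenate them. The helper names `l2_normalize` and `fuse_embeddings` are hypothetical, and whether averaging or concatenation works better for a given downstream task is an open question here, not something CLIP prescribes.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def fuse_embeddings(image_emb, text_emb, mode="mean"):
    """Combine one image embedding and one text embedding into one vector.

    Hypothetical helper, not part of CLIP or any library.
    mode="mean":   average the two unit vectors, then re-normalize
                   (requires equal dimensions, which holds for CLIP's
                   projected image and text features).
    mode="concat": place the two unit vectors side by side.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    if mode == "concat":
        return img + txt
    if mode == "mean":
        if len(img) != len(txt):
            raise ValueError("mean fusion needs equal-length embeddings")
        # Raises ValueError in the degenerate case where the two unit
        # vectors are exact opposites and the average is the zero vector.
        return l2_normalize([(a + b) / 2 for a, b in zip(img, txt)])
    raise ValueError(f"unknown mode: {mode!r}")


# Toy 2-dimensional vectors standing in for real CLIP features.
row_embedding = fuse_embeddings([3.0, 4.0], [0.0, 2.0])
print(row_embedding)
```

Applied per row, this would give one fixed-length vector for each image-text pair. Concatenation keeps both modalities intact at twice the dimension, while the mean stays in the original dimension but blends the two.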