I am a student studying CLIP.
I have a question about how CLIP's loss is calculated, so I am posting it here.
Below is the pseudocode for calculating CLIP’s loss.
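(This is the Numpy-style pseudocode from Figure 3 of the CLIP paper, reproduced from memory, so minor details may differ.)

```python
# image_encoder - ResNet or Vision Transformer
# text_encoder  - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l]       - minibatch of aligned texts
# W_i[d_i, d_e] - learned proj of image to embed
# W_t[d_t, d_e] - learned proj of text to embed
# t             - learned temperature parameter

# extract feature representations of each modality
I_f = image_encoder(I)  # [n, d_i]
T_f = text_encoder(T)   # [n, d_t]

# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)

# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)

# symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t) / 2
```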
In the part under the “# scaled pairwise cosine similarities [n, n]” comment, I understand that the cosine similarity matrix between the image and text embeddings is computed. Then, using the labels array, cross-entropy is applied with the diagonal entries of the similarity matrix as the correct classes to obtain the loss.
At this point, what is the difference between loss_i and loss_t? Shouldn’t the two losses for image and text have the same value?
My understanding is that when calculating the cross-entropy, only the values highlighted in blue (the diagonal entries) are taken into account, while the rest are multiplied by zero and drop out. If both losses are computed from the same diagonal values, wouldn’t loss_i and loss_t be exactly the same?
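To make my question concrete, here is a small runnable sketch of how I understand the two terms. The `cross_entropy` function is my own toy implementation, and the embeddings are random placeholders for real encoder outputs (the temperature is omitted); `loss_i`/`loss_t` follow the axis convention from the pseudocode.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8

# random L2-normalized stand-ins for the image/text embeddings I_e, T_e
I_e = rng.normal(size=(n, d))
I_e /= np.linalg.norm(I_e, axis=1, keepdims=True)
T_e = rng.normal(size=(n, d))
T_e /= np.linalg.norm(T_e, axis=1, keepdims=True)

logits = I_e @ T_e.T  # [n, n] pairwise cosine similarities

def cross_entropy(logits, axis):
    # numerically stable log-softmax along the given axis
    z = logits - logits.max(axis=axis, keepdims=True)
    log_softmax = z - np.log(np.exp(z).sum(axis=axis, keepdims=True))
    # with labels = arange(n), the correct entries are the diagonal
    return -np.diag(log_softmax).mean()

loss_i = cross_entropy(logits, axis=0)
loss_t = cross_entropy(logits, axis=1)
print(loss_i, loss_t)
```

Since both directions read off the same diagonal entries, I would expect the two printed values to be equal.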
I do not understand why the loss is calculated separately for each modality and then averaged.
I am unsure which part of the loss calculation I have misunderstood…