I am new to this field, so maybe I am misunderstanding something, but I thought you could use BERT embeddings to measure semantic similarity. I was trying to group some words into categories this way, but the results were very bad.
For example, here is a small test with animals and fruits. Notice that the most similar pair of distinct words is 'cat' and 'banana'?!
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True).eval()
def gen_embedding(word):
    encoding = tokenizer(word, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**encoding)
    token_embeddings = outputs.last_hidden_state.squeeze()
    # drop the [CLS] and [SEP] tokens, then average the remaining subword embeddings
    token_embeddings = token_embeddings[1:-1]
    word_embedding = token_embeddings.mean(dim=0)
    return word_embedding
words = [
'cat',
'seagull',
'mango',
'banana'
]
embs = [gen_embedding(word) for word in words]
print(cosine_similarity(embs))
# array([[1. , 0.33929926, 0.7086487 , 0.79372996],
# [0.33929926, 1.0000001 , 0.29915804, 0.4000572 ],
# [0.7086487 , 0.29915804, 1. , 0.7659105 ],
# [0.79372996, 0.4000572 , 0.7659105 , 0.99999976]], dtype=float32)
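For context, the grouping step I had in mind was roughly the sketch below (illustrative only: the choice of AgglomerativeClustering and n_clusters=2 is just an example, and on scikit-learn versions before 1.2 the metric argument is called affinity instead):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

sim = cosine_similarity(embs)
dist = np.clip(1.0 - sim, 0.0, None)  # turn similarities into a distance matrix

clustering = AgglomerativeClustering(n_clusters=2, metric='precomputed', linkage='average')
labels = clustering.fit_predict(dist)
print(dict(zip(words, labels)))
# I expected the animals and the fruits to land in separate clusters,
# but with 'cat' and 'banana' being the most similar pair, that is not what comes out.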
Am I doing something wrong?