I have a list of around 11,000 different tags. I want to assign relevant tags to companies based on the descriptions I have of them. Each description is a short text of around 1,000 characters that describes a company and the services it offers. I've created embeddings of the descriptions and the tags, and I use cosine similarity to get a list of the most relevant tags for each company. This works relatively well, but I noticed that there is a bias toward more specific (longer) tags over more generic ones. For example:
A company might produce agricultural machinery. The following tags might be found:
‘agricultural machinery’, ‘machinery manufacturing’, ‘agricultural technology’
But the dataset also contains more generic tags, such as ‘machinery’, that might be relevant to the company yet rank lower.
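For reference, the matching step is roughly equivalent to the sketch below. This is an illustration rather than my actual code: the function name `top_k_tags` is hypothetical, and the toy 3-dimensional vectors are made-up stand-ins for the real embeddings, which come from an embedding model not shown here.

```python
import numpy as np

def top_k_tags(desc_vec, tag_vecs, tag_names, k=3):
    """Return the k tags whose embeddings are most cosine-similar to desc_vec."""
    # Normalize to unit length so the dot product equals cosine similarity.
    desc = desc_vec / np.linalg.norm(desc_vec)
    tags = tag_vecs / np.linalg.norm(tag_vecs, axis=1, keepdims=True)
    sims = tags @ desc
    # Sort by descending similarity and keep the first k.
    order = np.argsort(-sims)[:k]
    return [(tag_names[i], float(sims[i])) for i in order]

# Toy vectors standing in for real embeddings (assumed values, for illustration only).
tag_names = ["agricultural machinery", "machinery", "banking"]
tag_vecs = np.array([
    [0.9, 0.4, 0.0],
    [0.5, 0.8, 0.1],
    [0.0, 0.1, 1.0],
])
desc_vec = np.array([1.0, 0.5, 0.0])

ranked = top_k_tags(desc_vec, tag_vecs, tag_names, k=2)
print(ranked)
```

In this toy setup the specific tag ranks above the generic one, which mirrors the behaviour I see on the real data.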
The reason I want these more generic tags to be matched as well is that I expect users to search with generic tags more often. If those tags are rarely assigned, the companies might not be found.
What would be a good approach to matching the more generic tags to the company descriptions?