I want to make a classifier for text, which will then be used to suggest the most similar text to a given one.
The flow of the app is the following:
- extract the main 10 topics from the text, using an LLM (it can choose from a 150-word pool)
- I turn the topic words into a binary vector, basically working in a 150-dimensional space, where each text has a coordinate like `[1, 0, 1, ..., 0]`
- then I find the closest neighbour (I want to extend to 3-5, but for simplicity, let's assume it is only one) using cosine distance
- I receive the closest text (a minimal sketch of this matching step is below)
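For reference, this is roughly what the matching step looks like (a minimal sketch; `TOPIC_POOL` stands in for the real 150-word pool and the LLM call that produces the topic lists is omitted):

```python
import numpy as np

# Placeholder for the real 150-word pool; only a few entries shown.
TOPIC_POOL = ["politics", "sports", "finance", "health", "technology"]
TOPIC_INDEX = {word: i for i, word in enumerate(TOPIC_POOL)}

def to_binary_vector(topics):
    """Map the topic words returned by the LLM to a binary vector over the pool."""
    vec = np.zeros(len(TOPIC_POOL))
    for word in topics:
        if word in TOPIC_INDEX:
            vec[TOPIC_INDEX[word]] = 1.0
    return vec

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def closest_text(query_topics, corpus):
    """corpus: list of (text, topic_words) pairs already processed by the LLM."""
    q = to_binary_vector(query_topics)
    best_text, _ = max(corpus, key=lambda item: cosine_similarity(q, to_binary_vector(item[1])))
    return best_text
```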
The problem is that the texts are pretty different, and the LLM gives the topics pretty well, but the suggested texts are not exactly what I was expecting. I tried to order the topics based on importance and make the vector non-binary (`[10, 0, 0, 9, ..., 1]`), but that didn't seem to help a lot.
I was wondering whether this approach is just not a good fit for my problem, or if I should use other parameters or anything else for grouping my texts.
If you are already using an LLM, you are already spending a lot of compute, so it does not seem like a good idea to me to then collapse its output into a simple binary vector and use that for the actual clustering: that step likely loses most of the semantic information the LLM actually captured.
It would probably be much more efficient to either use something like SentenceTransformers for embeddings plus k-means clustering if you just want clusters/groups, or to use something like FAISS to build a vector index over all embedded documents and run similarity search against it. If the latter is too much of a hassle, you can also just use any library that can compute similarity metrics between vectors and apply it to the (normalized) embedded documents.
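A minimal sketch of the second option, assuming `sentence-transformers` and `faiss-cpu` are installed (the model name `all-MiniLM-L6-v2` and `k = 3` are just assumptions, not requirements):

```python
# Embed all documents, index them with FAISS, and retrieve the top-k
# neighbours by cosine similarity (inner product on normalized vectors).
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

documents = ["first text ...", "second text ...", "third text ..."]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(documents, normalize_embeddings=True).astype("float32")

index = faiss.IndexFlatIP(doc_vecs.shape[1])  # inner product == cosine for unit vectors
index.add(doc_vecs)

query_vec = model.encode(["text to find similar documents for"],
                         normalize_embeddings=True).astype("float32")
scores, ids = index.search(query_vec, 3)  # 3 nearest neighbours
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {documents[i]}")
```

If you only want groups rather than nearest neighbours, you could instead feed the same normalized embeddings into scikit-learn's `KMeans`.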