Here's an explanation of what I want to build (it's a school project).
1. A user supplies a file containing only the answers to whatever question was asked in the survey.
2. The program finds similar answers and groups them into unnamed clusters (I'm thinking of MeanShift, GMM, or KMeans).
3. If possible, I would also like it to generate a label for each cluster.
4. The clustered and labeled answers are written back into a file, to be checked and used for whatever purpose (see the sketch after this list for the rough input/output flow I have in mind).
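Roughly this kind of end-to-end flow; the file names, column names and the `cluster_answers` helper are just placeholders, not working code:

```python
import pandas as pd

def cluster_answers(answers):
    # placeholder: this is where vectorization + clustering would happen;
    # it should return one cluster id (and eventually a label) per answer
    raise NotImplementedError

# read a file that contains only the free-text answers, one per line
answers = pd.read_csv("answers.csv", header=None, names=["answer"])["answer"].astype(str)

# assign each answer to a cluster
cluster_ids = cluster_answers(answers)

# write the answers with their cluster assignment back out for manual checking
out = pd.DataFrame({"answer": answers, "cluster": cluster_ids})
out.to_csv("clustered_answers.csv", index=False)
```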
Some context on the data: the answers are mostly short (with some longer ones of 10+ words), e.g. "i dont know", "??", "helpful", "red", and each question has anywhere from 200 to 2000 answers. The answers are in Dutch or French; would it be recommended to translate them to English for better performance? There are usually around 7 to 20 clusters (higher counts are rare). I also have the correct labels for the answers, so I can check whether the algorithm has clustered them correctly.
I have looked into this and learned that I need to vectorize my texts first; for that I tried TF-IDF and CountVectorizer in scikit-learn. I also found their algorithm cheat sheet, which recommends MeanShift.
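This is roughly the pipeline I tried (the exact parameters may have differed, and the example answers here are just toy data standing in for the real file):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MeanShift, KMeans

# tiny example data; in practice these come from the uploaded answers file
answers = ["i dont know", "??", "helpful", "red", "very helpful", "no idea", "dark red"]

# turn the short answers into sparse TF-IDF vectors
vectorizer = TfidfVectorizer(lowercase=True)
X = vectorizer.fit_transform(answers)

# MeanShift needs a dense array; with 200-2000 short answers that is still manageable
ms_labels = MeanShift().fit_predict(X.toarray())

# KMeans for comparison; on real data n_clusters would be somewhere in my expected 7-20 range
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```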
I haven't tried tuning the parameters yet, but the performance seems very poor (close to random). I used Adjusted Rand Index, Normalized Mutual Information, and Silhouette Score to evaluate, roughly as in the snippet below.
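Continuing from the snippet above; `true_labels` stands in for my hand-made labels and the values here are just placeholders for the toy data:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score, silhouette_score

# placeholder ground-truth labels for the 7 example answers above
true_labels = [0, 1, 2, 3, 2, 0, 3]
pred_labels = km_labels  # or ms_labels, using X and the labels from the snippet above

# ARI and NMI compare the predicted clustering against the known labels
ari = adjusted_rand_score(true_labels, pred_labels)
nmi = normalized_mutual_info_score(true_labels, pred_labels)

# silhouette only looks at the vectors and the predicted clusters, no ground truth needed
sil = silhouette_score(X, pred_labels)

print(f"ARI: {ari:.3f}  NMI: {nmi:.3f}  silhouette: {sil:.3f}")
```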
Am I on the right track, or are there better options out there (vectorization methods, embeddings, clustering algorithms)?