I’m a psychologist moving into the field of AI, and I could really use some help with a project. This semester I discovered Word2Vec and was fascinated by its ability to find associations between words, so I decided to try it on an artificial corpus of psychotherapy documents generated by ChatGPT. However, I have a practical problem with the size of the vocabulary. An analyst’s psychotherapy reports don’t exceed 10,000 words, and I know that cleaning such a corpus typically reduces it by at least half.
I would appreciate advice on software that can get the most out of very little data. My Word2Vec models have been inconsistent: every time I retrain the model, it produces very different results.
I’ve been training Word2Vec on a full corpus of around 11,000 words, which drops to about 5,000 words after preprocessing. I’ve also been using PCA to inspect the results, and that is where I became unsure about them, because the model kept changing from one run to the next.
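To make "very different results" concrete, here is a minimal sketch of how the agreement between two retrainings could be measured. It is not my actual pipeline: the function name `neighbour_overlap` and the example words are hypothetical, and the neighbour lists are made up for illustration. It assumes that for each run I have already extracted the top nearest neighbours of a few words, and it compares the two runs with a Jaccard overlap (1.0 means identical neighbours, 0.0 means none in common).

```python
from typing import Dict, List


def neighbour_overlap(
    run_a: Dict[str, List[str]], run_b: Dict[str, List[str]]
) -> Dict[str, float]:
    """Jaccard overlap of the neighbour lists that two runs give each shared word."""
    scores = {}
    for word in run_a.keys() & run_b.keys():
        a, b = set(run_a[word]), set(run_b[word])
        union = a | b
        # Two empty lists are treated as perfectly matching.
        scores[word] = len(a & b) / len(union) if union else 1.0
    return scores


# Hypothetical neighbour lists from two retrainings (illustrative, not real output).
run_1 = {
    "anxiety": ["fear", "worry", "panic"],
    "mother": ["father", "family", "child"],
}
run_2 = {
    "anxiety": ["fear", "worry", "sleep"],
    "mother": ["dream", "session", "work"],
}

scores = neighbour_overlap(run_1, run_2)
mean_overlap = sum(scores.values()) / len(scores)
print(scores, mean_overlap)
```

A mean overlap near zero across many words would confirm that the runs disagree almost completely, rather than differing only in how the PCA plot happens to be rotated.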
Vinicius Fantini Marques Roja is a new contributor to this site.