I am using spaCy to train a custom textcat_multilabel classification model (in GPU accuracy mode with the PyTorch GPU initializer) to classify text into a set of 17 classes.
I am generating the config for the text multilabel classification with:
python -m spacy init config /path/to/config.cfg --pipeline textcat_multilabel --optimize accuracy --force --gpu
And the model is trained with:
python -m spacy train /path/to/config.cfg --paths.train /path/to/data/train.spacy --paths.dev /path/to/data/test.spacy --output /path/to/output --gpu-id 0
One issue I am hitting is that the text I am trying to classify sometimes contains non-English words, such as “arraigo”, which do not have vectors in the static vectors used for training. As a result, sentences containing “arraigo” or similar words are not classified well because, I believe, the vectors for those words are all zeros.
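To show what I mean by "all zeros", here is a minimal sketch of the check. It does not use my trained pipeline: it assumes a blank English pipeline and a made-up 3-dimensional vector for a single known word ("residence"), both of which are purely illustrative. The helper name vector_report is hypothetical as well.

```python
import numpy as np
import spacy

# Blank English pipeline (tokenizer only) standing in for the real one.
nlp = spacy.blank("en")

# Hypothetical 3-dimensional vector for one known word, for illustration only.
nlp.vocab.set_vector("residence", np.array([0.1, 0.2, 0.3], dtype="float32"))


def vector_report(nlp, text):
    """Return (token text, has_vector, vector_is_all_zeros) for each token."""
    doc = nlp(text)
    return [(t.text, t.has_vector, not t.vector.any()) for t in doc]


report = vector_report(nlp, "arraigo residence")
for row in report:
    print(row)
```

Running the same helper against the pipeline loaded with spacy.load should show whether “arraigo” really has no entry in the vector table that the pipeline uses.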
I have found that you can use custom word vectors when training by adding --paths.vectors and a path to the vectors. However, I could not find any information on how to generate custom vectors for very specific words such as “arraigo”. Any recommendations?