I’m wondering if there is general advice for the smartest way to approach this problem.
I am using word2vec to determine similarity scores (the final output I am interested in) between specific words – some of these are single tokens but others should be bigrams. To complicate matters, I am using TensorFlow (in order to learn how to work with TensorFlow).
I want to keep bigrams that are found in a separate list:
Bigram_list = ["northern lights", "cloud cover", "table leg",...]
At the moment, the process looks something like this:

- Identify bigrams in the corpus (using NLTK collocations)
- Create identified_bigrams_list = ["northern lights", "cloud cover", "banana peel", ...]
- Search through identified_bigrams_list for matches in Bigram_list
- PROBLEM: replace the matches in the corpus with an underscore-joined form, e.g. "northern_lights", "cloud_cover". I've tried with a dictionary built from Bigram_list (e.g. "northern lights": "northern_lights"). The idea is to put the joined form back into the corpus so each bigram is treated as a single token and gets a single embedding.
Even if I can get this to work, this seems computationally inefficient, especially once I move on to a larger corpus for the actual training (currently using a tiny corpus in order to get this to work).
Any advice?
Here’s a streamlined approach to handle your task more efficiently while using TensorFlow:
- Identify bigrams in corpus: use NLTK or another library to identify bigrams in your corpus.
- Replace bigrams in corpus: replace identified bigrams with a single-token format (e.g., "northern_lights").
- Train Word2Vec: use the modified corpus to train your Word2Vec model.
- Calculate similarity scores: calculate similarity scores between specific words or bigrams.
Implementation:
Use NLTK to identify bigrams in your corpus:
import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def identify_bigrams(corpus):
    tokens = nltk.word_tokenize(corpus)
    finder = BigramCollocationFinder.from_words(tokens)
    # Keep the 50 highest-scoring bigrams by likelihood ratio
    bigrams = finder.nbest(BigramAssocMeasures.likelihood_ratio, 50)
    return [' '.join(bigram) for bigram in bigrams]
Next, you need to replace the identified bigrams with a single token format:
def replace_bigrams(corpus, bigram_list):
    tokens = nltk.word_tokenize(corpus)
    bigram_dict = {bigram: bigram.replace(' ', '_') for bigram in bigram_list}
    new_tokens = []
    skip_next = False
    for i in range(len(tokens) - 1):
        if skip_next:
            skip_next = False
            continue
        bigram = tokens[i] + ' ' + tokens[i + 1]
        if bigram in bigram_dict:
            new_tokens.append(bigram_dict[bigram])
            skip_next = True  # the next token was consumed by the bigram
        else:
            new_tokens.append(tokens[i])
    if not skip_next:
        new_tokens.append(tokens[-1])
    return ' '.join(new_tokens)
corpus = "The northern lights are amazing. I love the table leg design."
bigram_list = ["northern lights", "table leg"]
new_corpus = replace_bigrams(corpus, bigram_list)
print(new_corpus)  # "The northern_lights are amazing . I love the table_leg design ." (note: nltk's word_tokenize splits punctuation into separate tokens, so the rejoined string has spaces before "." )
Then you train the Word2Vec model on new_corpus and compute similarity scores from the resulting embeddings.
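For the final step, word2vec similarity is conventionally cosine similarity between the learned embedding vectors. A minimal sketch (the vectors below are made-up placeholders; in practice you would look up the rows of your trained TensorFlow embedding matrix for, say, "northern_lights" and a comparison word):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Placeholder 3-d vectors standing in for trained embeddings
vec_a = [0.2, 0.8, 0.1]
vec_b = [0.1, 0.9, 0.0]
print(cosine_similarity(vec_a, vec_b))
```

The result ranges from -1 to 1, with values near 1 meaning the two embeddings point in nearly the same direction.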