I am trying to compute the contextual entropy of each word in the sentences of a dataset.
I am following the definition of Contextual Entropy provided in this paper: for each position in a sentence, I take the probability of every word in the vocabulary given the (left) sentence context, multiply each probability by its log, sum over the vocabulary, and negate the result. This process is very time-consuming, though, so I need advice on alternative ways of computing the CE.
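Spelling it out to make sure I have the right quantity in mind: for position i with left context w_1 … w_{i-1}, I understand the contextual entropy to be

CE(i) = -\sum_{w \in V} p(w \mid w_1 \ldots w_{i-1}) \, \log p(w \mid w_1 \ldots w_{i-1})

where V is the model's vocabulary and p(w | ·) is the probability the masked language model assigns to w in the [MASK] position after the left context.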
So far my script looks roughly like this:
import math
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-chinese")
model = AutoModelForMaskedLM.from_pretrained("google-bert/bert-base-chinese")
# reuse the already loaded model and tokenizer instead of loading them a second time
pipe = pipeline("fill-mask", model=model, tokenizer=tokenizer)
BERTvocab = tokenizer.get_vocab()

for sentence in all_sentences:
    for i, word in enumerate(sentence):
        # for each word the context is its left context followed by [MASK]
        context = sentence[:i] + '[MASK]'
        singles = []
        for vw in BERTvocab:
            # get the probability of this vocabulary item in the [MASK] position
            teres = pipe(context, targets=[vw])
            softtyent = teres[0]["score"]
            # compute p(w|context) * log(p(w|context))
            temp_tyent = softtyent * math.log(softtyent)
            singles.append(temp_tyent)
        entropy = -sum(singles)
My questions are:
- Is my understanding of Contextual Entropy right?
- Considering that the dataset has 400 sentences of about 15 words each on average, is there a faster and computationally “more elegant” way to do this, e.g. getting the whole distribution over the vocabulary in one forward pass, as in the sketch below? Running the script is taking me ages…
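For reference, this is the kind of single-forward-pass approach I was wondering about (an untested sketch, not what I currently run; the helper name contextual_entropy is just mine): instead of calling the pipeline once per vocabulary item, take the logits at the [MASK] position directly from the model and compute the entropy over the whole distribution at once.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-chinese")
model = AutoModelForMaskedLM.from_pretrained("google-bert/bert-base-chinese")
model.eval()

def contextual_entropy(left_context):
    # append [MASK] to the left context and run a single forward pass
    inputs = tokenizer(left_context + tokenizer.mask_token, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # locate the [MASK] position and take the distribution over the whole vocabulary
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    log_probs = torch.log_softmax(logits[0, mask_pos], dim=-1)
    # entropy = -sum over the vocabulary of p * log p
    return float(-(log_probs.exp() * log_probs).sum())

# e.g. entropy at position i of a sentence: contextual_entropy(sentence[:i])

Is this the right direction, or is there something even better?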
Thanks!