I am working on an NLP project for sentiment analysis, and I am using spaCy to tokenize sentences. While reading the documentation, I learned about NER. I’ve read that it can be used to extract entities from text, for example to aid a user’s search.
What I am trying to understand is whether (and how) to incorporate it into my tokenization process. Here is an example.
text = "Let's not forget that Apple Pay in 2014 required a brand new iPhone in order to use it. A significant portion of Apple's user base wasn't able to use it even if they wanted to. As each successive iPhone incorporated the technology and older iPhones were replaced the number of people who could use the technology increased."
import spacy
sp = spacy.load('en_core_web_sm')
sentence = sp(text)
for word in sentence:
    print(word.text)
# Let
# 's
# not
# forget
# that
# Apple
# Pay
# in
# etc...
for word in sentence.ents:
    print(word.text + " _ " + word.label_ + " _ " + str(spacy.explain(word.label_)))
# Apple Pay _ ORG _ Companies, agencies, institutions, etc.
# 2014 _ DATE _ Absolute or relative dates or periods
# iPhone _ ORG _ Companies, agencies, institutions, etc.
# Apple _ ORG _ Companies, agencies, institutions, etc.
# iPhones _ ORG _ Companies, agencies, institutions, etc.
The first loop shows that ‘Apple’ and ‘Pay’ are separate tokens, while the second loop shows that spaCy recognizes ‘Apple Pay’ as a single ORG entity.
My thinking is: shouldn’t ‘Apple’ and ‘Pay’ be tokenized together as a single token, so that when I create my classifier it recognizes the entity (‘Apple Pay’) instead of a fruit (‘Apple’) and a verb (‘Pay’)? If so, how could I achieve that (let’s say) “type” of tokenization?
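To make the output I am after concrete: from the retokenization section of the docs, I am guessing something like the following would merge each entity span into a single token, though I am not sure whether this is the intended approach:
# my guess: merge each detected entity span into one token in place
with sentence.retokenize() as retokenizer:
    for ent in sentence.ents:
        retokenizer.merge(ent)
for word in sentence:
    print(word.text)
# Let
# 's
# not
# forget
# that
# Apple Pay
# in
# etc...
I also noticed a merge_entities pipeline component mentioned in the docs, so maybe adding that to the pipeline is the cleaner way to get the same effect?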
Thanks in advance.