I was recently tasked with building a Named Entity Recognizer as part of a project. The objective was to parse a given sentence and produce every possible combination of the entities it contains.
One suggested approach was to keep a lookup table of known connector words, such as articles and conjunctions, split the sentence on spaces, and remove the connector words from the resulting word list. What remains would be the named entities in the sentence.
These identified entities are then looked up in a second table that maps each one to its entity type. For example, if the sentence was:
Remember the Titans was a movie directed by Boaz Yakin
the possible outputs would be:
{Remember the Titans,Movie} was {a movie,Movie} directed by {Boaz Yakin,director}
{Remember the Titans,Movie} was a movie directed by Boaz Yakin
{Remember the Titans,Movie} was {a movie,Movie} directed by Boaz Yakin
{Remember the Titans,Movie} was a movie directed by {Boaz Yakin,director}
Remember the Titans was {a movie,Movie} directed by Boaz Yakin
Remember the Titans was {a movie,Movie} directed by {Boaz Yakin,director}
Remember the Titans was a movie directed by {Boaz Yakin,director}
Remember {the titans,Movie,Sports Team} was {a movie,Movie} directed by {Boaz Yakin,director}
Remember {the titans,Movie,Sports Team} was a movie directed by Boaz Yakin
Remember {the titans,Movie,Sports Team} was {a movie,Movie} directed by Boaz Yakin
Remember {the titans,Movie,Sports Team} was a movie directed by {Boaz Yakin,director}
The entity lookup table here would contain the following data:
Remember the Titans=>Movie
a movie=>Movie
Boaz Yakin=>director
the Titans=>Movie
the Titans=>Sports Team
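For concreteness, here is a minimal sketch of how the lookup idea could enumerate those outputs. It is a simplification rather than the exact approach described: it matches multi-word spans directly against the entity table instead of first stripping connector words, and the names `find_spans`, `annotations`, and the `max_len` limit are assumptions made for illustration.

```python
from itertools import combinations

# Entity table from the question; keys are lowercased for matching.
ENTITY_TABLE = {
    "remember the titans": ["Movie"],
    "a movie": ["Movie"],
    "boaz yakin": ["director"],
    "the titans": ["Movie", "Sports Team"],
}

def find_spans(tokens, table, max_len=4):
    """Return (start, end, types) for every token span found in the table."""
    spans = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            phrase = " ".join(tokens[start:end]).lower()
            if phrase in table:
                spans.append((start, end, table[phrase]))
    return spans

def overlaps(a, b):
    """True if two (start, end, types) spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]

def annotations(tokens, table):
    """Yield one annotated sentence per non-empty set of non-overlapping spans."""
    spans = find_spans(tokens, table)
    for size in range(1, len(spans) + 1):
        for chosen in combinations(spans, size):
            if any(overlaps(a, b) for a, b in combinations(chosen, 2)):
                continue
            by_start = {span[0]: span for span in chosen}
            out, i = [], 0
            while i < len(tokens):
                if i in by_start:
                    _, end, types = by_start[i]
                    out.append("{" + " ".join(tokens[i:end]) + "," + ",".join(types) + "}")
                    i = end
                else:
                    out.append(tokens[i])
                    i += 1
            yield " ".join(out)

tokens = "Remember the Titans was a movie directed by Boaz Yakin".split()
results = list(annotations(tokens, ENTITY_TABLE))
for line in results:
    print(line)
```

Trying every subset of spans is exponential in the number of matches, which is acceptable for a single sentence but would need pruning for longer text.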
An alternative that was put forward was to build a crude sentence tree with the connector words from the lookup table as parent nodes, and to look up the leaf nodes, which might contain the entities, in the entity table. The tree that was built for the sentence above would be:
My question is how the two approaches compare. Should I go with the tree approach to represent the parsed sentence, since it provides more semantic structure? Or is there a better approach to this problem altogether?
Unless you have an extensive background in natural language processing, I would use an existing library rather than rolling your own named-entity recognizer. Even if you’re an NLP expert, a heavily-used library will likely have many more testing hours committed to it than your new recognizer. In addition, existing libraries are likely to be more flexible since they’re used in diverse projects. Be aware that some libraries have restrictive licenses for commercial projects.
It’s not clear how your first approach would work, since articles, conjunctions, and other “connector” words are often part of named entities. Splitting (tokenizing) the sentence on whitespace is also error-prone, since many named entities contain more than one word (New York, Remember the Titans, For Whom the Bell Tolls…). Finally, if you’re using a lookup table to determine which non-connector words are named entities, you’ll have many false positives, as shown in chapter 7, section 5 of Natural Language Processing with Python.
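A small sketch makes the first problem concrete. The connector list below is a hypothetical stand-in (a real one would be much longer), and the entity table is the one from the question:

```python
# Hypothetical connector list: just a few articles and conjunctions.
CONNECTORS = {"a", "an", "the", "and", "or", "but"}

# Entity names from the question's lookup table, lowercased.
ENTITY_TABLE = {"remember the titans", "a movie", "boaz yakin", "the titans"}

sentence = "Remember the Titans was a movie directed by Boaz Yakin"

# Split on spaces and drop the connector words.
kept = [word for word in sentence.split() if word.lower() not in CONNECTORS]
remaining = " " + " ".join(kept).lower() + " "

# Which table entries can still be found as whole phrases afterwards?
surviving = sorted(name for name in ENTITY_TABLE if f" {name} " in remaining)

print(kept)       # words left after removing connectors
print(surviving)  # entity names that are still intact
```

Only "boaz yakin" survives. Every entity that contains an article is broken apart before the lookup ever runs.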
I think your second approach is a step in the right direction, though it also suffers from the aforementioned limitations. You do need a tree, but you need a tree that represents all parts of speech. As an example, the default NLTK named entity chunker gives you a rich tree:
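import nltk  # the tokenizer, tagger, and NE chunker models must be downloaded once via nltk.download()

sentence = "Remember the Titans was a movie by Boaz Yakin."
tokens = nltk.word_tokenize(sentence)   # split into word and punctuation tokens
pos_tags = nltk.pos_tag(tokens)         # tag each token with its part of speech
print(nltk.ne_chunk(pos_tags))          # group tagged tokens into named-entity chunks
# output:
(S
  Remember/NNP
  the/DT
  Titans/NNPS
  was/VBD
  a/DT
  movie/NN
  by/IN
  (PERSON Boaz/NNP Yakin/NNP)
  ./.)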
Of course, NLTK’s default named entity chunker missed “Remember the Titans” as a single named entity (though it did recognize “Boaz Yakin”). This brings up the hardest part of named entity recognition. Even with a rich and well-tested library, you’ll need to train your named entity recognizer on data that closely resembles your own. (NLTK’s default named entity chunker is trained on the ACE corpus.) If you’re interested, Matt Johnson takes a closer look at the default NLTK named entity chunker in Lifting the hood on NLTK’s NE chunker.
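Once you have a chunk tree, pulling out the recognized entities is a short traversal. The sketch below rebuilds the tree shown above from a string, so it runs without any downloaded models; in practice you would pass the result of nltk.ne_chunk directly. The helper name `named_entities` is my own, not part of NLTK.

```python
from nltk import Tree
from nltk.tag import str2tuple

# The chunker output from above, pasted in as a string so that this sketch
# needs no downloaded models; normally this would be nltk.ne_chunk(pos_tags).
chunked = Tree.fromstring(
    "(S Remember/NNP the/DT Titans/NNPS was/VBD a/DT movie/NN by/IN "
    "(PERSON Boaz/NNP Yakin/NNP) ./.)",
    read_leaf=str2tuple,  # turn "Boaz/NNP" into ("Boaz", "NNP")
)

def named_entities(tree):
    """Return (label, text) for every chunk below the sentence root."""
    return [
        (subtree.label(), " ".join(word for word, _tag in subtree.leaves()))
        for subtree in tree.subtrees(lambda t: t is not tree)
    ]

entities = named_entities(chunked)
print(entities)
```

With the default chunker this yields only the PERSON chunk for “Boaz Yakin”, which is one quick way to see what a model is missing on your own sentences.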