Sentence Tree vs. Words List

I was recently tasked with building a Named Entity Recognizer as part of a project. The objective was to parse a given sentence and come up with all the possible combinations of the entities.

One suggested approach was to keep a lookup table of known connector words, such as articles and conjunctions, split the sentence on spaces, and remove the connector words from the resulting word list. What remains would be the named entities in the sentence.

These identified entities are then looked up in a second table that associates each one with an entity type. For example, if the sentence was "Remember the Titans was a movie directed by Boaz Yakin", the possible outputs would be:

{Remember the Titans,Movie} was {a movie,Movie} directed by {Boaz Yakin,director}
{Remember the Titans,Movie} was a movie directed by Boaz Yakin
{Remember the Titans,Movie} was {a movie,Movie} directed by Boaz Yakin
{Remember the Titans,Movie} was a movie directed by {Boaz Yakin,director}
Remember the Titans was {a movie,Movie} directed by Boaz Yakin
Remember the Titans was {a movie,Movie} directed by {Boaz Yakin,director}
Remember the Titans was a movie directed by {Boaz Yakin,director}
Remember the {the titans,Movie,Sports Team} was {a movie,Movie} directed by {Boaz Yakin,director}
Remember the {the titans,Movie,Sports Team} was a movie directed by Boaz Yakin
Remember the {the titans,Movie,Sports Team} was {a movie,Movie} directed by Boaz Yakin
Remember the {the titans,Movie,Sports Team} was a movie directed by {Boaz Yakin,director}

The entity lookup table here would contain the following data:

Remember the Titans=>Movie
a movie=>Movie
Boaz Yakin=>director
the Titans=>Movie
the Titans=>Sports Team
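To make the first approach concrete, here is a minimal Python sketch of the enumeration step. It is only an illustration under some assumptions: the table is a plain dictionary keyed by lowercased phrase, matching is case-insensitive, and the names ENTITY_TABLE, find_matches, render and all_taggings are hypothetical, not part of any library. It finds every token span that appears in the table and then emits every non-empty set of non-overlapping matches.

```python
from itertools import combinations

# Hypothetical entity lookup table mirroring the entries listed above.
ENTITY_TABLE = {
    "remember the titans": ["Movie"],
    "a movie": ["Movie"],
    "boaz yakin": ["director"],
    "the titans": ["Movie", "Sports Team"],
}


def find_matches(tokens, table):
    """Return (start, end, types) for every token span found in the table."""
    matches = []
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            phrase = " ".join(tokens[start:end]).lower()
            if phrase in table:
                matches.append((start, end, table[phrase]))
    return matches


def overlaps(a, b):
    """True when two (start, end, types) spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def render(tokens, chosen):
    """Rebuild the sentence, wrapping each chosen span as {text,types}."""
    out, i = [], 0
    for start, end, types in sorted(chosen):
        out.extend(tokens[i:start])
        out.append("{" + " ".join(tokens[start:end]) + "," + ",".join(types) + "}")
        i = end
    out.extend(tokens[i:])
    return " ".join(out)


def all_taggings(sentence, table=ENTITY_TABLE):
    """Enumerate every non-empty set of non-overlapping entity matches."""
    tokens = sentence.split()
    matches = find_matches(tokens, table)
    results = []
    for size in range(1, len(matches) + 1):
        for subset in combinations(matches, size):
            if all(not overlaps(a, b) for a, b in combinations(subset, 2)):
                results.append(render(tokens, subset))
    return results


for line in all_taggings("Remember the Titans was a movie directed by Boaz Yakin"):
    print(line)
```

Trying every subset is exponential in the number of matches, so this would only be workable for short sentences.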

Another suggestion was to build a crude sentence tree with the connector words from the lookup table as parent nodes, and then look up each leaf node, which might contain an entity, in the entity table. The tree that was built for the sentence above would be:

My question is about the relative benefits of the two approaches. Should I go with the tree approach to represent the parsed sentence, since it provides a more semantic structure? Is there a better approach altogether?

Unless you have an extensive background in natural language processing, I would use an existing library rather than rolling your own named-entity recognizer. Even if you’re an NLP expert, a heavily used library will likely have many more testing hours behind it than your new recognizer. In addition, existing libraries are likely to be more flexible since they’re used in diverse projects. Be aware that some libraries have restrictive licenses for commercial projects.

It’s not clear how your first approach would work since articles, conjunctions, and other “connector” words are often part of named entities. Splitting (tokenizing) the sentence on white space is also error-prone since many named entities contain more than one word (New York, Remember the Titans, For Whom the Bell Tolls…). Finally, if you’re using a lookup table to determine which non-connector words are named entities, you’ll have many false positives as shown in chapter 7.5 of Natural Language Processing with Python.
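A quick way to see the problem is to run the connector-stripping step on titles that contain connector words. The sketch below is only illustrative: the CONNECTORS set and the strip_connectors helper are hypothetical stand-ins for your lookup table, and a real list would be much longer.

```python
# Hypothetical connector-word list (articles, conjunctions, a few prepositions).
CONNECTORS = {"a", "an", "the", "and", "or", "but", "by", "for", "of"}


def strip_connectors(sentence):
    """Split on whitespace and drop connector words, as in the first approach."""
    return [word for word in sentence.split() if word.lower() not in CONNECTORS]


remaining = strip_connectors("Remember the Titans was a movie directed by Boaz Yakin")
print(remaining)
# ['Remember', 'Titans', 'was', 'movie', 'directed', 'Boaz', 'Yakin']

print(strip_connectors("For Whom the Bell Tolls"))
# ['Whom', 'Bell', 'Tolls']
```

Once "the" is gone, neither "Remember the Titans" nor "the Titans" can be matched against the entity table, and nothing in the remaining list says that "Boaz" and "Yakin" belong together.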

I think your second approach is a step in the right direction, though it also suffers from the aforementioned limitations. You do need a tree, but you need a tree that represents all parts of speech. As an example, the default NLTK named entity chunker gives you a rich tree:

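import nltk

# Requires the NLTK tokenizer, tagger, and NE chunker data (see nltk.download).
sentence = "Remember the Titans was a movie by Boaz Yakin."
tokens = nltk.word_tokenize(sentence)  # split the sentence into word tokens
pos_tags = nltk.pos_tag(tokens)        # tag each token with its part of speech
print(nltk.ne_chunk(pos_tags))         # group tagged tokens into named-entity chunks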

# output:

(S
  Remember/NNP
  the/DT
  Titans/NNPS
  was/VBD
  a/DT
  movie/NN
  by/IN
  (PERSON Boaz/NNP Yakin/NNP)
  ./.)

Of course, NLTK’s default named entity chunker missed “Remember the Titans” as a single named entity (though it did recognize “Boaz Yakin”). This brings up the hardest part of named entity recognition: even with a rich and well-tested library, you’ll need to train your named entity recognizer on a corpus that closely resembles the text you want to process. (NLTK’s default named entity chunker is trained on the ACE corpus.) If you’re interested, Matt Johnson takes a closer look at the default NLTK named entity chunker in Lifting the hood on NLTK’s NE chunker.
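If you go the tree route, pulling the recognized entities back out is a short walk over the labelled subtrees. The sketch below rebuilds the tree printed above from its bracketed form with Tree.fromstring, so it runs without the tagger or chunker models. The named_entities helper is a hypothetical name, and it assumes the root is labelled S and every other labelled subtree is an entity chunk.

```python
from nltk import Tree

# Bracketed form of the tree shown above. Leaves are parsed into
# (word, tag) tuples, the same leaf shape that nltk.ne_chunk returns.
BRACKETED = """
(S Remember/NNP the/DT Titans/NNPS was/VBD a/DT movie/NN by/IN
   (PERSON Boaz/NNP Yakin/NNP) ./.)
"""
tree = Tree.fromstring(BRACKETED, read_leaf=lambda s: tuple(s.rsplit("/", 1)))


def named_entities(tree):
    """Collect (label, text) for every labelled chunk below the root."""
    entities = []
    for subtree in tree.subtrees(filter=lambda t: t.label() != "S"):
        words = [word for word, _tag in subtree.leaves()]
        entities.append((subtree.label(), " ".join(words)))
    return entities


print(named_entities(tree))
# [('PERSON', 'Boaz Yakin')]
```

The same function should work on the result of nltk.ne_chunk(pos_tags) directly, since that call also returns a Tree whose leaves are (word, tag) tuples.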


Filed under: softwareengineering - @ 16:12

Tags: natural-language-processing, parsing, text-processing
