I’m looking for some input and theory on how to approach a lexical topic.
Let’s say I have a collection of strings, each of which may be one sentence or potentially several. I’d like to parse these strings and rip out the most important words, perhaps with a score that denotes how likely each word is to be important.
Let’s look at a few examples of what I mean.
Example #1:
“I really want a Keurig, but I can’t afford one!”
This is a very basic example, just one sentence. As a human, I can easily see that “Keurig” is the most important word here. “afford” is also relatively important, though it’s clearly not the primary point of the sentence. The word “I” appears twice, but it is not important at all since it doesn’t really convey any information. I might expect a hash of words and scores something like this:
"Keurig" => 0.9
"afford" => 0.4
"want" => 0.2
"really" => 0.1
etc...
Example #2:
“Just had one of the best swimming practices of my life. Hopefully I can maintain my times come the competition. If only I had remembered to take off my non-waterproof watch.”
This example has multiple sentences, so there will be more important words throughout. Without repeating the point exercise from example #1, I would probably expect to see two or three really important words come out of this: “swimming” (or “swimming practice”), “competition”, & “watch” (or “waterproof watch” or “non-waterproof watch” depending on how the hyphen is handled).
Given a couple of examples like this, how would you go about doing something similar? Are there any existing (open-source) libraries or algorithms that already do this?
There are definitely people thinking about the problem you describe. João Ventura and Joaquim Ferreira da Silva’s Ranking and Extraction of Relevant Single Words in Text (pdf) is a nice introduction to existing ranking techniques as well as suggestions for improvement. All techniques they describe rely on a corpus (lots of text) versus one or two lines of text. Your corpus would have to be the collection of all samples or possibly many corpora of collected samples from specific sources. Keep in mind that single word (unigram) relevance is very much an unsolved problem. As the paper describes:
“…using purely statistical methods, this kind of classification isn’t always straightforward or even exact because, although the notion of relevance is a concept easy to understand, normally there’s no consensus about the frontier that separates relevance from non-relevance. For instance, words like “Republic” or “London” have significative relevance and words like “or” and “since” have no relevance at all, but what about words like “read”, “terminate” and “next”? These kind of words are problematic because usually there’s no consensus about their semantic value.”
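To make the corpus idea concrete, here is a minimal Python sketch of one simple statistical ranking, plain TF-IDF (my own choice for illustration; the paper proposes more refined metrics). A word scores high when it is frequent in a sample but appears in few of your other samples, which is how very common words like “I” get discounted relative to rarer ones like “Keurig” once the corpus is large enough. The three corpus strings are placeholders; you would load all of your collected samples.

import math
from collections import Counter

# Placeholder corpus: in practice, every sample you have collected.
corpus = [
    "I really want a Keurig, but I can't afford one!",
    "Just had one of the best swimming practices of my life.",
    "Hopefully I can maintain my times come the competition.",
]

def tokenize(text):
    # Crude tokenizer for illustration; an NLP toolkit would do better.
    return [w.strip(".,!?").lower() for w in text.split() if w.strip(".,!?")]

# Document frequency: how many samples contain each word at least once.
doc_freq = Counter()
for sample in corpus:
    doc_freq.update(set(tokenize(sample)))

def tfidf(sample):
    counts = Counter(tokenize(sample))
    total = sum(counts.values())
    n = len(corpus)
    # Term frequency in this sample, dampened by how widespread the word is.
    return {
        word: (count / total) * math.log((1 + n) / (1 + doc_freq[word]))
        for word, count in counts.items()
    }

print(tfidf("I really want a Keurig, but I can't afford one!"))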
There are many open source natural language processing toolkits. (Be careful. Some tools are free for research but require a commercial license for commercial use.) They’ll make your life easier regardless of the approach you choose.
I’m most familiar with the Natural Language Toolkit (NLTK). It’s easy to use, well-documented, and is featured in the book, Natural Language Processing with Python (freely available online). As a simple example of what NLTK might do for you, imagine using its part-of-speech tagger. With each word’s part-of-speech identified, you might consider proper nouns very important and adjectives less so. Verbs might be important and adverbs less so. It’s by no means a state-of-the-art ranking, but you get useful information with little effort. When you’re ready to move on to more sophisticated analysis, NLTK’s built-in ability to tokenize, tag, chunk, and classify will let you focus on the other details of your solution.
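A first pass at that idea might look like the sketch below. The weights are arbitrary numbers I made up to illustrate the point, not anything NLTK provides, and you would need to download the tokenizer and tagger models once with nltk.download.

import nltk

# One-time model downloads (names can vary slightly across NLTK versions):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

def pos_weight(tag):
    # Made-up weights: proper nouns matter most, adverbs least.
    if tag.startswith("NNP"):
        return 0.9   # proper nouns ("Keurig")
    if tag.startswith("NN"):
        return 0.5   # common nouns ("watch", "competition")
    if tag.startswith("VB"):
        return 0.3   # verbs ("want", "afford")
    if tag.startswith("JJ"):
        return 0.2   # adjectives
    if tag.startswith("RB"):
        return 0.1   # adverbs ("really")
    return 0.0       # everything else ("I", "a", "but")

def score_words(text):
    tokens = nltk.word_tokenize(text)
    scores = {}
    for word, tag in nltk.pos_tag(tokens):
        weight = pos_weight(tag)
        if weight > 0:
            scores[word] = max(scores.get(word, 0.0), weight)
    return scores

print(score_words("I really want a Keurig, but I can't afford one!"))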
Natural language processing is its own discipline with quite a lot of formal research done on it. I would start by looking there.
I would also reconsider my needs. Even after 50+ years of research, the best that computer scientists have come up with is Siri, so I would not expect a computer to do what you’re describing with any regularity.
Constraining the input helps (Siri, for example, assumes you’re giving it a simple command or question), so reconsidering my needs (assuming I do need NLP) would start with defining those constraints. After that I would hunt for a large set of examples, partly to test whatever I come up with, but also because many modern solutions involve machine learning and would need those examples as training data.
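As a rough sketch of what that could look like, you could hand-label words from your own samples as important or not, turn each word into a handful of features, and train a simple classifier (NLTK’s naive Bayes here; the features and labels below are invented for illustration and assume the NLTK POS tagger models are installed):

import nltk

def word_features(word, sentence):
    # A few cheap features; a real system would use many more.
    return {
        "capitalized": word[0].isupper(),
        "pos": nltk.pos_tag([word])[0][1],
        "length_over_4": len(word) > 4,
        "repeats": sentence.lower().split().count(word.lower()) > 1,
    }

# Hand-labeled examples drawn from your collected samples (invented here).
train = [
    (word_features("Keurig", "I really want a Keurig"), "important"),
    (word_features("want", "I really want a Keurig"), "unimportant"),
    (word_features("I", "I really want a Keurig but I cannot afford one"), "unimportant"),
    (word_features("competition", "I can maintain my times come the competition"), "important"),
    (word_features("the", "I can maintain my times come the competition"), "unimportant"),
]

classifier = nltk.NaiveBayesClassifier.train(train)
features = word_features("watch", "I forgot to take off my watch")
print(classifier.classify(features))
print(classifier.prob_classify(features).prob("important"))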
So in summary, I seriously doubt anything will be able to give you good scores in this sort of context-free scenario.