I’m looking for some input and theory on how to approach a lexical topic.
Let’s say I have a collection of strings, each of which may be one sentence or potentially several. I’d like to parse these strings and rip out the most important words, perhaps with a score that denotes how likely each word is to be important.
Let’s look at a few examples of what I mean.
Example #1:
“I really want a Keurig, but I can’t afford one!”
This is a very basic example, just one sentence. As a human, I can easily see that “Keurig” is the most important word here. “afford” is also relatively important, though it’s clearly not the primary point of the sentence. The word “I” appears twice, but it is not important at all since it doesn’t really convey any information. I might expect a hash of words and scores something like this:
"Keurig" => 0.9
"afford" => 0.4
"want" => 0.2
"really" => 0.1
etc...
Example #2:
“Just had one of the best swimming practices of my life. Hopefully I can maintain my times come the competition. If only I had remembered to take off my non-waterproof watch.”
This example has multiple sentences, so there will be more important words throughout. Without repeating the point exercise from example #1, I would probably expect to see two or three really important words come out of this: “swimming” (or “swimming practice”), “competition”, & “watch” (or “waterproof watch” or “non-waterproof watch” depending on how the hyphen is handled).
Given a couple of examples like this, how would you go about doing something similar? Are there any existing (open-source) libraries or algorithms that already do this?
There are definitely people thinking about the problem you describe. João Ventura and Joaquim Ferreira da Silva’s Ranking and Extraction of Relevant Single Words in Text (pdf) is a nice introduction to existing ranking techniques as well as suggestions for improvement. All techniques they describe rely on a corpus (lots of text) versus one or two lines of text. Your corpus would have to be the collection of all samples or possibly many corpora of collected samples from specific sources. Keep in mind that single word (unigram) relevance is very much an unsolved problem. As the paper describes:
“…using purely statistical methods, this kind of classification isn’t always straightforward or even exact because, although the notion of relevance is a concept easy to understand, normally there’s no consensus about the frontier that separates relevance from non-relevance. For instance, words like “Republic” or “London” have significative relevance and words like “or” and “since” have no relevance at all, but what about words like “read”, “terminate” and “next”? These kind of words are problematic because usually there’s no consensus about their semantic value.”
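To make the corpus idea concrete, here is a minimal Python sketch of one simple statistical ranking, plain TF-IDF (my own choice for illustration; the paper proposes more refined metrics). A word scores high when it is frequent in a sample but appears in few of your other samples, which is how very common words like “I” get discounted relative to rarer ones like “Keurig” once the corpus is large enough. The three corpus strings are placeholders; you would load all of your collected samples.

import math
from collections import Counter

# Placeholder corpus: in practice, every sample you have collected.
corpus = [
    "I really want a Keurig, but I can't afford one!",
    "Just had one of the best swimming practices of my life.",
    "Hopefully I can maintain my times come the competition.",
]

def tokenize(text):
    # Crude tokenizer for illustration; an NLP toolkit would do better.
    return [w.strip(".,!?").lower() for w in text.split() if w.strip(".,!?")]

# Document frequency: how many samples contain each word at least once.
doc_freq = Counter()
for sample in corpus:
    doc_freq.update(set(tokenize(sample)))

def tfidf(sample):
    counts = Counter(tokenize(sample))
    total = sum(counts.values())
    n = len(corpus)
    # Term frequency in this sample, dampened by how widespread the word is.
    return {
        word: (count / total) * math.log((1 + n) / (1 + doc_freq[word]))
        for word, count in counts.items()
    }

print(tfidf("I really want a Keurig, but I can't afford one!"))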
There are many open source natural language processing toolkits. (Be careful. Some tools are free for research but require a commercial license for commercial use.) They’ll make your life easier regardless of the approach you choose.
I’m most familiar with the Natural Language Toolkit (NLTK). It’s easy to use, well-documented, and is featured in the book, Natural Language Processing with Python (freely available online). As a simple example of what NLTK might do for you, imagine using its part-of-speech tagger. With each word’s part-of-speech identified, you might consider proper nouns very important and adjectives less so. Verbs might be important and adverbs less so. It’s by no means a state-of-the-art ranking, but you get useful information with little effort. When you’re ready to move on to more sophisticated analysis, NLTK’s built-in ability to tokenize, tag, chunk, and classify will let you focus on the other details of your solution.
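A first pass at that idea might look like the sketch below. The weights are arbitrary numbers I made up to illustrate the point, not anything NLTK provides, and you would need to download the tokenizer and tagger models once with nltk.download.

import nltk

# One-time model downloads (names can vary slightly across NLTK versions):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

def pos_weight(tag):
    # Made-up weights: proper nouns matter most, adverbs least.
    if tag.startswith("NNP"):
        return 0.9   # proper nouns ("Keurig")
    if tag.startswith("NN"):
        return 0.5   # common nouns ("watch", "competition")
    if tag.startswith("VB"):
        return 0.3   # verbs ("want", "afford")
    if tag.startswith("JJ"):
        return 0.2   # adjectives
    if tag.startswith("RB"):
        return 0.1   # adverbs ("really")
    return 0.0       # everything else ("I", "a", "but")

def score_words(text):
    tokens = nltk.word_tokenize(text)
    scores = {}
    for word, tag in nltk.pos_tag(tokens):
        weight = pos_weight(tag)
        if weight > 0:
            scores[word] = max(scores.get(word, 0.0), weight)
    return scores

print(score_words("I really want a Keurig, but I can't afford one!"))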
Natural language processing is its own discipline with quite a lot of formal research done on it. I would start by looking there.
I would also reconsider my needs. Even after 50+ years of research, the best that computer scientists have come up with is Siri, so I would not expect a computer to do what you’re describing with any regularity.
Constraining the input helps (Siri, for example, assumes you’re giving it a simple command or question), so reconsidering my needs (assuming I do need NLP) would start with defining those constraints. After that I would hunt for a large set of examples, partly to test whatever I come up with, but also because many modern solutions involve machine learning and would need those examples as training data.
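As a rough sketch of what that could look like, you could hand-label words from your own samples as important or not, turn each word into a handful of features, and train a simple classifier (NLTK’s naive Bayes here; the features and labels below are invented for illustration and assume the NLTK POS tagger models are installed):

import nltk

def word_features(word, sentence):
    # A few cheap features; a real system would use many more.
    return {
        "capitalized": word[0].isupper(),
        "pos": nltk.pos_tag([word])[0][1],
        "length_over_4": len(word) > 4,
        "repeats": sentence.lower().split().count(word.lower()) > 1,
    }

# Hand-labeled examples drawn from your collected samples (invented here).
train = [
    (word_features("Keurig", "I really want a Keurig"), "important"),
    (word_features("want", "I really want a Keurig"), "unimportant"),
    (word_features("I", "I really want a Keurig but I cannot afford one"), "unimportant"),
    (word_features("competition", "I can maintain my times come the competition"), "important"),
    (word_features("the", "I can maintain my times come the competition"), "unimportant"),
]

classifier = nltk.NaiveBayesClassifier.train(train)
features = word_features("watch", "I forgot to take off my watch")
print(classifier.classify(features))
print(classifier.prob_classify(features).prob("important"))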
So in summary, I seriously doubt anything will be able to give you good scores in this sort of context-free scenario.