A few days ago I deployed a project that extracts medical articles based on the results of a questionnaire completed by a user. For instance, if I answer that I have type 2 diabetes and that I’m a smoker, my algorithm extracts all articles related to diabetes and bubbles up those that contain information about both type 2 diabetes and smoking. Basically, we created a list of topics and, for every topic, defined a kind of “guideline” that lets us extract and order information for a user.
I’m quite sure there are better ways to relate two pieces of content, but I was not able to find them online. Could you suggest a model, algorithm or paper that would help me understand this kind of problem better and find a faster, more accurate way to extract information for a user?
This is a perfect application for a full-text indexer such as Lucene.
Let’s say your questionnaire asks about three things: smoking, diabetes and obesity. Once the text of the articles is indexed, you can use the answers you get to form queries that will return the most relevant articles first.
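So, for example, the query for an overweight, non-diabetic smoker might be:
obesity smoking
to return articles that mention either term, with articles mentioning both ranked higher (assuming the default OR operator), or
+obesity +smoking
to return only articles that explicitly mention both, or
+obesity +smoking -diabetes
to make sure selected articles mention both and do not mention diabetes.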
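As a rough sketch of how questionnaire answers could be turned into query strings like the ones above: the function name `build_query` and its two flags are my own invention for illustration, not part of Lucene. It only assembles a string in the classic `+term` / `-term` query syntax, which you would then hand to whatever query parser your indexer provides.

```python
def build_query(answers, require=True, exclude_negatives=True):
    """Turn questionnaire answers into a Lucene-style query string.

    answers maps a topic term to True (applies to the user) or
    False (does not apply). Both flags are hypothetical knobs.
    """
    def fmt(term):
        # Quote multi-word terms so they are treated as a phrase.
        return f'"{term}"' if " " in term else term

    clauses = []
    for term, present in answers.items():
        if present:
            # "+" marks the term as required; without it the term is optional.
            clauses.append(("+" if require else "") + fmt(term))
        elif exclude_negatives:
            # "-" excludes articles that mention the term.
            clauses.append("-" + fmt(term))
    return " ".join(clauses)


answers = {"obesity": True, "smoking": True, "diabetes": False}
loose_query = build_query(answers, require=False, exclude_negatives=False)
strict_query = build_query(answers)
```

Starting with the loose form and tightening it only when too many articles come back is probably the safer default, since excluding a term can hide articles that mention it just in passing.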
Your results can be further enhanced by expanding the query with a lexical resource such as WordNet, which can add synonyms and related words to it (e.g. expanding diabetes to include related words like neuropathy, retinopathy and insulin) so that articles containing those words are also treated as relevant.
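To illustrate the expansion idea without depending on any particular library, here is a minimal sketch that uses a small hand-written table of related terms in place of WordNet. The table `RELATED_TERMS`, the function `expand_query` and the 0.5 boost are all assumptions made for the example. The `^` suffix is the classic Lucene query syntax for boosting a term; a boost below 1 makes the added words count for less than the original term.

```python
# Hypothetical hand-written table standing in for a real lexical resource.
RELATED_TERMS = {
    "diabetes": ["neuropathy", "retinopathy", "insulin"],
}


def expand_query(terms, related=RELATED_TERMS, boost=0.5):
    """Append related words to each query term, with a lower boost."""
    clauses = []
    for term in terms:
        clauses.append(term)
        for extra in related.get(term, []):
            # term^0.5 tells the query parser to weight this term less.
            clauses.append(f"{extra}^{boost}")
    return " ".join(clauses)


expanded = expand_query(["diabetes", "smoking"])
```

Down-weighting the added words is a design choice: it keeps an article that only mentions insulin from outranking one that is actually about diabetes.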
I’ve built several systems that put full-text indexers to unusual use and have found that they provide a lot of query flexibility with a minimum of development and give very good results.
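If it helps to see why an index returns the most relevant articles first, here is a toy version of the idea in plain Python. It is not how Lucene is implemented (Lucene's scoring is more refined), just a simple inverted index with a TF-IDF style score, and the names `build_index` and `search` are made up for the sketch.

```python
import math
import re
from collections import Counter, defaultdict


def build_index(articles):
    """Map each word to {article_id: number of occurrences}."""
    index = defaultdict(dict)
    for doc_id, text in articles.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for word, count in Counter(words).items():
            index[word][doc_id] = count
    return index


def search(index, n_docs, terms):
    """Return article ids ranked by a simple TF-IDF sum over the terms."""
    scores = defaultdict(float)
    for term in terms:
        postings = index.get(term.lower(), {})
        if not postings:
            continue
        # Rarer terms get a higher weight than common ones.
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores, key=scores.get, reverse=True)


articles = {
    "a": "Smoking and obesity raise risk. Smoking matters.",
    "b": "Obesity in adults.",
    "c": "Diet and exercise.",
}
index = build_index(articles)
ranked = search(index, len(articles), ["obesity", "smoking"])
```

Article "a" mentions both terms and so comes out ahead of "b", while "c" matches nothing and is not returned at all. A real indexer adds stemming, stop words, phrase queries and much better scoring on top of this.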