Matching up articles with similar ones

I am creating a site where people can write on a niche topic. It is almost like a blog; however, the area, as I mentioned, is a small niche with (hopefully) passionate users.

I want a feature where, once someone posts an article, people with similar experience are notified so that they can read it. My question is: how do I determine which articles are similar? I know that tagging is one way, just like here on Stack Exchange, and I will be implementing it. But if people do not tag, or tag incorrectly, the whole user experience will suffer.

Does anyone have any pointers on how to match up articles other than the tagging method?


I am currently undertaking a similar (although more generic) project with my lab. As such, I want to warn you that this feature is a rabbit hole that can get very complicated very quickly. The first thing you need to do is think about your users and your goals and decide what’s “good enough” or you will spend a lot of time developing a feature that, in the grand scheme of your site, might not be that important.

Basically, you want some sort of information retrieval system: think of a mini-Google, but not nearly as complex. First you need to decide how you will define similarity between articles (a metric). This will be handled in your preprocessing. Generally, the actual comparison will be the same no matter which metric you choose (typically cosine similarity).

Defining a Metric

First, you need to decide what makes articles similar. There are two main approaches: looking for similarities in article topics or looking for similarities in article text. Topics will give better results but text is easier to implement.

Similarity by Topic

As mentioned several times, the easiest way to implement this system is to let authors specify topics through tags. You would then search for articles with a large overlap in tags. If the tags are numerous and fine-grained enough, this should give the best results.
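As a rough illustration of "large overlaps in tags", here is a minimal sketch that scores overlap with the Jaccard index (shared tags divided by all tags used by either article). The choice of Jaccard, the function names, and the sample titles and tags are all assumptions made for illustration; any other overlap measure would slot in the same way.

```python
def jaccard(tags_a, tags_b):
    """Share of tags two articles have in common, from 0.0 to 1.0."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 0.0  # two untagged articles are treated as unrelated
    return len(a & b) / len(a | b)


def most_similar(target_tags, articles, top_n=3):
    """Return up to top_n titles whose tags overlap most with target_tags.

    `articles` is a hypothetical mapping of title -> iterable of tags.
    Articles with no tag in common are left out.
    """
    scored = [(jaccard(target_tags, tags), title) for title, tags in articles.items()]
    scored = [(score, title) for score, title in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, then by title
    return [title for _, title in scored[:top_n]]


# Hypothetical sample data
articles = {
    "Fixing a slipping chain": {"chain", "repair", "drivetrain"},
    "Choosing a saddle": {"saddle", "comfort"},
    "Drivetrain cleaning": {"chain", "drivetrain", "cleaning"},
}
print(most_similar({"chain", "drivetrain"}, articles))
```

Note that an article with no tags never matches anything here, which is exactly the weakness the question worries about.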

The disadvantage is that you need to put a lot of thought into what the tags are to ensure you have coverage, clarity, and a lack of redundancy. If you take the Stack Exchange approach of letting users create their own tags then you can increase coverage but you need to moderate the tags to maintain the clarity/lack of redundancy. However, the greatest drawback of this approach is that you are trusting users to appropriately tag their posts. SE gets around this problem by letting other users edit and make suggestions for the tags.

You can get even better results if you tag topics at the sentence or paragraph level. It gives a better representation of which topics are more important in an article, but it’s more work. As the tagging scope gets smaller, the task quickly becomes much more difficult.

What about an automated solution to take the workload off the users? Automatic topic identification is something that has been studied a lot. I’m not an expert at it, but I suggest you read a few papers and decide whether you feel these solutions are mature enough to give reliable results. My concern with this approach is that, since you admit your domain is niche, you might have a hard time finding an out-of-the-box solution and will need to implement the topic identifier yourself. At that point you might as well just do text-based similarity, because it will be much easier and out-of-the-box solutions exist.

Similarity by Text

In this approach instead of comparing topic tags, you compare the actual words in the article. The advantage is that the preprocessing is much easier to accomplish. The disadvantage is that it assumes that similar text means a similar topic, which is not always the case.
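To make "compare the actual words" concrete, here is a minimal sketch of the preprocessing step: lowercase the text, split it into words, drop a few very common words, and count what is left. The tiny stop-word list and the tokenizing regex are assumptions for illustration; a real system would use a fuller list and probably stemming.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stop-word list
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "for", "on", "with"}


def term_frequencies(text):
    """Count how often each non-stop word occurs in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(word for word in words if word not in STOP_WORDS)


print(term_frequencies("The cat and the hat. The cat!"))
```

The resulting counts are the "word frequency" representation of an article; two articles can then be compared by how similar their counts are.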

Making it Work

In general, whichever metric you choose you will end up with a vector representing your articles. Maybe the vector is of word frequencies or of topic tags. You now need to compare the vectors for your articles to see which are similar.
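The vector comparison itself can be sketched as cosine similarity over sparse vectors stored as dictionaries (term or tag mapped to weight). The dictionary representation and the function name are assumptions for illustration; the formula is the standard dot product divided by the product of the vector lengths.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors given as dicts.

    Returns 0.0 if either vector is empty or all zeros.
    """
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in vec_a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical word-count vectors for two articles
article_1 = {"chain": 3, "repair": 1}
article_2 = {"chain": 1, "cleaning": 2}
print(cosine_similarity(article_1, article_2))
```

Because the score depends on direction rather than length, a long article and a short one about the same thing can still score as similar.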

The Stanford Natural Language Processing course offered on Coursera is a good introduction to information retrieval (specifically the Week 7 lectures). Keep in mind that the solutions presented in those lectures are relatively basic, but it’s a good start.

I would strongly suggest trying to find an out-of-the-box implementation here. Failing that, using a toolkit like Apache Lucene will greatly simplify your development.

Now you need to test out a bunch of term weighting algorithms and see which one gives the best results for your data. TREC (the Text REtrieval Conference) is a long-running evaluation series in which participants try to find better and better retrieval and weighting methods. Check the proceedings on their website to find discussions of newer, more accurate weighting algorithms.
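As one example of a term weighting scheme to try first, here is a minimal sketch of plain TF-IDF: a word counts for more when it is frequent in an article and rare across the whole collection. This particular variant (raw count times the natural log of N over document frequency) is just one common formulation chosen for illustration, not one the answer singles out, and the function name and sample tokens are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(tokenized_docs):
    """Turn a list of token lists into a list of {term: tf-idf weight} dicts."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical, already-tokenized articles
docs = [["bike", "gear", "gear"], ["bike", "brake"]]
for vector in tf_idf_vectors(docs):
    print(vector)
```

A term that appears in every article ("bike" here) gets weight zero, so it no longer makes all articles look alike. Swapping in a different weighting only means changing the expression inside the dictionary comprehension.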

I’ve got a few ideas:

A) You could pre-define a selection of categories that each poster is required to select, with either one category or multiple categories per post. This list of categories would have to be rather comprehensive, but could be edited over time.

B) You could implement a system along with poster tagging to allow users to suggest tags that should be added. As tags are suggested, the poster would be notified, and could easily approve or deny tags as necessary.

C) A hybrid approach: implement a tagging system, but allow users to subscribe to a collection of tags, allowing each user to customize categories that may require an article to have multiple tags in order to qualify.
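Option C could look something like the following minimal sketch, in which each user holds one or more tag groups and is notified when a new article carries every tag in at least one of their groups. The data layout, user names, and function name are hypothetical.

```python
def subscribers_to_notify(article_tags, subscriptions):
    """Return the users who should hear about an article with these tags.

    `subscriptions` maps a user name to a list of tag groups. A group
    matches only if the article has all of its tags.
    """
    tags = set(article_tags)
    return sorted(
        user
        for user, groups in subscriptions.items()
        if any(set(group) <= tags for group in groups)
    )


# Hypothetical subscriptions
subscriptions = {
    "alice": [{"chain", "repair"}],       # needs both tags
    "bob": [{"saddle"}, {"cleaning"}],    # either group is enough
    "carol": [{"chain", "racing"}],
}
print(subscribers_to_notify({"chain", "repair", "drivetrain"}, subscriptions))
```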

I don’t think you should worry too much about users not tagging articles. There is an incentive for them to tag correctly if they know that the tagging system is what attracts readers (and therefore feedback and future fame).

Maybe you should do your own indexing of the articles and add metadata to your articles in order to facilitate search and grouping of categories.

I implemented such a system together with some colleagues; we used Lucene.net to create a green office environment. It would be useful for indexing your articles, and you could build more searches based on the documents’ data. Many indexing systems that create metadata and tags are widely available on the internet. I found this one link; please try it.

