I am interested in finding an approach that will detect what language a string of text is written in, the way Google Translate does.
The first challenge in identifying the language is to narrow the possibilities down to something manageable. Each language has its own letter-frequency list.
- English: etao insr hldc umfp gwyb vkxj qz
- French: esai tnru lodc mpév qfbg hjàx èyêz çôùâ ûîœw kïëü æñ
- German: enis ratd hulc gmob wfkz vüpä ßjöy qx
- Spanish: eaos rnid lctu mpbg yívq óhfz jéáñ xúüw k
(specifics from http://www.letterfrequency.org and Wikipedia: Letter frequency)
Using this information one can check which characters appear in the text and rapidly cut down the choices. If a ß appears, the text is most likely German. Certain characters appear only in certain languages. This is not foolproof, however: the text could be discussing classic heavy metal bands that like to use characters outside the norm for their language, such as Mötley Crüe (see Metal umlaut), or it could contain borrowed words (some people write résumé in English).
This is where multiple steps come in:
- Validate likely languages through the character set (a sketch of this step follows the list)
- Compare letter frequency to languages
- Compare specific words to a known dictionary for the language
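Here is a minimal sketch of the first step. The table of distinctive characters is purely illustrative (and deliberately tiny); a real implementation would need a much more complete mapping:

```python
# Characters that strongly hint at one language (illustrative, not exhaustive).
DISTINCTIVE_CHARS = {
    "german": set("äöüß"),
    "french": set("àâçèéêëîïôùûœ"),
    "spanish": set("áéíñóú¿¡"),
}

def candidate_languages(text):
    """Keep only languages whose distinctive characters occur in the text.

    If no distinctive character is found, every language remains a candidate.
    """
    chars = set(text.lower())
    hits = {lang for lang, distinctive in DISTINCTIVE_CHARS.items()
            if distinctive & chars}
    return hits or set(DISTINCTIVE_CHARS)

print(candidate_languages("Die Straße ist naß"))  # {'german'}
print(candidate_languages("plain ascii text"))    # all three remain candidates
```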
When doing letter-frequency analysis, one should maintain both an accented and an unaccented set, for situations where the writer uses plain unaccented Latin characters rather than making full use of the language's character set.
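For the unaccented set, one way to normalize text (using only Python's standard library) is to decompose characters and drop the combining marks:

```python
import unicodedata

def strip_accents(text):
    """Decompose each character (NFD), then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("résumé"))  # resume
```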
The combination of this information is then sent through statistical processes to produce the best guess at the language (and yes, I am completely glossing over this section because my statistical math is weak and it would go quite beyond the basics). More about this in Language identification: Statistical approaches; that Wikipedia article points to a number of papers and libraries on the subject.
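To make the glossed-over part slightly concrete: one simple scoring scheme (an assumption for illustration, not necessarily what production libraries do) is to measure the squared distance between the observed letter frequencies and each language's expected frequencies, then pick the minimum:

```python
from collections import Counter

# Expected relative letter frequencies (tiny illustrative subset; real tables
# would cover the whole alphabet, from sources like letterfrequency.org).
EXPECTED = {
    "english": {"e": 0.127, "t": 0.091, "a": 0.082, "o": 0.075},
    "german":  {"e": 0.164, "n": 0.098, "i": 0.076, "s": 0.073},
}

def frequency_distance(text, expected):
    """Sum of squared differences between observed and expected frequencies."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters) or 1
    observed = Counter(letters)
    return sum((observed[ch] / total - freq) ** 2
               for ch, freq in expected.items())

def guess_language(text):
    """Return the language whose frequency profile is closest to the text."""
    return min(EXPECTED, key=lambda lang: frequency_distance(text, EXPECTED[lang]))
```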
The tools use a mix of routines to determine the language of a string of text.
In some cases the presence of certain specific characters increases the likelihood of specific languages: letters with accents and umlauts, for example. In some cases it is specific words, such as the language-specific definite articles der, die, das in German. In other cases it is the character types used that identify the language family: Chinese, Arabic…
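The "character types" case can be approximated with Unicode character names, since the first word of a name is usually the script. This is a rough heuristic of my own, not a routine from any particular tool:

```python
import unicodedata
from collections import Counter

def dominant_script(text):
    """Guess the script family from Unicode character names, e.g.
    'LATIN SMALL LETTER A' -> LATIN, 'ARABIC LETTER MEEM' -> ARABIC."""
    scripts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts[name.split()[0]] += 1
    return scripts.most_common(1)[0][0] if scripts else "UNKNOWN"

print(dominant_script("你好世界"))  # CJK
print(dominant_script("مرحبا"))    # ARABIC
```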
There are a number of open source and paid libraries available. They are based on ingesting a large corpus of example documents to help with the probabilities. Some allow you to add additional rules if you have specific cases that cause problems.
The longer the string you want to identify, the better the results. A one-word string may fit multiple languages; a paragraph will give better results. Of course, if the document mixes languages in different sections, you can get unexpected results.
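As one concrete example of such a library (one option among several, not an endorsement from the answer above): the Python package langdetect, installable with pip install langdetect, offers both a single best guess and a probability list, the latter being useful for exactly the short, ambiguous strings described here:

```python
from langdetect import detect, detect_langs

print(detect("Das ist ein kurzer deutscher Satz."))  # de
# Short or mixed text is ambiguous; the probability list shows why:
print(detect_langs("resume"))  # e.g. [fr:0.57, en:0.42]; exact values vary per run
```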
A simple approach could be to use word lists (hash-maps) of the most common words for every language you want to support.
When you parse the text, take each word and check whether it can be found in one of these lists. The word list with the most hits is very likely that of the language the text is written in. (Check them all for every word; some words appear in more than one language, sometimes meaning something completely different.)
The shorter the texts you want to analyze, the longer the word-lists will have to be to reliably identify the language.
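A minimal sketch of this word-list approach, with deliberately tiny lists (real ones would hold the few hundred most common words per language):

```python
# Tiny illustrative word lists; real ones would be far longer.
COMMON_WORDS = {
    "english": {"the", "and", "is", "of", "to", "in"},
    "german":  {"der", "die", "das", "und", "ist", "ein"},
    "french":  {"le", "la", "les", "et", "est", "un"},
}

def detect_by_wordlist(text):
    """Count hits against every list; the list with the most hits wins."""
    words = text.lower().split()
    scores = {lang: sum(word in wordlist for word in words)
              for lang, wordlist in COMMON_WORDS.items()}
    return max(scores, key=scores.get)

print(detect_by_wordlist("das ist ein kurzer deutscher Satz"))  # german
```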
Actually, the easiest way is to use the Google or Microsoft Translate API; they have done all the hard work for you.
Here’s a webservice that you can use from Microsoft Translate:
http://msdn.microsoft.com/en-us/library/ff512411.aspx