I am interested in finding an approach that will detect what language a string of text is written in, the way Google Translate does.
The first challenge in identifying the language is to narrow the possibilities down to something manageable. Each language has its own letter-frequency list.
- English: etao insr hldc umfp gwyb vkxj qz
- French: esai tnru lodc mpév qfbg hjàx èyêz çôùâ ûîœw kïëü æñ
- German: enis ratd hulc gmob wfkz vüpä ßjöy qx
- Spanish: eaos rnid lctu mpbg yívq óhfz jéáñ xúüw k
(specifics from http://www.letterfrequency.org and Wikipedia: Letter frequency)
Using this information one can check which characters appear in the text and rapidly cut down the choices. If a ß appears, the text is most likely German. Certain characters appear only in certain languages. This is not foolproof, however: the text could be discussing classic heavy metal bands that like to use characters outside the norm for their language, such as Mötley Crüe (see Metal umlaut), or it could contain borrowed words (some people write résumé in English).
This is where multiple steps come in:
- Validate likely languages through the character set (a sketch of this step follows the list)
- Compare letter frequency to languages
- Compare specific words to a known dictionary for the language
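Here is a minimal sketch of the first step. The table of distinctive characters is purely illustrative (and deliberately tiny); a real implementation would need a much more complete mapping:

```python
# Characters that strongly hint at one language (illustrative, not exhaustive).
DISTINCTIVE_CHARS = {
    "german": set("äöüß"),
    "french": set("àâçèéêëîïôùûœ"),
    "spanish": set("áéíñóú¿¡"),
}

def candidate_languages(text):
    """Keep only languages whose distinctive characters occur in the text.

    If no distinctive character is found, every language remains a candidate.
    """
    chars = set(text.lower())
    hits = {lang for lang, distinctive in DISTINCTIVE_CHARS.items()
            if distinctive & chars}
    return hits or set(DISTINCTIVE_CHARS)

print(candidate_languages("Die Straße ist naß"))  # {'german'}
print(candidate_languages("plain ascii text"))    # all three remain candidates
```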
When doing letter-frequency analysis, one should maintain both an accented and an unaccented set, for situations where the writer uses plain unaccented Latin characters rather than making full use of the language's character set.
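For the unaccented set, one way to normalize text (using only Python's standard library) is to decompose characters and drop the combining marks:

```python
import unicodedata

def strip_accents(text):
    """Decompose each character (NFD), then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("résumé"))  # resume
```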
The combination of this information is then sent through statistical processes to produce the best guess at the language (and yes, I am completely glossing over this section because my statistical math is weak and it would go quite beyond the basics). More about this in Language identification: Statistical approaches; that Wikipedia article points to a number of papers and libraries on the subject.
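To make the glossed-over part slightly concrete: one simple scoring scheme (an assumption for illustration, not necessarily what production libraries do) is to measure the squared distance between the observed letter frequencies and each language's expected frequencies, then pick the minimum:

```python
from collections import Counter

# Expected relative letter frequencies (tiny illustrative subset; real tables
# would cover the whole alphabet, from sources like letterfrequency.org).
EXPECTED = {
    "english": {"e": 0.127, "t": 0.091, "a": 0.082, "o": 0.075},
    "german":  {"e": 0.164, "n": 0.098, "i": 0.076, "s": 0.073},
}

def frequency_distance(text, expected):
    """Sum of squared differences between observed and expected frequencies."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters) or 1
    observed = Counter(letters)
    return sum((observed[ch] / total - freq) ** 2
               for ch, freq in expected.items())

def guess_language(text):
    """Return the language whose frequency profile is closest to the text."""
    return min(EXPECTED, key=lambda lang: frequency_distance(text, EXPECTED[lang]))
```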
The tools use a mix of routines to determine the language of a string of text.
In some cases the presence of certain specific characters increases the likelihood of specific languages: letters with accents and umlauts, for example. In some cases it is specific words, such as the language-specific definite articles der, die, das in German. In other cases it is the character types used that identify the language family: Chinese, Arabic…
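The "character types" case can be approximated with Unicode character names, since the first word of a name is usually the script. This is a rough heuristic of my own, not a routine from any particular tool:

```python
import unicodedata
from collections import Counter

def dominant_script(text):
    """Guess the script family from Unicode character names, e.g.
    'LATIN SMALL LETTER A' -> LATIN, 'ARABIC LETTER MEEM' -> ARABIC."""
    scripts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts[name.split()[0]] += 1
    return scripts.most_common(1)[0][0] if scripts else "UNKNOWN"

print(dominant_script("你好世界"))  # CJK
print(dominant_script("مرحبا"))    # ARABIC
```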
There are a number of open source and paid libraries available. They are based on ingesting a large corpus of example documents to help with the probabilities. Some allow you to add additional rules if you have specific cases that cause problems.
The longer the string you want to identify, the better the results. A one-word string may fit multiple languages; a paragraph will give better results. Of course, if the document mixes languages in different sections, you can get unexpected results.
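As one concrete example of such a library (one option among several, not an endorsement from the answer above): the Python package langdetect, installable with pip install langdetect, offers both a single best guess and a probability list, the latter being useful for exactly the short, ambiguous strings described here:

```python
from langdetect import detect, detect_langs

print(detect("Das ist ein kurzer deutscher Satz."))  # de
# Short or mixed text is ambiguous; the probability list shows why:
print(detect_langs("resume"))  # e.g. [fr:0.57, en:0.42]; exact values vary per run
```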
A simple approach could be to use word lists (hash-maps) of the most common words for every language you want to support.
When you parse the text, take each word and check whether it can be found in one of these lists. The word list with the most hits is very likely that of the language the text is written in. (Check them all for every word; some words appear in more than one language, sometimes meaning something completely different.)
The shorter the texts you want to analyze, the longer the word-lists will have to be to reliably identify the language.
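A minimal sketch of this word-list approach, with deliberately tiny lists (real ones would hold the few hundred most common words per language):

```python
# Tiny illustrative word lists; real ones would be far longer.
COMMON_WORDS = {
    "english": {"the", "and", "is", "of", "to", "in"},
    "german":  {"der", "die", "das", "und", "ist", "ein"},
    "french":  {"le", "la", "les", "et", "est", "un"},
}

def detect_by_wordlist(text):
    """Count hits against every list; the list with the most hits wins."""
    words = text.lower().split()
    scores = {lang: sum(word in wordlist for word in words)
              for lang, wordlist in COMMON_WORDS.items()}
    return max(scores, key=scores.get)

print(detect_by_wordlist("das ist ein kurzer deutscher Satz"))  # german
```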
Actually, the easiest way is to use the Google or Microsoft Translate API; they have done all the hard work for you.
Here’s a webservice that you can use from Microsoft Translate:
http://msdn.microsoft.com/en-us/library/ff512411.aspx