Word recognition in a string without spaces or punctuation marks

I have a small C# project that reads a file and gives me an output: a string that contains neither spaces nor punctuation marks of any kind. It may also contain a few misspellings.

Ex.
Output:
THEQUICKBROWNFOXJUMPSOVERTHELAZYDOG

I wonder if there is a way to analyze this string using text mining/data mining and/or regular expressions to identify the words (preferably nouns, verbs and so forth) in the string?

I want to read a bunch of files, each giving me a different output, and rank them statistically, from the one with the most words found to the one that contains only a string of mumbo jumbo.

Also, the string may contain misspellings, like:
THEQUICGBROWNFOSJUMPSOVERTHHLAZYDOG
I know that a regular expression can calculate the “distance” from a misspelled word and find the best match (using a corpus and probability), but this might prove more of a challenge as the string does not have any spaces or punctuation marks.
Any ideas how I can solve this?

Here is the general approach:

1. Read a dictionary file and organize all words in a trie data structure. Many Unix systems have such files in the /usr/share/dict/ directory.
2. Find possible matches of a prefix of your input in the trie. This will usually produce multiple matches, for example theologyisabout begins with theology and the.
3. If we remove the matched prefixes, we get a set of possible continuations, on which we repeat step 2.

We then end up with a vast tree of possible interpretations.
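The three steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the exact implementation anyone in this thread used: the nested-dict trie, the function names and the tiny word list are all assumptions made for the example. A real run would load a word file such as one from /usr/share/dict/ instead.

```python
END = ""  # sentinel key meaning "a word ends here"; never equal to a single character


def build_trie(words):
    """Step 1: organize all words in a trie made of nested dicts."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def prefix_matches(trie, text, start):
    """Step 2: yield every end index e such that text[start:e] is a word."""
    node = trie
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:
            return
        if END in node:
            yield i + 1


def segmentations(trie, text, start=0):
    """Step 3: remove each matched prefix and repeat on the remainder."""
    if start == len(text):
        yield []
        return
    for end in prefix_matches(trie, text, start):
        for rest in segmentations(trie, text, end):
            yield [text[start:end]] + rest


# Hypothetical stand-in for a real dictionary file.
WORDS = ["the", "theology", "is", "about", "a", "bout"]
trie = build_trie(WORDS)
results = list(segmentations(trie, "theologyisabout"))
```

Here the branch that starts with "the" dies immediately because nothing in the word list begins with "o", while the "theology" branch survives. Without pruning, the number of branches grows quickly with longer inputs and bigger dictionaries.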

There are two problems with this:

there will be an exponential number of interpretations
we might miss interpretations because of an unknown word, or some unknown grammatical form

We can solve both of these problems by fuzzy matching. When we look up prefixes in the trie, we allow letters to be missing, inserted or changed. However, each such aberration increases the Levenshtein distance. If the summed Levenshtein distance of an interpretation becomes too high, we can prune that interpretation and concentrate on other branches. You could also keep the branches in a priority queue and always investigate the branch with the lowest current edit distance, which is the one most likely to be a sensible interpretation, not unlike Dijkstra’s pathfinding algorithm.
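A rough sketch of the priority-queue idea, under some simplifying assumptions: it loops over a plain word list instead of walking a trie (purely to keep the example short), and it caps the edits per word at min(max_word_edits, len(word) // 2). The function names and that cap are choices made for this example, not something prescribed above.

```python
import heapq


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def best_segmentation(text, words, max_word_edits=1):
    """Dijkstra-style search: nodes are positions in text, edges are words.

    Returns (total edits, list of words) for the cheapest interpretation,
    or None if no interpretation stays within the per-word budget.
    """
    heap = [(0, 0, [])]  # (edits so far, position, words so far)
    visited = set()
    while heap:
        cost, pos, path = heapq.heappop(heap)
        if pos == len(text):
            return cost, path
        if pos in visited:
            continue  # already expanded via a path that was at least as cheap
        visited.add(pos)
        for word in words:
            budget = min(max_word_edits, len(word) // 2)  # assumed cap
            for length in range(max(1, len(word) - budget), len(word) + budget + 1):
                end = pos + length
                if end > len(text):
                    break
                edits = levenshtein(word, text[pos:end])
                if edits <= budget:  # prune branches that are too garbled
                    heapq.heappush(heap, (cost + edits, end, path + [word]))
    return None


WORDS = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
result = best_segmentation("thequicgbrownfosjumpsoverthhlazydog", WORDS)
```

Because the queue always yields the branch with the fewest edits so far, the first branch that reaches the end of the string is a cheapest one, and expensive branches are never expanded at all.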

Note that multiple prefix sequences with different edit distances might lead to the same remaining string. You can therefore keep your progress in a data structure that allows parts to be shared. This caching will likely be beneficial for performance. If you in fact try to implement a variant of Dijkstra’s algorithm here, a known tail would correspond to a visited node in the graph.
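One simple way to share work on tails is to memoize on the start index of the remaining string, since two different prefix sequences that stop at the same position face exactly the same tail. The sketch below does this with functools.lru_cache; the brute-force word loop, the per-word edit cap and all names are assumptions made to keep the example small.

```python
from functools import lru_cache


def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def cheapest_parse(text, words, max_word_edits=1):
    """Return ((total edits, words) or None, cache statistics)."""

    @lru_cache(maxsize=None)
    def solve(start):
        # Best parse of the tail text[start:]; computed once per start index.
        if start == len(text):
            return (0, ())
        best = None
        for word in words:
            budget = min(max_word_edits, len(word) // 2)  # assumed cap
            for length in range(max(1, len(word) - budget), len(word) + budget + 1):
                end = start + length
                if end > len(text):
                    break
                edits = levenshtein(word, text[start:end])
                if edits > budget:
                    continue
                tail = solve(end)  # shared between all prefixes ending at `end`
                if tail is None:
                    continue
                candidate = (edits + tail[0], (word,) + tail[1])
                if best is None or candidate < best:
                    best = candidate
        return best

    return solve(0), solve.cache_info()


WORDS = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
parse, info = cheapest_parse("thequicgbrownfosjumpsoverthhlazydog", WORDS)
```

In this input, "quic" followed by "gbrown" and "quicg" followed by "brown" both end at the same position, so the tail starting at "fos..." is solved once and reused; info.hits counts such reuses.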

The difficult part is how to actually perform the fuzzy matching. E.g. you could decide on a maximum edit density of x edits per character (0 <= x <= 1), and abort an interpretation if it is guaranteed that this interpretation will have a higher density. For a given string with length l we can therefore determine an edit budget b = x · l. This budget is less important when matching prefixes in the trie, but the trie is only useful if there are fewer edits than characters in the prefix. An edit budget like b = floor(c / 2) for a prefix of length c might be sensible. How many edits you allow is not only a metric for how garbled a text your system is allowed to “understand”, but also a performance setting: smaller budgets run faster, as fewer alternatives have to be investigated.
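As a small worked example of the two budgets, here is a sketch with hypothetical helper names. Rounding the total budget down to a whole number of edits is an assumption; the formulas above do not say how to round.

```python
import math


def total_budget(length, density):
    """Edit budget b = x * l for a whole string, rounded down to whole edits."""
    if not 0 <= density <= 1:
        raise ValueError("density must be between 0 and 1")
    return math.floor(density * length)


def prefix_budget(prefix_length):
    """Edit budget b = floor(c / 2) for a single prefix of length c."""
    return prefix_length // 2


def should_abort(edits_so_far, length, density):
    """Edits only ever accumulate, so once the budget for the whole string
    is exceeded the interpretation is guaranteed to end above the density."""
    return edits_so_far > total_budget(length, density)


# The 35-character example string with x = 0.1 allows 3 edits in total.
budget = total_budget(35, 0.1)
```

With these numbers, an interpretation that has already spent 4 edits can be dropped at once, while a 5-letter prefix such as "quick" may absorb at most 2 edits on its own.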

Thanks to amon I managed to get the algorithm going!

Using this code, I implemented a trie and filled it with an English dictionary (around 23600 words).

Words can then be found and analyzed as follows: start reading at an index in the string and feed the trie one character after another until it no longer finds any possible continuation (either because of a misspelled word, or because a real word has ended and the next one has begun). Judge that outcome, then increase the index by 1 and repeat.
V
THEQUICKBROWNFOX… Finds THE

_V
THEQUICKBROWNFOX… Finds HE

__V
THEQUICKBROWNFOX… Finds EQ
and so forth.
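The scan illustrated above might look roughly like this in Python. It is a sketch, not the project's actual C# code: the nested-dict trie, the function names, the small word list, and the choice to keep the longest word found at each index are all assumptions.

```python
END = ""  # sentinel key meaning "a word ends here"


def build_trie(words):
    """Build a nested-dict trie from an iterable of words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def longest_word_at(trie, text, start):
    """Feed characters from text[start:] into the trie until it has no
    continuation; return the longest word seen on the way, or None."""
    node, best = trie, None
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:
            break
        if END in node:
            best = text[start:i + 1]
    return best


def scan(trie, text):
    """Try every start index in turn, moving the index forward by 1 each time."""
    found = []
    for start in range(len(text)):
        word = longest_word_at(trie, text, start)
        if word is not None:
            found.append((start, word))
    return found


# Hypothetical stand-in for the real dictionary.
WORDS = ["THE", "HE", "QUICK", "BROWN", "ROW", "OW", "OWN", "FOX", "OX"]
hits = scan(build_trie(WORDS), "THEQUICKBROWNFOX")
```

Each hit is a (start index, word) pair, so overlapping finds such as THE at index 0 and HE at index 1 are both reported and can be judged afterwards.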

With this sequence of found words it is possible to check the weighted edit distance between words and find misspellings. However, due to a lack of time this was never fully implemented in my project.
My project takes a more advanced approach, as it is a statistics tool that runs several iterations over a given set of texts, so feel free to ask if you have more specific questions and I will answer to the best of my ability.
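Since the weighted edit distance was never implemented in the project, the following is only a sketch of what it could look like. The weights are purely hypothetical: the example assumes that certain letter pairs (here K/G and X/S) are confused more often and therefore cost less to substitute.

```python
def weighted_edit_distance(a, b, sub_cost=None, ins_cost=1.0, del_cost=1.0):
    """Edit distance in which every operation carries its own weight.

    sub_cost(x, y) returns the price of replacing x with y; it defaults to
    1.0 for any pair, which makes the result the plain Levenshtein distance.
    """
    if sub_cost is None:
        sub_cost = lambda x, y: 1.0
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * del_cost]
        for j, cb in enumerate(b, 1):
            substitute = prev[j - 1] + (0.0 if ca == cb else sub_cost(ca, cb))
            cur.append(min(prev[j] + del_cost,   # drop ca
                           cur[j - 1] + ins_cost,  # add cb
                           substitute))
        prev = cur
    return prev[-1]


# Hypothetical table of letter pairs assumed to be mixed up often.
CONFUSABLE = {frozenset("KG"), frozenset("XS")}


def example_sub_cost(x, y):
    """Charge half price for substitutions listed in CONFUSABLE."""
    return 0.5 if frozenset((x, y)) in CONFUSABLE else 1.0


plain = weighted_edit_distance("QUICK", "QUICG")
weighted = weighted_edit_distance("QUICK", "QUICG", example_sub_cost)
```

With real data, the substitution weights would come from observed error statistics for whatever produces the input files rather than from a hand-written table.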

Thank you for all the help in this!


Filed under: softwareengineering - @ 07:02

Tags: c++, data-mining, regular-expressions, statistics, strings
