I have a random text and inside this text there could be English words.
Example:
19:51 % & @ N Q= 74%m
- - —
FR1ENDS PAR
Hello %
& BYE## %
N
oo:oo 6 -
. . - ~
AMOUNT 82,354,763/ 21,000,000 0/4nmsng
O0a® ©
HISTORY BOOK
(+.] 1] @) <
What I basically thought is basically to split text into words and then check if that word is contained inside the English dictionary.
If the detected word is an exact match then the operation would be O(1).
But if the word isn’t exactly a match because it’s slightly different, such as “FR1ENDS”, then I would to use an algorithm that check for string similarity and perform that check on the whole dictionary, which would be O(N) and for the English dictionary that would mean a very expensive operation.
So, I am thinking if maybe I could use a regex to remove words or line that simply does not make sense, such as “oo:oo 6 -“, this way even though I wouldn’t have the exact match of the English words I would know that the remaining words or lines make some sense.
But I have no idea how to create a regex like that.
I am open to any other suggestion.
zaxunobi is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.