I was told to create two lists of the most frequently used expressions in a plain-text corpus (about 10 MB of arbitrary text): monograms (single-word expressions such as human, water, is) and bigrams (two-word expressions such as basketball team, united states, etc.).
I am stuck and don't know how to go about it. How can I distinguish between the two?
My domain is not English; I only gave those examples to make my meaning clearer.
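You can read the text word by word and keep two Dictionary instances, one for monograms and one for bigrams, with the expression as the Key and its number of occurrences as the Value. Every single word counts as a monogram, and every pair of adjacent words counts as a bigram. Sorting each dictionary by Value then gives you the most frequently used expressions.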
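As a rough illustration of the two-dictionary idea, here is a minimal Python sketch. The function name `count_ngrams` and the sample lines are made up for this example, and the tokenizer (a simple `\w+` regular expression on lowercased text) is an assumption that may need adjusting for your language. It also assumes that bigrams may span line breaks and sentence boundaries.

```python
import re
from collections import Counter

def count_ngrams(lines):
    """Count single words and adjacent word pairs in an iterable of text lines."""
    monograms = Counter()
    bigrams = Counter()
    previous = None  # last word seen, carried across line breaks (an assumption)
    for line in lines:
        # \w+ matches runs of Unicode letters, digits and underscores
        for word in re.findall(r"\w+", line.lower()):
            monograms[word] += 1
            if previous is not None:
                bigrams[(previous, word)] += 1
            previous = word
    return monograms, bigrams

# Hypothetical sample input; a real run would pass an open file instead
sample = ["the united states team", "met the united states press"]
monograms, bigrams = count_ngrams(sample)
top_words = monograms.most_common(3)  # three most frequent single words
top_pairs = bigrams.most_common(2)    # two most frequent word pairs
```

Because the function only iterates over lines, an open file object can be passed directly, so the whole text never has to be held in memory as one string.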
You can also use database storage for bigger files.
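If you go the database route, one possible shape is sketched below with Python's built-in `sqlite3` module. The table name `counts`, its columns and the helper `add_counts` are hypothetical choices for this example, not a prescribed schema. An in-memory database is used here only so the sketch is self-contained.

```python
import sqlite3

def add_counts(conn, expressions):
    """Increment the stored count for every expression in the iterable."""
    for expr in expressions:
        # Create the row with a zero count if it is new, then increment it
        conn.execute("INSERT OR IGNORE INTO counts (expr, n) VALUES (?, 0)", (expr,))
        conn.execute("UPDATE counts SET n = n + 1 WHERE expr = ?", (expr,))
    conn.commit()

conn = sqlite3.connect(":memory:")  # pass a file path instead to keep counts on disk
conn.execute("CREATE TABLE counts (expr TEXT PRIMARY KEY, n INTEGER NOT NULL)")
add_counts(conn, ["united states", "basketball team", "united states"])

# Most frequent expressions first; ties are broken alphabetically
top = conn.execute(
    "SELECT expr, n FROM counts ORDER BY n DESC, expr LIMIT 2"
).fetchall()
```

The same table layout works for monograms and bigrams alike, either as two tables or as one table with an extra column marking the expression length.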