I am in a search of a better performant algorithm to implement a program, to find the count of each word in a sentence i.e.., a String in the programming sense.
I know my algorithm is too basic and Noob. Here’s my algorithm.
- Read the String.
- Tokenize the String with the delimiter, put each token in a list(ArrayList).
- Iterate the generated list, in each iteration compare the current string with every other string of the list, if it matches increase the count of this string.
- For each iteration, put the String and its count in a map.
- Iterate the map and display, with key as string and value as its count.
What is the best approach than this?
1
Replace steps 3 and 4 with the following
- If string is not in the map, store it in the map with value
Integer.valueOf(1)
- If is already in the map, get the current value, increment it, and replace the mapping.
0
- Create a Scanner for your InputStream, Reader, File, String, etc
- Iterate over the tokens
- If a word doesn’t already exist in your Map, add it with a value of 1.
- If a word already exists in the map, increment its value.
- When done, close the Scanner.
- Iterate over the map and display.
Your solution and @parsifal’s take a couple of unnecessary steps that are avoided with this solution:
- When using a Scanner to read an InputStream, Reader, or File, you don’t need to read the whole thing into memory at once, making this solution work with very large inputs.
- In addition to storing the entire input in a String, you duplicate the whole thing (minus spaces) in an ArrayList. In this solution there is no intermediate storage.
Here’s a quick intro to Scanner.
The inefficient part of your algorithm is in fact the ArrayList, because it can’t be traversed in a efficient way. The appropriate data structure is in your case a binary tree.
A very detailed discussion of binary trees: http://www.shiffman.net/teaching/a2z/concordance/
1
The trick is to use the right data structure for the job, and in this case the right data structure is a multiset. In fact, the multiset is the answer, you don’t even need an algorithm.
Here’s an example in Ruby:
require 'multiset'
s = 'Hello World Hello Programming Hello World'
Multiset.new(s.split)
# => #<Multiset:#3 "Hello", #2 "World", #1 "Programming">