Sorting versus hashing

My problem is as follows. I have an array of n strings with m < n of them distinct. I want to create a one-to-one function which assigns each of the m distinct strings to the numbers 0 ... m-1. For example, if my strings are:

Bob, Amy, Bob, Charlie, Amy

then the function:

Bob -> 0, Amy -> 1, Charlie -> 2

would meet my needs. I have thought of three possible approaches:

  1. Sort the list of strings, remove duplicates, and construct the function using a search algorithm.

  2. Create a hash table and check each string to see if it is already in the table before inserting it.

  3. Sort the list of strings, remove duplicates, and put the resulting list into a hash table.

My code will be written in Java, and I will likely use standard Java algorithms: merge sort for sorting, binary search for searching, and whatever the standard Java hash table algorithm is.

Question: Assume that after creating the function I will have to evaluate it on each of the n original strings. Which of the three approaches is fastest? Is there a better way?

Part of the problem is that I don’t really know what’s going on “under the hood” in standard hashing algorithms. Any help would be appreciated.


Firstly, I’ll note that there are two different things to be concerned with: how long it will take to assign all the numbers, and how long it will take to look up a number later.

Sort the list of strings, remove duplicates, and construct the function using a search algorithm

Sorting will take O(n log n) time, and you can remove the duplicates in O(n) time. To look up a number you can use binary search, which takes O(log m) time per search.

Create a hash table and check each string to see if it is already in the table before inserting it.

Hash table operations are typically considered to be O(1). This is a bit of a lie because it depends on the number of collisions you get, but it’s close to O(1) and definitely better than O(log m). Checking and inserting all values will then be O(n). Fetching the numbers later will be O(1) for each fetch.

Sort the list of strings, remove duplicates, and put the resulting list into a hash table.

Sorting and deduplicating will take O(n log n) time. Inserting the results into a hash table will take O(m) time, and fetching individual elements later will take O(1) time each. This is basically the same as the previous option, except that you spend O(n log n) extra time sorting.

The second option will be fastest, at least for large enough cases.
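For concreteness, here is a minimal Java sketch of the second option, using java.util.HashMap (the standard Java hash table). The class and variable names are just for illustration:

    import java.util.HashMap;
    import java.util.Map;

    public class Interner {
        public static void main(String[] args) {
            String[] names = {"Bob", "Amy", "Bob", "Charlie", "Amy"};

            // Assign each distinct string the next unused number, in first-seen order.
            Map<String, Integer> ids = new HashMap<>();
            for (String s : names) {
                // putIfAbsent evaluates ids.size() before the call, so the map
                // only grows (and the id only advances) on a genuine insert.
                ids.putIfAbsent(s, ids.size());
            }

            // Evaluating the function on the original strings is one O(1)
            // lookup per string.
            for (String s : names) {
                System.out.println(s + " -> " + ids.get(s));
            }
            // Prints: Bob -> 0, Amy -> 1, Bob -> 0, Charlie -> 2, Amy -> 1
        }
    }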


The way I like to tackle this is with a coarse hash into buckets, then sorting and removing duplicates within each bucket in parallel. This is for inputs large enough to benefit from all of this plus multithreading (say at least tens of thousands of strings, but more realistically hundreds of thousands to millions). Otherwise the technique tends to add more work than it saves.

By coarse hash, I mean, say, make 64 resizable random-access sequences.

sequence buckets[64]

Then for each string, just examine the first character and put it in the right bucket.

for each string:
    buckets[string[0] % 64].append(string)

Then with a parallel for loop, sort each bucket and remove duplicates.

parallel for each bucket:
    sort(bucket)
    unique(bucket)

This first “coarse hashing” step is an easy way to parallelize many operations. In this case you actually want many collisions, so that the buckets fill up and can be processed in parallel, ideally with a reasonably even distribution of collisions across buckets.

Ideally you also construct a list of references/pointers to the non-empty buckets before processing them in parallel, since it’d be a waste to grab a thread from the thread pool just to process an empty bucket. You could also hash in parallel to remove duplicates, but I typically find it easier and less fiddly to get good results by sorting each bucket; otherwise the threads can end up allocating a whole lot of memory.
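To make the shape of this concrete, here is a rough Java sketch of the bucket pipeline. This is my own illustration, not the code behind the benchmarks below; it assumes non-empty strings and uses parallel streams in place of an explicit thread pool:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;
    import java.util.TreeSet;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;

    public class CoarseHashDedup {
        static final int NUM_BUCKETS = 64;

        // Returns the distinct strings. Each bucket is sorted internally, but
        // the buckets themselves are not merged, which is all interning needs.
        static List<String> distinct(String[] strings) {
            List<List<String>> buckets = new ArrayList<>(NUM_BUCKETS);
            for (int i = 0; i < NUM_BUCKETS; i++) buckets.add(new ArrayList<>());

            // Coarse hash: route each string by its first character. Equal
            // strings always land in the same bucket, so deduplicating each
            // bucket independently is globally correct.
            for (String s : strings) {
                buckets.get(s.charAt(0) % NUM_BUCKETS).add(s);
            }

            // Sort and deduplicate the non-empty buckets in parallel. A TreeSet
            // sorts and removes duplicates in one step; an explicit sort plus a
            // linear unique pass would work just as well.
            return IntStream.range(0, NUM_BUCKETS).parallel()
                    .filter(i -> !buckets.get(i).isEmpty()) // skip empty buckets
                    .mapToObj(i -> new TreeSet<>(buckets.get(i)))
                    .flatMap(Set::stream)
                    .collect(Collectors.toList());
        }
    }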

Now if you want to get fancy, say the first step didn’t distribute the elements into buckets very evenly, and one thread encounters a bucket with a million elements in it. In that case you can repeat the step within that thread/bucket, coarse hashing the elements using the second character as the key, and then recursively sort/unique those sub-buckets in parallel.
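A sketch of that recursive fallback, under the same assumptions as before (the 100,000-element threshold and the depth cap are arbitrary choices of mine):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeSet;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;

    // If a bucket is still huge, re-bucket it by the character at the next
    // position and recurse; otherwise sort/unique it directly. Strings shorter
    // than the current depth all fall into sub-bucket 0, which is still correct
    // for deduplication since equal strings always route identically. The
    // concatenated result is duplicate-free but not globally sorted.
    static List<String> sortUnique(List<String> bucket, int depth) {
        // The depth cap guards against pathological cases, e.g. a bucket made
        // of many copies of one string, which would otherwise never shrink.
        if (bucket.size() < 100_000 || depth > 8) {
            return new ArrayList<>(new TreeSet<>(bucket)); // sort + dedupe
        }
        List<List<String>> sub = new ArrayList<>(64);
        for (int i = 0; i < 64; i++) sub.add(new ArrayList<>());
        for (String s : bucket) {
            sub.get(s.length() > depth ? s.charAt(depth) % 64 : 0).add(s);
        }
        return IntStream.range(0, 64).parallel()
                .mapToObj(i -> sortUnique(sub.get(i), depth + 1))
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }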

Some benchmarks:

Sorting 10000000 elements 3 times...

mt_sort_int: {0.135 secs}
-- small result: [ 12 17 17 22 52 55 67 73 75 87 ]

mt_sort: {0.445 secs}
-- small result: [ 12 17 17 22 52 55 67 73 75 87 ]

mt_radix_sort: {0.228 secs}
-- small result: [ 12 17 17 22 52 55 67 73 75 87 ]

std::sort: {1.697 secs}
-- small result: [ 12 17 17 22 52 55 67 73 75 87 ]

qsort: {2.610 secs}
-- small result: [ 12 17 17 22 52 55 67 73 75 87 ]

mt_sort, mt_radix_sort, and mt_sort_int above use the technique described above (though with a lot more fluff for production use, like special cases to handle inputs of different sizes), minus the linear pass to eliminate consecutive duplicates (trivial processing). The only difference between the three is the sorting algorithm they use (mt_sort_int and mt_radix_sort use linear-time sorts; mt_sort uses a linearithmic one). The differences get more and more pronounced the larger the input, with the fastest mt_* versions becoming proportionally even faster and the C and C++ standard library sorts (qsort and std::sort) becoming relatively even slower.

The only comparison-based sort I know of that rivals mt_sort is Intel’s parallel sort from Threading Building Blocks, which outperforms mt_sort in some cases while mt_sort outperforms it in others (pretty much neck and neck).

If you just want to filter out duplicates for the purposes of interning (mapping a unique index to each unique string), you can do it in a fraction of the time of the benchmarks above. Most of the time spent in mt_sort, mt_radix_sort, and mt_sort_int isn’t in partitioning elements into buckets and sorting the buckets in parallel; it’s in merging the per-bucket results into one giant sorted list (qsort and std::sort skip this step since they’re single-threaded sorts that don’t bother with buckets). When you just want to filter out duplicates, you don’t need that elaborate merge step between the buckets.

Part of the problem is that I don’t really know what’s going on “under the hood” in standard hashing algorithms. Any help would be appreciated.

I don’t know for sure either, but I think most of them use open addressing, possibly with just linear probing, since it lends itself well to cache hits.
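(For what it’s worth, java.util.HashMap uses separate chaining rather than open addressing.) The linear-probing idea looks roughly like this; a toy illustration, not any library’s actual implementation:

    // Linear probing: start at the hashed slot and walk forward until the key
    // or an empty slot is found. Assumes the table is never completely full.
    static int probeFor(String[] table, String key) {
        int i = (key.hashCode() & 0x7fffffff) % table.length;
        while (table[i] != null && !table[i].equals(key)) {
            i = (i + 1) % table.length; // a collision steps to the adjacent
                                        // slot, which is cache-friendly
        }
        return i; // the key's slot, or the empty slot where it would go
    }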

I actually find more use for hash tables with separate chaining that never allocate more buckets than the user requests, which makes the memory use of the entire table perfectly predictable (4 bytes per bucket + 8 bytes per element inserted). The trick to making them reasonable competitors against open addressing is to avoid heap allocations per node and instead link nodes up through indices, with each node’s index field doubling as either the index of the next used node in the bucket or the next free node to reclaim upon a subsequent insertion if the node was removed. Spatial locality can degrade with this technique, but making a copy of the hash table makes the neighbors in each bucket perfectly contiguous again, if your use case can afford an “optimization” pass from time to time to improve locality of reference.
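Here is a minimal Java sketch of that layout; my own illustration rather than the code described above, with a fixed capacity and no growth (in Java the per-node cost is an int link plus an object reference, rather than the exact byte counts quoted above):

    import java.util.Arrays;

    // Separate chaining with nodes stored in parallel arrays and linked by int
    // indices instead of per-node heap allocations. The next[] entry for a node
    // doubles as either the next used node in its bucket's chain or, once the
    // node is removed, the next node on the free list.
    public class IndexChainedSet {
        private final int[] buckets; // head node index per bucket, -1 if empty
        private final String[] keys; // key stored in each node slot
        private final int[] next;    // chain link / free-list link
        private int used = 0;        // node slots handed out so far
        private int freeHead = -1;   // head of the free list, -1 if empty

        public IndexChainedSet(int bucketCount, int capacity) {
            buckets = new int[bucketCount];
            Arrays.fill(buckets, -1);
            keys = new String[capacity];
            next = new int[capacity];
        }

        public boolean add(String key) {
            int b = (key.hashCode() & 0x7fffffff) % buckets.length;
            for (int i = buckets[b]; i != -1; i = next[i]) {
                if (keys[i].equals(key)) return false; // already present
            }
            int node;
            if (freeHead != -1) {            // reclaim a removed node's slot
                node = freeHead;
                freeHead = next[freeHead];
            } else {
                node = used++;               // assumes capacity isn't exceeded
            }
            keys[node] = key;
            next[node] = buckets[b];         // push onto the bucket's chain
            buckets[b] = node;
            return true;
        }

        public boolean remove(String key) {
            int b = (key.hashCode() & 0x7fffffff) % buckets.length;
            for (int i = buckets[b], prev = -1; i != -1; prev = i, i = next[i]) {
                if (keys[i].equals(key)) {
                    if (prev == -1) buckets[b] = next[i];
                    else next[prev] = next[i];
                    keys[i] = null;
                    next[i] = freeHead;      // same field now links the free list
                    freeHead = i;
                    return true;
                }
            }
            return false;
        }
    }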

I tend to beat things like unordered_map in C++ using the above technique, though theoretically the optimal efficiency probably lies with unordered_map if you can get the hashing just right. Where I like mine better is that I don’t have to put much thought into the hashing to get something very reasonable, and it doesn’t explode in memory use behind my back. I know exactly how much memory it will take upfront, based just on how many buckets I request and how many elements I insert.
