Sorting versus hashing

My problem is as follows. I have an array of n strings with m < n of them distinct. I want to create a one-to-one function which assigns each of the m distinct strings to the numbers 0 ... m-1. For example, if my strings are:

Bob, Amy, Bob, Charlie, Amy

then the function:

Bob -> 0, Amy -> 1, Charlie -> 2

would meet my needs. I have thought of three possible approaches:

  1. Sort the list of strings, remove duplicates, and construct the function using a search algorithm.

  2. Create a hash table and check each string to see if it is already in the table before inserting it.

  3. Sort the list of strings, remove duplicates, and put the resulting list into a hash table.

My code will be written in Java, and I will likely use standard Java algorithms: merge sort for sorting, binary search for searching, and whatever the standard Java hash table algorithm is.

Question: Assume that after creating the function I will have to evaluate it on each of the n original strings. Which of the three approaches is fastest? Is there a better way?

Part of the problem is that I don’t really know what’s going on “under the hood” in standard hashing algorithms. Any help would be appreciated.


Firstly, I’ll note that there are two different things to be concerned with: how long it will take to assign all the numbers, and how long it will take to look up a number later.

Sort the list of strings, remove duplicates, and construct the function using a search algorithm

Sorting will take O(n log n) time, and you can remove the duplicates in O(n) time. To look up a number you can use binary search, which takes O(log m) time per search.

Create a hash table and check each string to see if it is already in the table before inserting it.

Hash table operations are typically considered to be O(1). This is a bit of a lie because it depends on the number of collisions you get, but it’s close to O(1) and definitely better than O(log m). Checking and inserting all values will then be O(n). Fetching the numbers later will be O(1) for each fetch.

Sort the list of strings, remove duplicates, and put the resulting list into a hash table.

Sorting and deduplicating will take O(n log n) time. Inserting the results into a hash table will take O(m) time, and fetching individual elements later will take O(1) time each. This is basically the same as the previous option, except that you spend O(n log n) extra time sorting.

The second option will be fastest, at least for large enough cases.
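For concreteness, here is a minimal Java sketch of the second option, using java.util.HashMap (the standard Java hash table). The class and variable names are just for illustration:

    import java.util.HashMap;
    import java.util.Map;

    public class Interner {
        public static void main(String[] args) {
            String[] names = {"Bob", "Amy", "Bob", "Charlie", "Amy"};

            // Assign each distinct string the next unused number, in first-seen order.
            Map<String, Integer> ids = new HashMap<>();
            for (String s : names) {
                // putIfAbsent evaluates ids.size() before the call, so the map
                // only grows (and the id only advances) on a genuine insert.
                ids.putIfAbsent(s, ids.size());
            }

            // Evaluating the function on the original strings is one O(1)
            // lookup per string.
            for (String s : names) {
                System.out.println(s + " -> " + ids.get(s));
            }
            // Prints: Bob -> 0, Amy -> 1, Bob -> 0, Charlie -> 2, Amy -> 1
        }
    }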


The way I like to tackle this is with a coarse hash into buckets, then sorting and removing duplicates within each bucket in parallel. This is for inputs large enough to benefit from all of this plus multithreading (say at least tens of thousands of strings, but more realistically hundreds of thousands to millions). Otherwise the technique tends to add more work than it saves.

By coarse hash, I mean, say, make 64 resizable random-access sequences.

sequence buckets[64]

Then for each string, just examine the first character and put it in the right bucket.

for each string:
    buckets[string[0] % 64].append(string)

Then with a parallel for loop, sort each bucket and remove duplicates.

parallel for each bucket:
    sort(bucket)
    unique(bucket)

This first “coarse hashing” step is an easy way to parallelize many operations. In this case you actually want many collisions, so that the buckets fill up and can be processed in parallel, ideally with a reasonably even distribution of collisions across buckets.

Ideally you also construct a list of references/pointers to the non-empty buckets before processing them in parallel, since it’d be a waste to grab a thread from the thread pool just to process an empty bucket. You could also hash in parallel to remove duplicates, but I typically find it easier and less fiddly to get good results by sorting each bucket; otherwise the threads can end up allocating a whole lot of memory.
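To make the shape of this concrete, here is a rough Java sketch of the bucket pipeline. This is my own illustration, not the code behind the benchmarks below; it assumes non-empty strings and uses parallel streams in place of an explicit thread pool:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;
    import java.util.TreeSet;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;

    public class CoarseHashDedup {
        static final int NUM_BUCKETS = 64;

        // Returns the distinct strings. Each bucket is sorted internally, but
        // the buckets themselves are not merged, which is all interning needs.
        static List<String> distinct(String[] strings) {
            List<List<String>> buckets = new ArrayList<>(NUM_BUCKETS);
            for (int i = 0; i < NUM_BUCKETS; i++) buckets.add(new ArrayList<>());

            // Coarse hash: route each string by its first character. Equal
            // strings always land in the same bucket, so deduplicating each
            // bucket independently is globally correct.
            for (String s : strings) {
                buckets.get(s.charAt(0) % NUM_BUCKETS).add(s);
            }

            // Sort and deduplicate the non-empty buckets in parallel. A TreeSet
            // sorts and removes duplicates in one step; an explicit sort plus a
            // linear unique pass would work just as well.
            return IntStream.range(0, NUM_BUCKETS).parallel()
                    .filter(i -> !buckets.get(i).isEmpty()) // skip empty buckets
                    .mapToObj(i -> new TreeSet<>(buckets.get(i)))
                    .flatMap(Set::stream)
                    .collect(Collectors.toList());
        }
    }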

Now if you want to get fancy, say the first step didn’t distribute the elements into buckets very evenly, and one thread encounters a bucket with a million elements in it. In that case you can repeat the step within that thread/bucket, coarse hashing the elements using the second character as the key, and then recursively sort/unique those sub-buckets in parallel.
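A sketch of that recursive fallback, under the same assumptions as before (the 100,000-element threshold and the depth cap are arbitrary choices of mine):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeSet;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;

    // If a bucket is still huge, re-bucket it by the character at the next
    // position and recurse; otherwise sort/unique it directly. Strings shorter
    // than the current depth all fall into sub-bucket 0, which is still correct
    // for deduplication since equal strings always route identically. The
    // concatenated result is duplicate-free but not globally sorted.
    static List<String> sortUnique(List<String> bucket, int depth) {
        // The depth cap guards against pathological cases, e.g. a bucket made
        // of many copies of one string, which would otherwise never shrink.
        if (bucket.size() < 100_000 || depth > 8) {
            return new ArrayList<>(new TreeSet<>(bucket)); // sort + dedupe
        }
        List<List<String>> sub = new ArrayList<>(64);
        for (int i = 0; i < 64; i++) sub.add(new ArrayList<>());
        for (String s : bucket) {
            sub.get(s.length() > depth ? s.charAt(depth) % 64 : 0).add(s);
        }
        return IntStream.range(0, 64).parallel()
                .mapToObj(i -> sortUnique(sub.get(i), depth + 1))
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }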

Some benchmarks:

Sorting 10000000 elements 3 times...

mt_sort_int: {0.135 secs}
-- small result: [ 12 17 17 22 52 55 67 73 75 87 ]

mt_sort: {0.445 secs}
-- small result: [ 12 17 17 22 52 55 67 73 75 87 ]

mt_radix_sort: {0.228 secs}
-- small result: [ 12 17 17 22 52 55 67 73 75 87 ]

std::sort: {1.697 secs}
-- small result: [ 12 17 17 22 52 55 67 73 75 87 ]

qsort: {2.610 secs}
-- small result: [ 12 17 17 22 52 55 67 73 75 87 ]

mt_sort, mt_radix_sort, and mt_sort_int above use the technique described above (though with a lot more fluff for production use, like special cases to handle inputs of different sizes), minus the linear pass to eliminate consecutive duplicates (trivial processing). The only difference between the three is the sorting algorithm they use (mt_sort_int and mt_radix_sort use linear-time sorts; mt_sort uses a linearithmic one). The differences get more and more pronounced the larger the input, with the fastest mt_* versions becoming proportionally even faster and the C and C++ standard library sorts (qsort and std::sort) becoming relatively even slower.

The only comparison-based sort I know of that rivals mt_sort is Intel’s parallel sort from Threading Building Blocks, which outperforms mt_sort in some cases while mt_sort outperforms it in others (pretty much neck and neck).

If you just want to filter out duplicates for the purposes of interning (mapping a unique index to each unique string), you can do it in a fraction of the time of the benchmarks above. Most of the time spent in mt_sort, mt_radix_sort, and mt_sort_int isn’t in partitioning elements into buckets and sorting the buckets in parallel; it’s in merging the per-bucket results into one giant sorted list (qsort and std::sort skip this step since they’re single-threaded sorts that don’t bother with buckets). When you just want to filter out duplicates, you don’t need that elaborate merge step between the buckets.

Part of the problem is that I don’t really know what’s going on “under the hood” in standard hashing algorithms. Any help would be appreciated.

I don’t know for sure either, but I think most of them use open addressing, possibly with just linear probing, since it lends itself well to cache hits.
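(For what it’s worth, java.util.HashMap uses separate chaining rather than open addressing.) The linear-probing idea looks roughly like this; a toy illustration, not any library’s actual implementation:

    // Linear probing: start at the hashed slot and walk forward until the key
    // or an empty slot is found. Assumes the table is never completely full.
    static int probeFor(String[] table, String key) {
        int i = (key.hashCode() & 0x7fffffff) % table.length;
        while (table[i] != null && !table[i].equals(key)) {
            i = (i + 1) % table.length; // a collision steps to the adjacent
                                        // slot, which is cache-friendly
        }
        return i; // the key's slot, or the empty slot where it would go
    }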

I actually find more use for hash tables with separate chaining that never allocate more buckets than the user requests, which makes the memory use of the entire table perfectly predictable (4 bytes per bucket + 8 bytes per element inserted). The trick to making them reasonable competitors against open addressing is to avoid heap allocations per node and instead link nodes up through indices, with each node’s index field doubling as either the index of the next used node in the bucket or the next free node to reclaim upon a subsequent insertion if the node was removed. Spatial locality can degrade with this technique, but making a copy of the hash table makes the neighbors in each bucket perfectly contiguous again, if your use case can afford an “optimization” pass from time to time to improve locality of reference.
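Here is a minimal Java sketch of that layout; my own illustration rather than the code described above, with a fixed capacity and no growth (in Java the per-node cost is an int link plus an object reference, rather than the exact byte counts quoted above):

    import java.util.Arrays;

    // Separate chaining with nodes stored in parallel arrays and linked by int
    // indices instead of per-node heap allocations. The next[] entry for a node
    // doubles as either the next used node in its bucket's chain or, once the
    // node is removed, the next node on the free list.
    public class IndexChainedSet {
        private final int[] buckets; // head node index per bucket, -1 if empty
        private final String[] keys; // key stored in each node slot
        private final int[] next;    // chain link / free-list link
        private int used = 0;        // node slots handed out so far
        private int freeHead = -1;   // head of the free list, -1 if empty

        public IndexChainedSet(int bucketCount, int capacity) {
            buckets = new int[bucketCount];
            Arrays.fill(buckets, -1);
            keys = new String[capacity];
            next = new int[capacity];
        }

        public boolean add(String key) {
            int b = (key.hashCode() & 0x7fffffff) % buckets.length;
            for (int i = buckets[b]; i != -1; i = next[i]) {
                if (keys[i].equals(key)) return false; // already present
            }
            int node;
            if (freeHead != -1) {            // reclaim a removed node's slot
                node = freeHead;
                freeHead = next[freeHead];
            } else {
                node = used++;               // assumes capacity isn't exceeded
            }
            keys[node] = key;
            next[node] = buckets[b];         // push onto the bucket's chain
            buckets[b] = node;
            return true;
        }

        public boolean remove(String key) {
            int b = (key.hashCode() & 0x7fffffff) % buckets.length;
            for (int i = buckets[b], prev = -1; i != -1; prev = i, i = next[i]) {
                if (keys[i].equals(key)) {
                    if (prev == -1) buckets[b] = next[i];
                    else next[prev] = next[i];
                    keys[i] = null;
                    next[i] = freeHead;      // same field now links the free list
                    freeHead = i;
                    return true;
                }
            }
            return false;
        }
    }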

I tend to beat things like unordered_map in C++ using the above technique, though theoretically the optimal efficiency probably lies with unordered_map if you can get the hashing just right. Where I like mine better is that I don’t have to put much thought into the hashing to get something very reasonable, and it doesn’t explode in memory use behind my back. I know exactly how much memory it will take upfront, based just on how many buckets I request and how many elements I insert.
