What are the advantages of linear probing over separate chaining or vice-versa when implementing hash tables?

I’ve been brushing up on algorithms and reviewed these two methods of implementing hash tables. It seems like they largely have similar performance characteristics and memory requirements.

I can think of some disadvantages to linear probing — namely, that widening the array could be expensive (but this is done, what, 2 log N times at most? Probably not a big deal) and that managing deletions is a bit more difficult. But I’m assuming there are advantages, too, or it wouldn’t be presented in textbooks as a viable method of implementation next to the more obvious implementation.

Why would you choose one over the other?

With linear probing (or any probing scheme, really), a deletion has to be “soft”. This means putting in a dummy value (often called a tombstone) that won’t match anything the user could search for; otherwise you would need to rehash on every deletion. Even with tombstones, it’s still advisable to rehash once too many of them build up, or to use some other strategy to defrag the graveyard.
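
As a rough illustration, here is a minimal C sketch of soft deletion in an open-addressing table of string keys, where NULL marks a never-used slot and a dedicated sentinel pointer serves as the tombstone (the names and layout are mine, purely for illustration):

#include <string.h>

#define TABLE_SIZE 1024

static const char *TOMBSTONE = "\x01tombstone";   /* sentinel no caller can pass in */
static const char *slots[TABLE_SIZE];             /* NULL = never used */

/* Any reasonable string hash works here; FNV-1a is just an example. */
static unsigned hash(const char *key)
{
    unsigned h = 2166136261u;
    while (*key)
        h = (h ^ (unsigned char)*key++) * 16777619u;
    return h;
}

/* Probe until the key or a truly empty slot is found. Tombstones are
 * skipped, because a matching key may still live beyond them. */
static int find_slot(const char *key)
{
    unsigned i = hash(key) % TABLE_SIZE;
    for (unsigned n = 0; n < TABLE_SIZE; ++n, i = (i + 1) % TABLE_SIZE) {
        if (slots[i] == NULL)
            return -1;                                  /* key is absent */
        if (slots[i] != TOMBSTONE && strcmp(slots[i], key) == 0)
            return (int)i;                              /* found */
    }
    return -1;
}

/* "Soft" delete: overwrite the slot with the tombstone so that later
 * probes keep walking past it instead of stopping early. */
static void erase(const char *key)
{
    int i = find_slot(key);
    if (i >= 0)
        slots[i] = TOMBSTONE;
}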

Separate chaining (each bucket is a pointer to a linked list of values) has the disadvantage that you end up searching a linked list, with all the cache-related issues that entails.

One other advantage of the probing method is that the values all live in a single array. This makes copy-on-write very easy: you copy just that array. If a class invariant assures you that the original is never modified, then taking a snapshot is O(1) and can be done without locking.
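
For instance, a snapshot function under that invariant could be as small as this (hypothetical names; the deep copy is shown only for contrast, and it is still a single memcpy because everything lives in one array):

#include <stdlib.h>
#include <string.h>

struct ProbingTable
{
    int   *slots;      /* the flat open-addressing array */
    size_t capacity;   /* number of slots                */
};

/* O(1), lock-free snapshot: alias the backing array. Valid only under
 * the class invariant that the original is never modified afterwards. */
struct ProbingTable snapshot(const struct ProbingTable *t)
{
    struct ProbingTable s = { t->slots, t->capacity };
    return s;
}

/* If the original may still change, fall back to a real copy; thanks to
 * the single-array layout it is still one allocation and one memcpy. */
struct ProbingTable deep_copy(const struct ProbingTable *t)
{
    struct ProbingTable c = { malloc(t->capacity * sizeof *t->slots), t->capacity };
    memcpy(c.slots, t->slots, t->capacity * sizeof *t->slots);
    return c;
}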


Check out this great answer:

https://stackoverflow.com/questions/23821764/why-do-we-use-linear-probing-in-hash-tables-when-there-is-separate-chaining-link

Quoting here:

I’m surprised that you saw chained hashing to be faster than linear probing – in practice, linear probing is typically significantly faster than chaining. In fact, that’s the main reason it’s used.

Although chained hashing is great in theory and linear probing has some known theoretical weaknesses (such as the need for 5-wise independence in the hash function to guarantee O(1) expected lookups), in practice linear probing is typically significantly faster due to locality of reference. Specifically, it’s faster to access a series of elements in an array than it is to follow pointers in a linked list, so linear probing tends to outperform chained hashing even if it has to investigate more elements.

There are other wins in linear probing. For example, insertions into a linear probing hash table don’t require any new allocations (unless you’re rehashing the table), so in applications like network routers where memory is scarce, it’s nice to know that once the table is set up, the elements can be placed into it with no risk of a malloc fail.
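
To make that no-allocation point concrete, here is a rough sketch (not from the quoted answer) of a linear-probing insert in C, assuming a fixed-capacity table of non-negative integer keys with -1 marking an empty slot:

#define CAPACITY 1024

static int keys[CAPACITY];          /* fixed storage, decided up front */

static void init(void)
{
    for (unsigned i = 0; i < CAPACITY; ++i)
        keys[i] = -1;               /* -1 marks an empty slot */
}

/* Returns 1 on success, 0 if the table is full; never allocates. */
static int insert(int key)
{
    unsigned i = (unsigned)key % CAPACITY;
    for (unsigned n = 0; n < CAPACITY; ++n, i = (i + 1) % CAPACITY) {
        if (keys[i] == key)
            return 1;               /* already present */
        if (keys[i] == -1) {
            keys[i] = key;          /* claim the first free slot we reach */
            return 1;
        }
    }
    return 0;                       /* full: the caller has to rehash */
}

The probe walks consecutive array slots, which is also where the locality-of-reference advantage described above comes from.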

As contradictory as it sounds, I’ll jump in with a biased answer: I actually prefer separate chaining with singly-linked lists and find it easier to get good performance out of them (I’m not saying they’re optimal, just easier for my use cases).

Of course the theoretical optimum is still a hash table with no collisions at all, or a probing technique with minimal clustering. However, separate chaining doesn’t have to deal with clustering problems in the first place.

That said, the data representation I use does not involve a separate memory allocation per node. Here it is in C:

struct Bucket
{
    // Index of the first node in this bucket's chain
    // (a sentinel such as -1 marks an empty bucket).
    int head;
};

struct BucketNode
{
    // Index of the next node in the chain, or the
    // end-of-chain sentinel.
    int next;

    // Index into the client's array of elements.
    int element;
};

struct HashTable
{
    // Array of buckets, pre-allocated in advance.
    struct Bucket* buckets;

    // Array of nodes, pre-allocated assuming the client knows
    // how many nodes he's going to insert in advance. Otherwise
    // realloc using a similar strategy as std::vector in C++.
    struct BucketNode* nodes;

    // Number of bucket heads.
    int num_buckets;

    // Number of nodes inserted so far.
    int num_nodes;
};

The buckets are just 32-bit indices (I don’t even use a struct in practice) and the nodes are just two 32-bit indices. Often I don’t even need the element index, because the nodes are stored in parallel with the array of elements being inserted into the table, which reduces the overhead of the hash table to 32 bits per bucket and 32 bits per element inserted. The real version I use more often looks like this:

struct HashTable
{
    // Array of head indices. The indices point to entries in the 
    // second array below.
    int* buckets;

    // Array of next indices parallel to the elements to insert.
    int* next_indices;

    // Number of bucket heads.
    int num_buckets;
};
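
A rough sketch of how insertion and lookup might look against this layout (the conventions here, like the -1 sentinels, the hash function, and the assumption that elements[i] holds the key for node i, are just for illustration; every entry of buckets starts out as -1):

/* Any decent integer mixer would do; Knuth's multiplicative hash here. */
static int hash_int(int x)
{
    return (int)(((unsigned)x * 2654435761u) & 0x7fffffffu);
}

/* Insert element number `elem` (its key already sits in elements[elem])
 * by pushing it onto the front of its bucket's chain. No allocation:
 * next_indices was sized alongside the element array. */
static void ht_insert(struct HashTable *ht, const int *elements, int elem)
{
    int b = hash_int(elements[elem]) % ht->num_buckets;
    ht->next_indices[elem] = ht->buckets[b];   /* old head becomes our next */
    ht->buckets[b] = elem;                     /* this node is the new head */
}

/* Return the index of an element with the given key, or -1 if absent. */
static int ht_find(const struct HashTable *ht, const int *elements, int key)
{
    int b = hash_int(key) % ht->num_buckets;
    for (int i = ht->buckets[b]; i != -1; i = ht->next_indices[i])
        if (elements[i] == key)
            return i;
    return -1;
}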

Also, if spatial locality degrades, I can easily perform a post-processing pass that builds a new hash table in which each bucket’s nodes are contiguous. It’s a trivial copy function that does a linear pass through the old table, and because of the order in which it traverses the buckets, the copy ends up with all of a bucket’s nodes sitting next to each other.
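
That pass could look roughly like this (same illustrative conventions and helpers as the sketch above; pushing onto the head reverses each chain’s order, but a bucket’s nodes still land in adjacent slots, which is all the cache cares about):

/* Rebuild `dst` (same num_buckets, pre-allocated, buckets filled with -1)
 * from `src`, copying elements so that each bucket's nodes end up in
 * consecutive slots of dst_elems / dst->next_indices. */
static void ht_repack(const struct HashTable *src, const int *src_elems,
                      struct HashTable *dst, int *dst_elems)
{
    int out = 0;
    for (int b = 0; b < src->num_buckets; ++b)
        for (int i = src->buckets[b]; i != -1; i = src->next_indices[i]) {
            dst_elems[out] = src_elems[i];       /* nodes of one bucket are */
            ht_insert(dst, dst_elems, out);      /* written back to back    */
            ++out;
        }
}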

As for probing techniques, they come with the benefit that spatial locality is there from the start, without memory pools or a backing array like the one I use, and they don’t carry the 32-bit overhead per bucket and per node. But then you may have to deal with clustering problems, which can start to accumulate in a vicious way when there are lots of collisions.

I find the very nature of clustering to be a headache that requires a lot of analysis when there are many collisions. The benefit of this solution is that I can often get a decent result the first time around without that kind of deep analysis and testing. Also, with tables that resize implicitly on their own, I’ve run into cases where such designs blew up memory usage in ways that far surpassed the worst case of this basic solution, which costs 32 bits per bucket and 32 bits per node. It’s a solution that avoids becoming too bad even when there are quite a few collisions.

Most of my codebase revolves around data structures that store indices, often in parallel with the array of elements to be inserted. That crunches down the memory size, avoids superfluous deep copies of the elements, and makes it very easy to reason about memory usage. Aside from that, in my case I tend to benefit as much from predictable performance as from optimal performance. An algorithm that is optimal in many common cases but can perform horribly in worst-case scenarios is often less preferable to me than one that performs reasonably well all the time and doesn’t cause frame rates to stutter at unpredictable moments, so I tend to favor those kinds of solutions.
