I use a hash table as the main data structure for a project in C. I use this implementation of SipHash as the hash function. I use chained lists to handle collisions. The index for a given element to be inserted or searched in the hash table is computed as follows:
size_t genindex(const char *key, const size_t ht_len, const uint8_t *k) {
    uint64_t hash = siphash((const uint8_t *)key, strlen(key), k);
    size_t index = (size_t)(hash & (uint64_t)(ht_len - 1));
    return index;
}
This approach creates a lot of collisions: about 60% of my table is unused. The data I store in the hash table is relatively small (about 60 elements). I could use another collision handling method, but that requires a lot more work, and I will consider it only if I fail to find a solution here. I think I should modify the way the index is derived from the hash.
Are there other methods to reduce the hash to fit the hash table size ht_len than using a modulo operation?
Note: the table size is fixed at initialization. It is the prime number closest to the data size, chosen between 2^i and 2^(i+1), the powers of two immediately below and above the data size.
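For context, the size selection is roughly equivalent to this sketch (illustrative, not my exact code; it takes the smallest prime at or above the data size rather than the strictly closest one):

```c
#include <stdbool.h>
#include <stddef.h>

/* Trial-division primality test -- fine for small table sizes. */
static bool is_prime(size_t n) {
    if (n < 2) return false;
    for (size_t d = 2; d * d <= n; d++)
        if (n % d == 0) return false;
    return true;
}

/* Smallest prime >= n, one way to pick a fixed table size. */
size_t next_prime(size_t n) {
    while (!is_prime(n)) n++;
    return n;
}
```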
As others have commented, using a hash table seems overly complicated for the purpose.
However, the problem with the insufficient spreading is probably due to this line:
size_t index = (size_t)(hash & (uint64_t) (ht_len - 1));
which is supposed to calculate “modulo ht_len”. However, using & for modulo only works if ht_len is a power of two.
Doing & with ht_len-1 will generally spread the result over a set of 2^n indices, where “n” is the number of ones in the binary representation of ht_len-1. Or put another way: all indices i not fulfilling i & (ht_len-1) == i will be excluded.
For example, consider the values 0, …, 15. If ht_len is 8, we AND with 8-1=7 and get:
0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7
(i.e. the results are spread over the 8 possible remainders).
If ht_len is 7, we AND with 7-1=6 and get:
0, 0, 2, 2, 4, 4, 6, 6, 0, 0, 2, 2, 4, 4, 6, 6
(i.e. only 4 results are possible).
Doing actual modulo 7 would give:
0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1
So in short, the mentioned line should be changed to calculate modulo correctly:
size_t index = (size_t)(hash % (uint64_t) (ht_len));
You wrote:
I use a hash table as the main data structure for a project in C. I use […] SipHash as the hash function. I use chained lists to handle collisions. […] This approach creates a lot of collisions as about 60% of my table is unused.
In comments, you clarified:
[the data are] the letters of the alphabet + numbers and punctuation and their corresponding morse code.
In that case, your approach is absurdly complex. With only single-byte keys, the identity function would make a superb hash function, and you can afford to structure your “hash table” as a simple array with a separate element for each possible hash value (and some unused ones). The identity is a perfect hash function,* and there is a different bucket for each key, so you will never have any collisions. This kind of hash table is more conventionally called a lookup table.
That is,
Are there other methods to reduce the hash to fit the hash table size ht_len than using a modulo operation?
Yes: choose a hash function that produces a smaller range of values, such as (in your case) the identity function.
Alternatively, you could choose another function that compresses the range of hash values even more while still being much simpler than what you’re using now, but why?
Or if you’re trying to emulate Rube Goldberg, then yes, there are all sorts of functions you could apply to your hash values to remap them into a narrower range.
The modulo operation is conventional for this because it is fast (compared to most alternatives) and applicable to any target size, but it does have the disadvantage that if the table size and hash size are not coprime then some of the bits of hash values might not contribute to selection of the corresponding bucket. You can mitigate this by, say, breaking the hash into pieces of suitable size and xor-ing them all together, maybe combined with other processing, depending on the exact table size you want to support.
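For example, a 64-bit hash can be broken into 16-bit pieces and xor-ed together before the final modulo (a sketch of that mitigation, not code from the question):

```c
#include <stdint.h>
#include <stddef.h>

/* Fold a 64-bit hash into 16 bits by xor-ing its four 16-bit pieces,
 * so every bit of the hash contributes to the bucket selection, then
 * reduce to the table size with a modulo. */
size_t fold_index(uint64_t hash, size_t ht_len) {
    uint64_t folded = (hash         & 0xFFFF)
                    ^ ((hash >> 16) & 0xFFFF)
                    ^ ((hash >> 32) & 0xFFFF)
                    ^ ((hash >> 48) & 0xFFFF);
    return (size_t)(folded % (uint64_t)ht_len);
}
```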
*meaning that it produces a different output for each distinct input in its domain