I use a hash table as the main data structure for a project in C. I use this implementation of SipHash as the hash function. I use chained lists to handle collisions. The index for a given element to be inserted or searched in the hash table is computed as follows:
size_t genindex(const char *key, const size_t ht_len, const uint8_t *k) {
    uint64_t hash = siphash((const uint8_t *)key, strlen(key), k);
    size_t index = (size_t)(hash & (uint64_t)(ht_len - 1));
    return index;
}
This approach creates a lot of collisions: about 60% of my table is unused. The data I store in the hash table is relatively small (about 60 elements). I could use another collision handling method, but that requires a lot more work, and I will consider it only if I fail to find a solution here. I think I should modify the way the index is derived from the hash.
Are there other methods to reduce the hash to fit the hash table size ht_len than using a modulo operation?
Note: the table size is fixed at initialization. It is the prime number closest to the data size, chosen between 2^i and 2^(i+1), the powers of two immediately below and above the data size.
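For context, the size selection is roughly equivalent to this sketch (illustrative, not my exact code; it takes the smallest prime at or above the data size rather than the strictly closest one):

```c
#include <stdbool.h>
#include <stddef.h>

/* Trial-division primality test -- fine for small table sizes. */
static bool is_prime(size_t n) {
    if (n < 2) return false;
    for (size_t d = 2; d * d <= n; d++)
        if (n % d == 0) return false;
    return true;
}

/* Smallest prime >= n, one way to pick a fixed table size. */
size_t next_prime(size_t n) {
    while (!is_prime(n)) n++;
    return n;
}
```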
As others have commented, using a hash table seems overly complicated for the purpose.
However, the problem with the insufficient spreading is probably due to this line:
size_t index = (size_t)(hash & (uint64_t) (ht_len - 1));
which is supposed to calculate “modulo ht_len”. However, using & for modulo only works if ht_len is a power of two.
Doing & with ht_len-1 will generally spread the result over a set of 2^n indices, where “n” is the number of ones in the binary representation of ht_len-1. Or put another way: all indices i not fulfilling i & (ht_len-1) == i will be excluded.
For example, consider the values 0, …, 15. If ht_len is 8, we AND with 8-1=7 and get:
0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7
(i.e. the results are spread over the 8 possible remainders).
If ht_len is 7, we AND with 7-1=6 and get:
0, 0, 2, 2, 4, 4, 6, 6, 0, 0, 2, 2, 4, 4, 6, 6
(i.e. only 4 results are possible).
Doing actual modulo 7 would give:
0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1
So in short, the mentioned line should be changed to calculate modulo correctly:
size_t index = (size_t)(hash % (uint64_t) (ht_len));
You wrote:
I use a hash table as the main data structure for a project in C. I use […] SipHash as the hash function. I use chained lists to handle collisions. […] This approach creates a lot of collisions as about 60% of my table is unused.
In comments, you clarified:
[the data are] the letters of the alphabet + numbers and punctuation and their corresponding morse code.
In that case, your approach is absurdly complex. With only single-byte keys, the identity function would make a superb hash function, and you can afford to structure your “hash table” as a simple array with a separate element for each possible hash value (and some unused ones). The identity is a perfect hash function,* and there is a different bucket for each key, so you will never have any collisions. This kind of hash table is more conventionally called a lookup table.
That is,
Are there other methods to reduce the hash to fit the hash table size ht_len than using a modulo operation?
Yes: choose a hash function that produces a smaller range of values, such as (in your case) the identity function.
Alternatively, you could choose another function that compresses the range of hash values even more while still being much simpler than what you’re using now, but why?
Or if you’re trying to emulate Rube Goldberg, then yes, there are all sorts of functions you could apply to your hash values to remap them into a narrower range.
The modulo operation is conventional for this because it is fast (compared to most alternatives) and applicable to any target size, but it does have the disadvantage that if the table size and hash size are not coprime then some of the bits of hash values might not contribute to selection of the corresponding bucket. You can mitigate this by, say, breaking the hash into pieces of suitable size and xor-ing them all together, maybe combined with other processing, depending on the exact table size you want to support.
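For example, a 64-bit hash can be broken into 16-bit pieces and xor-ed together before the final modulo (a sketch of that mitigation, not code from the question):

```c
#include <stdint.h>
#include <stddef.h>

/* Fold a 64-bit hash into 16 bits by xor-ing its four 16-bit pieces,
 * so every bit of the hash contributes to the bucket selection, then
 * reduce to the table size with a modulo. */
size_t fold_index(uint64_t hash, size_t ht_len) {
    uint64_t folded = (hash         & 0xFFFF)
                    ^ ((hash >> 16) & 0xFFFF)
                    ^ ((hash >> 32) & 0xFFFF)
                    ^ ((hash >> 48) & 0xFFFF);
    return (size_t)(folded % (uint64_t)ht_len);
}
```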
*meaning that it produces a different output for each distinct input in its domain