Problem I’m trying to solve, apologies in advance for the length:
Given a large number of stored records, each with a unique (String) field S. I’d like to be able to find through an indexed query all records where Hash(S) % N == K for any arbitrary N, K (e.g. given a million strings, find all strings where HashCode(s) % 17 = 5. Is there some way of memoizing this so that we can quickly answer any question of this form without doing the % on every value?
The motivation for this is a system of N distributed nodes, where each record has to be assigned to at least one node. The nodes are numbered 0 – (K-1) , and each node has to load up all of the records that match it’s number:
If we have 3 nodes
- Node 0 loads all records where Hash % 3 ==0
- Node 1 loads all records where Hash % 3 ==1
- Node 2 loads all records where Hash % 3 ==2
adding a 4th node, obviously all the assignments have to be recomputed –
- Node 0 loads all records where Hash % 4 ==0
- …
- etc
I’d like to easily find these records through an indexed query without having to compute the mod individually.
The best I’ve been able to come up with so far:
If we take the prime factors of N (p1 * p2 * … )
if N % M == I then p % M == I % p for all of N’s prime factors
e.g. 10 nodes :
N % 10 == 6 then
- N % 2 = 0 == 6 %2
- N % 5 = 1 == 6 %5
so storing an array of the “%” of N for the first “reasonable” number of primes for my data set should be helpful. For example in the above example we store the hash and the primes
HASH PRIMES (array of %2, %3, %5, %7, … ])
16 [0 1 1 2 .. ]
so looking for N%10 == 6 is equivalent to looking for all values where array[1]==1 and array[2] == 1.
However, this breaks at the first prime larger than the highest number I’m storing in the factor table. Is there a better way?
5
How important is it that formula be “Hash % num_machines”? This formula is used for distributed caches, like memcached. It works great until you add/remove nodes. At that point, the advice is to abandon it and use consistent hashing.
1
Unless I’ve misunderstood something, your conjecture is incorrect.
If we take the prime factors of N (p1 * p2 * … )
if N % M == I then p % M == I % p for all of N’s prime factors
How did you come up with that?
Let’s say N = 36
and M = 6
The prime factorization of N = 2 * 2 * 3 * 3
. 36 % 6 = 0
According to your statement, the following should hold:
p % 6 = 0 % p = 0
But clearly, this is not the case: 2 % 6 = 2 != 0