I have a sorted vector vec
of 10 million 21-digit integers. A process generates around 10^12 21-digit integers (unordered) and I need to check which of them are in my vector. What algorithm would be reasonably time efficient for that?
Since the answer will depend on resources available, I can say that I am realistically able to allocate a vector of size 30GB on my computer and that I plan to implement the algorithm in C++.
I can’t use std::set_intersection
(or it wouldn’t be efficient), since the set of 10^12 21-digit integers is unordered and too large.
A bool “index vector” iv
with iv[n]
indicating whether n is in vec
, would make the searches very fast, but such iv
would have 10^21 entries and would not fit into memory.
Right now the only solution I see is dividing the interval [10^21, 10^22-1] into N bins, each of size S, and building an “index vector” biniv
with biniv[n]
pointing to the position of the first entry of vec
larger than n*S. Then for each 21-digit integer X, I would need to only check vec
entries between vec[n*S]
and vec[(n+1)*S]
where an appropriate n can be easily computed. (A binary search would be employed for that check.)
But maybe there are more clever ways to approach this problem?