I have a huge file (~16TB) that contains a list of 8-byte keys mapped to 8-byte values: 2^40 pairs, 16 bytes each.
I would now like to optimize the file so that I can search it efficiently. I have sorted the file and currently perform a binary search on it. A lookup completes in 40 reads, but those reads are scattered widely across the file, especially during the first steps.
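For reference, my current lookup looks roughly like this (a minimal sketch; the file name, the little-endian unsigned 64-bit record encoding, and the use of os.pread are assumptions for illustration):

```
import os
import struct

RECORD_SIZE = 16          # 8-byte key + 8-byte value
NUM_RECORDS = 1 << 40     # 2^40 pairs, ~16TB in total

def lookup(fd, target_key):
    """Binary search over the sorted on-disk array of (key, value) records."""
    lo, hi = 0, NUM_RECORDS
    while lo < hi:
        mid = (lo + hi) // 2
        # One random 16-byte read per step, ~40 steps per lookup.
        rec = os.pread(fd, RECORD_SIZE, mid * RECORD_SIZE)
        key, value = struct.unpack("<QQ", rec)
        if key == target_key:
            return value
        if key < target_key:
            lo = mid + 1
        else:
            hi = mid
    return None

if __name__ == "__main__":
    fd = os.open("pairs.sorted", os.O_RDONLY)   # hypothetical file name
    print(lookup(fd, 0x0123456789ABCDEF))       # hypothetical key
    os.close(fd)
```

Each of those probes is effectively a separate seek, which is what I'd like to avoid.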
I am aware that I could load the binary-search partition that is left after the first 10 steps (2^30 pairs, i.e. 16GB) into memory and continue the search there. However, I only have a negligible amount of memory available, so this isn't an option.
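To be explicit about why that hybrid idea fails, it would look roughly like the sketch below (same assumptions as above): after 10 disk probes the surviving partition still holds about 2^30 records, which is the 16GB I would have to pull into RAM.

```
import os
import struct

RECORD_SIZE = 16
NUM_RECORDS = 1 << 40

def lookup_hybrid(fd, target_key, disk_steps=10):
    # Phase 1: ordinary binary search on disk, stopped after `disk_steps` probes.
    lo, hi = 0, NUM_RECORDS
    for _ in range(disk_steps):
        if lo >= hi:
            return None
        mid = (lo + hi) // 2
        key, value = struct.unpack("<QQ", os.pread(fd, RECORD_SIZE, mid * RECORD_SIZE))
        if key == target_key:
            return value
        if key < target_key:
            lo = mid + 1
        else:
            hi = mid

    # Phase 2: read the surviving partition [lo, hi) sequentially -- here still
    # about 2^30 records, i.e. ~16GB -- in chunks, since a single read call is
    # capped (about 2 GiB per call on Linux).
    want = (hi - lo) * RECORD_SIZE
    offset, parts = lo * RECORD_SIZE, []
    while want > 0:
        chunk = os.pread(fd, min(want, 1 << 30), offset)
        if not chunk:
            break
        parts.append(chunk)
        offset += len(chunk)
        want -= len(chunk)
    blob = b"".join(parts)

    # Finish the binary search in memory, with no further disk seeks.
    in_lo, in_hi = 0, len(blob) // RECORD_SIZE
    while in_lo < in_hi:
        mid = (in_lo + in_hi) // 2
        key, value = struct.unpack_from("<QQ", blob, mid * RECORD_SIZE)
        if key == target_key:
            return value
        if key < target_key:
            in_lo = mid + 1
        else:
            in_hi = mid
    return None
```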
Is there a way to arrange the data on disk so that the accesses required to search it are close together from the start? That would allow me to load an entire "range" of the file into memory at once, reducing both the total number of read calls and the number of random accesses.