I have millions of images that I want to read as fast as possible, and I need to be able to read them in random order.
I stored them in an HDF5 file, but I found that reading is much slower when the images are accessed in random order instead of in the order returned by the file, as the code and profile timeline below show:
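from random import shuffle
from time import sleep

import h5py
import numpy as np

with h5py.File("/slowdata/caid_2024/GenImage_compressed.h5", "r") as hf:
    keys = list(hf.keys())

    # Read the first 16 datasets in the order returned by hf.keys()
    sleep(0.1)  # short pause so each phase stands out in the profile
    for i, key in enumerate(keys):
        if i == 16:
            break
        np.array(hf[key])  # force the dataset to be read into memory

    # Random order now
    shuffle(keys)
    sleep(0.1)
    for i, key in enumerate(keys):
        if i == 16:
            break
        np.array(hf[key])
    sleep(0.1)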