I’m creating an HDF5 file that contains two datasets (images and text). The HDF5 file is over an order of magnitude larger than the combined size of the image and text files, and I can’t figure out why. Is this due to an issue with the way I’m storing the images or the text? The images are stored as raw bytes and the text is stored in the variable-length UTF-8 string format recommended by h5py.
I have about 80GB of images and about 40MB of text. The HDF5 file is about 1.5TB.
I have restricted the maximum size of each image to 3MB (which every image in the dataset satisfies). Here is the code I use to build the file:
import h5py
import numpy as np

# text_dict maps each image key to its text string (built earlier)
hf = h5py.File('/home/dev/ssd/L_Dataset.h5', 'a')

dt = h5py.string_dtype(encoding='utf-8')
total_size = len(text_dict)

text_dataset = hf.create_dataset('text', (total_size,), dtype=dt)
# fixed-width opaque dtype: 3,000,000 bytes per element
image_dataset = hf.create_dataset('images', (total_size,), dtype=h5py.opaque_dtype('V3000000'))

for i, key in enumerate(text_dict):
    # read the image file as raw bytes; the with block closes it automatically
    with open('/home/dev/Datasets/images/' + key + '.jpg', 'rb') as img_f:
        binary_data = img_f.read()
    binary_data_np = np.void(binary_data)
    text_dataset[i] = text_dict[key]
    image_dataset[i] = binary_data_np
    if i % 100 == 0:
        print("Processed {} images.".format(i))

hf.close()
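For reference, here is a minimal sketch of the size check I ran against the finished file (assuming the same path as above; itemsize is the fixed byte width that each element of an HDF5 dataset reserves on disk):

import os
import h5py

# open the file read-only and inspect per-element storage
with h5py.File('/home/dev/ssd/L_Dataset.h5', 'r') as hf:
    images = hf['images']
    per_element = images.dtype.itemsize          # 3,000,000 for 'V3000000'
    total_reserved = per_element * images.shape[0]
    print("Bytes reserved per image:", per_element)
    print("Bytes reserved for all images:", total_reserved)

# compare against the actual size of the file on disk
print("File size on disk:", os.path.getsize('/home/dev/ssd/L_Dataset.h5'))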