I have a large number of image files (~220,000) stored on a fast, local SSD. Using Python and the tifffile library, I read the images in as numpy arrays, combine them into a single array, and save that to disk. Reading the combined array back is much faster than reading the files individually.
I’m trying to understand why writes (upwards of 30 MB/s) are happening while the data is being read. I expected all the reads to happen first, then the combined array to be created, then a single write at the end. There’s clearly more than enough RAM available while this is happening (the entire dataset fits in RAM).
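For reference, the read/write rates shown in the plot can be reproduced with a loop like this (an illustrative psutil sketch, not the exact tool that produced the plot):

import time
import psutil

# Print system-wide disk read/write rates once per second.
prev = psutil.disk_io_counters()
for _ in range(60):
    time.sleep(1)
    cur = psutil.disk_io_counters()
    print(f"read: {(cur.read_bytes - prev.read_bytes) / 1e6:6.1f} MB/s  "
          f"write: {(cur.write_bytes - prev.write_bytes) / 1e6:6.1f} MB/s")
    prev = cur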
I assume some of the data is still sitting in the OS page cache (cached file data, which isn’t reported as used memory), which would explain why the first ~15 GB is loaded without incurring any disk reads (reading starts near the left side of the plot).
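One way to sanity-check that assumption is to watch the kernel’s cache size while the load runs (a Linux-only sketch that parses /proc/meminfo; the field name and kB units come from the Linux procfs format):

def cached_gb():
    # "Cached:" in /proc/meminfo is the page cache size, reported in kB.
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("Cached:"):
                return int(line.split()[1]) / 1e6
    return 0.0

print(f"Page cache: {cached_gb():.1f} GB")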
A basic code example is something like this:
import os
import numpy as np
from tifffile import imread
from functools import partial
from tqdm.contrib.concurrent import process_map

# Directory containing this script; np.save writes relative to it.
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))

def get_image(dir, ID):
    # Load a single TIFF as a numpy array
    return imread(os.path.join(dir, ID + ".tif"))

def generate_numpy_file(IDs, folder, fname="train"):
    _read = partial(get_image, folder)
    print("Reading Data")
    # Read the images in parallel across 20 worker processes
    images = process_map(_read, IDs, max_workers=20, chunksize=1024)
    # Stack the list of per-image arrays into one combined array
    images = np.array(images)
    print("Writing Data")
    np.save(os.path.join(SCRIPT_DIR, "Datasets", fname), images)
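Reading the combined file back afterwards is then a single call (np.save appends the ".npy" extension automatically), which turns ~220,000 small per-file reads into one large sequential read:

data = np.load(os.path.join(SCRIPT_DIR, "Datasets", "train.npy"))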