I have a function process_file that takes a file name as input, processes the data, and saves the output.
I use multiprocessing.Pool.map() to speed this up by parallelising across cores. Typically I use processes=os.cpu_count() to get one process per core. So far, so good.
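For context, a minimal sketch of the current setup (the glob pattern and the body of process_file are placeholders, not my real code):

```python
import glob
import multiprocessing
import os


def process_file(filename):
    # Placeholder: load the (possibly large, compressed) file,
    # process it, and save the result somewhere.
    pass


if __name__ == "__main__":
    filenames = glob.glob("data/*.gz")  # hypothetical input set
    with multiprocessing.Pool(processes=os.cpu_count()) as pool:
        pool.map(process_file, filenames)
```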
However, some of these files hold very large compressed data, which means that loading os.cpu_count() of them at once sometimes takes more memory than the machine has, at which point either the process crashes or the machine freezes.
For any particular set of files I can guess a suitable value for the processes parameter of multiprocessing.Pool, but this requires manual intervention, which I would like to avoid. And when there are a few big files and a lot of small ones (as there often are), picking a single conservative value slows the processing down significantly.
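For illustration, this is roughly the kind of hand-rolled heuristic I end up writing. pick_process_count and memory_multiplier are made-up names for this sketch, and the multiplier is a per-dataset guess at how much RAM a worker needs relative to the compressed file size; psutil is only used to read available memory:

```python
import multiprocessing
import os

import psutil  # third-party, used only to query available memory


def pick_process_count(filenames, memory_multiplier=10):
    # Rough guess: peak memory per worker is about
    # memory_multiplier * size of the largest compressed file.
    largest = max(os.path.getsize(f) for f in filenames)
    per_worker = largest * memory_multiplier
    available = psutil.virtual_memory().available
    # Cap the pool size by memory, but never exceed one process per core.
    fit_in_memory = max(1, available // per_worker)
    return min(os.cpu_count(), fit_in_memory)


# Usage with the same pool as before:
# with multiprocessing.Pool(processes=pick_process_count(filenames)) as pool:
#     pool.map(process_file, filenames)
```

Even with something like this, the multiplier has to be re-tuned whenever the data changes, and a single fixed pool size wastes cores once the big files are done.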
Are there any libraries that deal with this in a smart way, i.e. that launch enough processes to use most, but not all, of the available memory?