I am trying to parallelize the execution of a Python code that solves a numerical problem. It is a relatively complicated code that makes heavy use of numpy. I want to solve the numerical problem for a large set of parameters, which should be easy to parallelize: just let different processes solve the problem for different regions of the parameter space.
Here is some (pseudo-)code illustrating how I attempted the parallelization over a 2D parameter space:
import itertools
from multiprocessing import Pool

import numpy as np

from MyNumericalProblemSolver import MyNumericalProblemSolver  # my own solver class

chunksize = 100
n_values_per_param = 50
n_CPUs = 8

# 2D parameter grid: 50 x 50 = 2500 parameter combinations
p1_values = np.logspace(1, 2, n_values_per_param)
p2_values = np.logspace(3, 5, n_values_per_param)
param_iterator = itertools.product(p1_values, p2_values)

solver = MyNumericalProblemSolver()

def wrapper(params):
    # each forked worker process uses its own copy of the solver
    p1, p2 = params
    solver.set_parameters(p1=p1, p2=p2)
    solver.solve()

if __name__ == "__main__":
    with Pool(n_CPUs) as p:
        p.map(wrapper, param_iterator, chunksize=chunksize)
Execution of the above code typically takes several minutes. My laptop has 8 CPUs (the output of multiprocessing.cpu_count()), so I expect a speedup of roughly a factor of 8. However, what I see is a speedup of only about a factor of 1.5.
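For reference, this is the check I mean (run in the same conda env; the printed value is what I get on my laptop):

import multiprocessing
print(multiprocessing.cpu_count())  # prints 8 on my laptop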
The interesting thing is the CPU usage:

- without multiprocessing: 1 CPU at 100%
- n_CPUs = 1: 1 CPU at 100%
- n_CPUs = 2: 2 CPUs, each at 100%
- n_CPUs = 4: 4 CPUs, each at 50%
- n_CPUs = 8: 8 CPUs, each at 25%
So it looks like the total usage is capped at 200%, and I don't understand why. I confirmed that when I start the program without multiprocessing in 8 different terminals, I get 8 CPUs each at 100%. Things I have tried so far:
- Changing the chunksize doesn't help; small and large chunk sizes give the same performance.
- Several similar questions mention that numpy (and other libraries) can mess with the core affinity (e.g. Why does multiprocessing use only a single core after I import numpy?, multiprocessing not achieving full CPU usage on dual-processor windows machine, Multiprocessing.Pool makes Numpy matrix multiplication slower). So I tried os.system(f"taskset -p 0xff {os.getpid()}") to reset the CPU affinity after importing numpy, but it doesn't change anything. If I call os.sched_getaffinity(0) before and after importing numpy, I get the same output: {0, 1, 2, 3, 4, 5, 6, 7}. (See the sketch after this list.)
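Concretely, the affinity experiment looks like this (a minimal sketch; in my actual code the taskset call sits after the numpy import, as described above):

import os

print(os.sched_getaffinity(0))  # before importing numpy: {0, 1, 2, 3, 4, 5, 6, 7}
import numpy as np
print(os.sched_getaffinity(0))  # after importing numpy: same set

# try to (re)pin this process to all 8 cores anyway
os.system(f"taskset -p 0xff {os.getpid()}")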
At this point I am running out of ideas for how to tackle this issue. I am running all this in a conda virtual env with Python 3.11 on Ubuntu 22.04.