I noticed that `numpy.linalg.solve` has vastly different performance since v2.0.0, on both Linux and Windows (I haven't tested it on macOS). Basically, for reasonably sized matrices (up to about 1800×1800) the new version is much faster (up to 10x faster on my Linux machine; I got even larger speedups on Windows), then for larger ones (up to about 8800×8800) the 1.26.4 version is faster, and for even larger ones the 2.0.0 version becomes faster again.
I wonder what happened? Is that due to an update of the underlying library?
If there were some modifications that impacted performance, I would have expected something to appear in the release notes, but there wasn’t anything of the sort.
I ran a small benchmark using `perfplot`:
```python
import matplotlib.pyplot as plt
import numpy as np
import perfplot


def main():
    plot = False
    if not plot:
        out = perfplot.bench(
            setup=lambda n: np.random.rand(n + 1, n),
            kernels=[lambda a: np.linalg.solve(a[1:], a[0])],
            n_range=[2**k for k in range(15)],
        )
        np.savetxt("n.txt", out.n_range)
        np.savetxt(f"timings_{np.__version__.replace('.', '_')}.txt", out.timings_s)
    else:
        n = np.loadtxt("n.txt")
        timings_1_26_4 = np.loadtxt("timings_1_26_4.txt")
        timings_2_0_0 = np.loadtxt("timings_2_0_0.txt")
        plt.plot(n, timings_1_26_4, label="1.26.4")
        plt.plot(n, timings_2_0_0, label="2.0.0")
        plt.legend()
        plt.show()
        plt.plot(n, timings_1_26_4 / timings_2_0_0, label="1.26.4 / 2.0.0")
        plt.plot([0, 2**14], [1.0, 1.0])
        plt.legend()
        plt.show()


if __name__ == "__main__":
    main()
```
I ran it once with `numpy==1.26.4` and once with `numpy==2.0.0`, and then a third time with `plot = True` to just plot.
Here is the overall result (the x-axis is the number of unknowns of the linear system, the y-axis is the time it takes to solve, in seconds):

Performance comparison of `numpy.linalg.solve` between v1.26.4 and v2.0.0.
Here is a zoom up to n=200:
Zoom on the beginning of the plot.
And here is the ratio (time for 1.26.4) / (time for 2.0.0); the horizontal line is y=1, to show which version is faster:
Ratio of the timings for versions 1.26.4 and 2.0.0. When the curve is above the horizontal line, it means that v2.0.0 is faster.
In the last image, one can see a peak at about n=35, where v2.0.0 is roughly 10x faster.
I can reproduce the problem on my machine with an AMD Zen2 CPU (Ryzen 7 5700U).
With Numpy 1.26.4, low-level profiling with `perf` shows that most of the time is spent in `libopenblas64_p-r0-0cf96a72.3.23.dev.so`, and more specifically in two functions: `dgemm_kernel_ZEN` and `inner_advanced_thread`. Note that `inner_advanced_thread` seems to just run a loop reading the same variable over and over, so it is certainly a kind of spinning wait (it likely waits for the other threads to complete).
With Numpy 2.1.1, `perf` shows that most of the time is spent in `libscipy_openblas64_-ff651d7f.so`, and more specifically in three functions: `blas_thread_server`, `dgemm_kernel_HASWELL` and `dcopy_k_HASWELL`. Note that `blas_thread_server` seems to just run a loop of `rdtsc` and `nop` instructions, so it certainly does not compute anything but just waits for threads to be ready, similar to `inner_advanced_thread`. Also note that the copy uses SSE instructions while my CPU supports AVX2, so it is likely sub-optimal.
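One way to check from Python which SIMD extensions NumPy detects on the CPU, without `perf`, is `np.show_runtime()` (available since NumPy 1.24; it also reports the loaded BLAS when threadpoolctl is installed), together with `np.show_config()` for the build-time BLAS configuration. A minimal sketch (the exact output format varies between NumPy versions):

```python
import numpy as np

# Print runtime information: the SIMD extensions detected on this CPU
# (baseline / found / not found) and, when threadpoolctl is installed,
# the BLAS implementation and thread count actually used.
np.show_runtime()

# Build-time configuration: which BLAS/LAPACK NumPy was compiled against.
np.show_config()
```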
We can see two important points:
- The version of OpenBLAS changed from Numpy 1.26 to Numpy 2.1;
- Numpy 1.26 calls a `dgemm` BLAS kernel optimized for my Zen CPU, while Numpy 2.1 calls one optimized for Intel Haswell CPUs, which is likely less optimized for my Zen CPU. It also performs an expensive copy.
While the former point is expected when upgrading Numpy to a new major version (though it can impact performance), the latter clearly looks like a packaging bug: the newly linked BLAS library is not as well optimized for the target platform as the old one.
We can also see that the new OpenBLAS library mentions Scipy while the previous one does not. I think the problem is that Numpy now uses the BLAS build shipped for Scipy, which is not as good as the version previously packaged directly in Numpy (more specifically, the Scipy build looks like a generic one, and the performance of BLAS libraries is highly dependent on the target CPU features used).
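If threadpoolctl is installed (it is an optional, separate package), it can confirm which OpenBLAS shared object each NumPy version actually loads at runtime. A minimal sketch:

```python
import numpy as np
from threadpoolctl import threadpool_info

# Trigger a BLAS call so the library is loaded before inspecting it.
a = np.random.rand(64, 64)
np.linalg.solve(a, a[0])

# Each entry reports the shared-object path, the internal API (e.g. "openblas"),
# its version and the number of threads it will use.
for info in threadpool_info():
    print(info.get("filepath"), info.get("internal_api"),
          info.get("version"), info.get("num_threads"))
```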
Update
With Numpy 2.0.0, `dgemm_kernel_ZEN` is called and the results seem close to the ones of Numpy 2.1.1. Thus, the problem is finally not the architecture specialization. ~~However, I still see a quite expensive copy (`dcopy_k_ZEN`). Other minor functions still take more time than in version 1.26.4 (similar to version 2.1.1).~~ Cores are clearly less well used in versions 2.0.0 and 2.1.1 than in 1.26.4, possibly because of inefficient OpenMP scheduling (only set for the Scipy build). In fact, the computation is nearly sequential with Numpy 2.0.0! The target library is still the Scipy one, and the use of the Scipy library remains the main difference between versions 1.26.4 and 2.0.0.
If we use only one core, versions 1.26.4 and 2.0.0 are equally fast. This can be done with `OMP_NUM_THREADS=1`. It confirms the problem comes from the poor scalability of the Scipy build. That being said, tuning `OMP_SCHEDULE` has no visible effect (this typically happens when developers set the scheduling in the code, or the scheduling may simply not be the source of the scalability issue).
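The same single-thread experiment can also be done from Python with threadpoolctl instead of the environment variable, which makes it easy to time both configurations in one process. A minimal sketch (the matrix size here is arbitrary):

```python
import timeit

import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.rand(4000, 4000)
b = np.random.rand(4000)

# Default threading (whatever OpenBLAS picks for this machine).
t_multi = timeit.timeit(lambda: np.linalg.solve(a, b), number=5)

# Force a single BLAS/OpenMP thread, equivalent in effect to OMP_NUM_THREADS=1 here.
with threadpool_limits(limits=1):
    t_single = timeit.timeit(lambda: np.linalg.solve(a, b), number=5)

print(f"multi-threaded: {t_multi:.3f}s, single-threaded: {t_single:.3f}s")
```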
Notes
Note that there is an open issue for Intel CPUs with a big-little architecture (i.e. recent Intel CPUs). A similar issue might happen on other big-little CPUs (e.g., AFAIK, some Apple M1/M2/M3 CPUs).
Note you can also track the loaded libraries (including BLAS) with `strace` on Linux. More specifically, you can list the (sorted) dynamically opened libraries with the following command:

```sh
strace -e trace=open,openat python your_script.py 2>&1 | grep -E '"[^"]*\.so"' -o | sort -u
```
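A rough Python-only alternative (Linux-specific, and only a sketch) is to read /proc/self/maps after a BLAS call has been made, which lists the shared objects already mapped into the process:

```python
import numpy as np

# Force the BLAS/LAPACK library to be loaded before inspecting the process.
a = np.random.rand(64, 64)
np.linalg.solve(a, a[0])

# /proc/self/maps lists every file mapped into the current process (Linux only);
# keep the unique shared-object paths that look BLAS/LAPACK related.
with open("/proc/self/maps") as f:
    paths = {line.split()[-1] for line in f if ".so" in line}
blas_like = sorted(p for p in paths if "blas" in p.lower() or "lapack" in p.lower())
print("\n".join(blas_like))
```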