I noticed that `numpy.linalg.solve` has vastly different performance since v2.0.0, on both Linux and Windows (I haven't tested it on macOS). Basically, for reasonably sized matrices (up to about 1800×1800) the new version is much faster (up to 10x faster on my Linux machine; I got even larger speedups on Windows), then for larger ones (up to about 8800×8800) the 1.26.4 version is faster, and for even larger ones the 2.0.0 version becomes faster again.
I wonder what happened? Is that due to an update of the underlying library?
If there were some modifications that impacted performance, I would have expected something to appear in the release notes, but there wasn’t anything of the sort.
I ran a small benchmark using `perfplot`:
```python
import matplotlib.pyplot as plt
import numpy as np
import perfplot


def main():
    plot = False
    if not plot:
        out = perfplot.bench(
            setup=lambda n: np.random.rand(n + 1, n),
            kernels=[lambda a: np.linalg.solve(a[1:], a[0])],
            n_range=[2**k for k in range(15)],
        )
        np.savetxt("n.txt", out.n_range)
        np.savetxt(f"timings_{np.__version__.replace('.', '_')}.txt", out.timings_s)
    else:
        n = np.loadtxt("n.txt")
        timings_1_26_4 = np.loadtxt("timings_1_26_4.txt")
        timings_2_0_0 = np.loadtxt("timings_2_0_0.txt")
        plt.plot(n, timings_1_26_4, label="1.26.4")
        plt.plot(n, timings_2_0_0, label="2.0.0")
        plt.legend()
        plt.show()
        plt.plot(n, timings_1_26_4 / timings_2_0_0, label="1.26.4 / 2.0.0")
        plt.plot([0, 2**14], [1.0, 1.0])
        plt.legend()
        plt.show()


if __name__ == "__main__":
    main()
```
I ran it once with `numpy==1.26.4` and once with `numpy==2.0.0`, and then a third time with `plot = True` to just plot.
Here is the overall result (the x-axis is the number of unknowns of the linear system, the y-axis is the time it takes to solve, in seconds):

Performance comparison of `numpy.linalg.solve` between v1.26.4 and v2.0.0.
Here is a zoom up to n=200:
Zoom on the beginning of the plot.
And here is the ratio (time for 1.26.4) / (time for 2.0.0); the horizontal line is y=1, to show which version is faster:
Ratio of the timings for versions 1.26.4 and 2.0.0. When the curve is above the horizontal line, it means that v2.0.0 is faster.
In the last image, one can see a peak at about n=35, where v2.0.0 is roughly 10x faster.
I can reproduce the problem on my machine with an AMD Zen2 CPU (Ryzen 7 5700U).
With Numpy 1.26.4, low-level profiling with `perf` shows that most of the time is spent in `libopenblas64_p-r0-0cf96a72.3.23.dev.so`, and more specifically in two functions: `dgemm_kernel_ZEN` and `inner_advanced_thread`. Note that `inner_advanced_thread` seems to just run a loop reading the same variable over and over, so it is certainly a kind of spinning wait (it likely waits for the other threads to complete).
With Numpy 2.1.1, `perf` shows that most of the time is spent in `libscipy_openblas64_-ff651d7f.so`, and more specifically in three functions: `blas_thread_server`, `dgemm_kernel_HASWELL` and `dcopy_k_HASWELL`. Note that `blas_thread_server` seems to just run a loop of `rdtsc` and `nop` instructions, so it certainly does not compute anything but just waits for threads to be ready, similar to `inner_advanced_thread`. Also note that the copy uses SSE instructions while my CPU supports AVX2, so it is likely sub-optimal.
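One way to check from Python which SIMD extensions NumPy detects on the CPU, without `perf`, is `np.show_runtime()` (available since NumPy 1.24; it also reports the loaded BLAS when threadpoolctl is installed), together with `np.show_config()` for the build-time BLAS configuration. A minimal sketch (the exact output format varies between NumPy versions):

```python
import numpy as np

# Print runtime information: the SIMD extensions detected on this CPU
# (baseline / found / not found) and, when threadpoolctl is installed,
# the BLAS implementation and thread count actually used.
np.show_runtime()

# Build-time configuration: which BLAS/LAPACK NumPy was compiled against.
np.show_config()
```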
We can see two important points:
- The version of OpenBLAS changed from Numpy 1.26 to Numpy 2.1;
- Numpy 1.26 calls a `dgemm` BLAS kernel optimized for my Zen CPU, while Numpy 2.1 calls one optimized for Intel Haswell CPUs, which is likely less optimized for my Zen CPU. It also performs an expensive copy.
While the former point is expected when upgrading Numpy to a new major version (though it can impact performance), the latter clearly looks like a packaging bug: the newly linked BLAS library is not as well optimized for the target platform as the old one.
We can also see that the new OpenBLAS library mentions Scipy while the previous one does not. I think the problem is that Numpy now uses the BLAS build shipped for Scipy, which is not as good as the version previously packaged directly in Numpy (more specifically, the Scipy build looks like a generic one, and the performance of BLAS libraries is highly dependent on the target CPU features used).
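If threadpoolctl is installed (it is an optional, separate package), it can confirm which OpenBLAS shared object each NumPy version actually loads at runtime. A minimal sketch:

```python
import numpy as np
from threadpoolctl import threadpool_info

# Trigger a BLAS call so the library is loaded before inspecting it.
a = np.random.rand(64, 64)
np.linalg.solve(a, a[0])

# Each entry reports the shared-object path, the internal API (e.g. "openblas"),
# its version and the number of threads it will use.
for info in threadpool_info():
    print(info.get("filepath"), info.get("internal_api"),
          info.get("version"), info.get("num_threads"))
```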
Update
With Numpy 2.0.0, `dgemm_kernel_ZEN` is called and the results seem close to the ones of Numpy 2.1.1. Thus, the problem is finally not the architecture specialization. ~~However, I still see a quite expensive copy (`dcopy_k_ZEN`). Other minor functions still take more time than in version 1.26.4 (similar to version 2.1.1).~~ Cores are clearly less well used in versions 2.0.0 and 2.1.1 than in 1.26.4, possibly because of inefficient OpenMP scheduling (only set for the Scipy build). In fact, the computation is nearly sequential with Numpy 2.0.0! The target library is still the Scipy one, and the use of the Scipy library remains the main difference between versions 1.26.4 and 2.0.0.
If we use only one core, versions 1.26.4 and 2.0.0 are equally fast. This can be done with `OMP_NUM_THREADS=1`. It confirms the problem comes from the poor scalability of the Scipy build. That being said, tuning `OMP_SCHEDULE` has no visible effect (this typically happens when developers set the scheduling in the code, or the scheduling may simply not be the source of the scalability issue).
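The same single-thread experiment can also be done from Python with threadpoolctl instead of the environment variable, which makes it easy to time both configurations in one process. A minimal sketch (the matrix size here is arbitrary):

```python
import timeit

import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.rand(4000, 4000)
b = np.random.rand(4000)

# Default threading (whatever OpenBLAS picks for this machine).
t_multi = timeit.timeit(lambda: np.linalg.solve(a, b), number=5)

# Force a single BLAS/OpenMP thread, equivalent in effect to OMP_NUM_THREADS=1 here.
with threadpool_limits(limits=1):
    t_single = timeit.timeit(lambda: np.linalg.solve(a, b), number=5)

print(f"multi-threaded: {t_multi:.3f}s, single-threaded: {t_single:.3f}s")
```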
Notes
Note that there is an open issue for Intel CPUs with a big-little architecture (i.e. recent Intel CPUs). A similar issue might happen on other big-little CPUs (e.g., AFAIK, some Apple M1/M2/M3 CPUs).
Note you can also track the loaded libraries (including BLAS) with `strace` on Linux. More specifically, you can list the (sorted) dynamically opened libraries with the following command:

```sh
strace -e trace=open,openat python your_script.py 2>&1 | grep -E '"[^"]*\.so"' -o | sort -u
```
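A rough Python-only alternative (Linux-specific, and only a sketch) is to read /proc/self/maps after a BLAS call has been made, which lists the shared objects already mapped into the process:

```python
import numpy as np

# Force the BLAS/LAPACK library to be loaded before inspecting the process.
a = np.random.rand(64, 64)
np.linalg.solve(a, a[0])

# /proc/self/maps lists every file mapped into the current process (Linux only);
# keep the unique shared-object paths that look BLAS/LAPACK related.
with open("/proc/self/maps") as f:
    paths = {line.split()[-1] for line in f if ".so" in line}
blas_like = sorted(p for p in paths if "blas" in p.lower() or "lapack" in p.lower())
print("\n".join(blas_like))
```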