I want to speed up R code running on WSL2 (Windows Subsystem for Linux 2) by using NVBLAS. My WSL2 environment is Ubuntu 24.04 LTS with CUDA 12.1, and the GPU is an NVIDIA RTX 3060 (12 GB of dedicated memory).
I followed the steps in the MWE section below to compare run times with and without NVBLAS, following a blog post in Japanese, NVBLASを使って「R」の並列演算処理を高速化|NTTPCのGPU+|NVIDIA Eliteパートナー ("Speeding up parallel computation in R using NVBLAS", NTTPC GPU+, NVIDIA Elite Partner). However, the run with NVBLAS was slower than the run without it: 110.12 s versus 86.70 s of user CPU time, and 35.02 s versus 27.29 s elapsed. Could anyone provide insights on what might be missing or misconfigured, such as an incorrect path to NVBLAS, OpenBLAS, or both? Or should I use cuBLAS rather than NVBLAS?
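For reading the timings: the 86.70 and 110.12 figures are the `user` fields of the summary line that `time` prints at the end of each run (see the results below), that is, CPU seconds summed over all threads, whereas the `elapsed` field is wall-clock time. A minimal sketch that pulls both out of such a line, assuming the default GNU `time` output format shown in this post; `parse_time_line` is a hypothetical helper name, not part of any tool:

```shell
# parse_time_line (hypothetical helper): split a GNU time summary line into
# user, system and elapsed seconds.
parse_time_line() {
  # $1: a line such as "86.70user 133.15system 0:27.29elapsed 805%CPU ..."
  echo "$1" | awk '{
    user = $1; sub(/user$/, "", user)
    sys  = $2; sub(/system$/, "", sys)
    el   = $3; sub(/elapsed$/, "", el)
    # elapsed is printed as m:ss.ss or h:mm:ss; fold it into seconds
    n = split(el, parts, ":")
    secs = 0
    for (i = 1; i <= n; i++) secs = secs * 60 + parts[i]
    printf "user=%.2f system=%.2f elapsed=%.2f\n", user, sys, secs
  }'
}

# Summary lines copied from the two runs in this post
parse_time_line "86.70user 133.15system 0:27.29elapsed 805%CPU"
# user=86.70 system=133.15 elapsed=27.29
parse_time_line "110.12user 228.86system 0:35.02elapsed 967%CPU"
# user=110.12 system=228.86 elapsed=35.02
```

Because OpenBLAS runs multi-threaded here (805%CPU and 967%CPU), the `user` value can be several times larger than the elapsed time, so the elapsed values are the ones to compare for wall-clock speed.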
MWE
- Run the following bash command to find the location of libblas.so:
    find /usr -name libblas.so
    # /usr/lib/x86_64-linux-gnu/libblas.so
    # /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so  # <- I used this one for the `/etc/nvblas.conf` as shown below
- Create `/etc/nvblas.conf` with the contents shown below:
    NVBLAS_LOGFILE nvblas.log
    NVBLAS_TRACE_LOG_ENABLED
    NVBLAS_CPU_BLAS_LIB /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so
    NVBLAS_GPU_LIST ALL
    NVBLAS_TILE_DIM 2048
    NVBLAS_AUTOPIN_MEM_ENABLED
- Run the following bash command to find the location of NVBLAS:
    find /usr -name libnvblas.so
    # /usr/local/cuda-12.1/targets/x86_64-linux/lib/libnvblas.so
- Execute the following bash commands:
    # Move to the tmp directory
    cd /tmp
    # Download a benchmark script
    wget http://r.research.att.com/benchmarks/R-benchmark-25.R
    # Run the benchmark test without NVBLAS
    cat R-benchmark-25.R | time R --slave
    # Run the benchmark test with NVBLAS
    cat R-benchmark-25.R | LD_PRELOAD=/usr/local/cuda-12.1/targets/x86_64-linux/lib/libnvblas.so NVBLAS_CONFIG_FILE=/etc/nvblas.conf time R --slave
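Since one suspected cause is a wrong library path, the configuration can be sanity-checked before running the benchmark. The sketch below only confirms that the file named by `NVBLAS_CPU_BLAS_LIB` exists; it does not check that NVBLAS can actually load it. Both function names are hypothetical helpers written for this illustration:

```shell
# cpu_blas_from_conf (hypothetical helper): print the path that follows the
# NVBLAS_CPU_BLAS_LIB keyword in an nvblas.conf-style file.
cpu_blas_from_conf() {
  awk '$1 == "NVBLAS_CPU_BLAS_LIB" { print $2; exit }' "$1"
}

# check_nvblas_conf (hypothetical helper): report whether the CPU BLAS
# library named in the config file is present on disk.
check_nvblas_conf() {
  local conf="$1"
  local lib
  lib="$(cpu_blas_from_conf "$conf")"
  if [ -z "$lib" ]; then
    echo "missing NVBLAS_CPU_BLAS_LIB"
    return 1
  fi
  if [ ! -e "$lib" ]; then
    echo "not found: $lib"
    return 1
  fi
  echo "ok: $lib"
}
```

With the setup in this post it would be called as `check_nvblas_conf /etc/nvblas.conf`.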
Result of the benchmark test without NVBLAS
$ cat R-benchmark-25.R | time R --slave
Loading required package: Matrix
Loading required package: SuppDists
Warning messages:
1: In remove("a", "b") : object 'a' not found
2: In remove("a", "b") : object 'b' not found
R Benchmark 2.5
===============
Number of times each test is run__________________________: 3
I. Matrix calculation
---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec): 0.562333333333333
2400x2400 normal distributed random matrix ^1000____ (sec): 0.172666666666667
Sorting of 7,000,000 random values__________________ (sec): 0.53
2800x2800 cross-product matrix (b = a' * a)_________ (sec): 0.0686666666666665
Linear regr. over a 3000x3000 matrix (c = a b')___ (sec): 0.124666666666667
--------------------------------------------
Trimmed geom. mean (2 extremes eliminated): 0.225118700970229
II. Matrix functions
--------------------
FFT over 2,400,000 random values____________________ (sec): 0.322666666666667
Eigenvalues of a 640x640 random matrix______________ (sec): 0.715666666666667
Determinant of a 2500x2500 random matrix____________ (sec): -0.0256666666666672
Cholesky decomposition of a 3000x3000 matrix________ (sec): 0.0886666666666673
Inverse of a 1600x1600 random matrix________________ (sec): 0.160333333333333
--------------------------------------------
Trimmed geom. mean (2 extremes eliminated): 0.166154752830703
III. Programmation
------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec): 0.161
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec): 0.207333333333333
Grand common divisors of 400,000 pairs (recursion)__ (sec): 0.158333333333334
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec): 0.0279999999999999
Escoufier's method on a 45x45 matrix (mixed)________ (sec): 0.198
--------------------------------------------
Trimmed geom. mean (2 extremes eliminated): 0.171535682069976
Total time for all 15 tests_________________________ (sec): 3.47266666666667
Overall mean (sum of I, II and III trimmed means/3)_ (sec): 0.18582020669742
--- End of test ---
86.70user 133.15system 0:27.29elapsed 805%CPU (0avgtext+0avgdata 689512maxresident)k
17056inputs+0outputs (120major+123284minor)pagefaults 0swaps
Result of the benchmark test with NVBLAS
$ cat R-benchmark-25.R | LD_PRELOAD=/usr/local/cuda-12.1/targets/x86_64-linux/lib/libnvblas.so NVBLAS_CONFIG_FILE=/etc/nvblas.conf time R --slave
Loading required package: Matrix
Loading required package: SuppDists
Warning messages:
1: In remove("a", "b") : object 'a' not found
2: In remove("a", "b") : object 'b' not found
R Benchmark 2.5
===============
Number of times each test is run__________________________: 3
I. Matrix calculation
---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec): 0.545
2400x2400 normal distributed random matrix ^1000____ (sec): 0.163666666666667
Sorting of 7,000,000 random values__________________ (sec): 0.573
[NVBLAS] NVBLAS_CONFIG_FILE environment variable is set to '/etc/nvblas.conf'
2800x2800 cross-product matrix (b = a' * a)_________ (sec): 0.709999999999999
Linear regr. over a 3000x3000 matrix (c = a b')___ (sec): 0.282999999999999
--------------------------------------------
Trimmed geom. mean (2 extremes eliminated): 0.445429715785632
II. Matrix functions
--------------------
FFT over 2,400,000 random values____________________ (sec): 0.131666666666667
Eigenvalues of a 640x640 random matrix______________ (sec): 1.068
Determinant of a 2500x2500 random matrix____________ (sec): 0.197666666666665
Cholesky decomposition of a 3000x3000 matrix________ (sec): 0.0990000000000002
Inverse of a 1600x1600 random matrix________________ (sec): 1.24633333333333
--------------------------------------------
Trimmed geom. mean (2 extremes eliminated): 0.302919229912292
III. Programmation
------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec): 0.146333333333335
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec): 0.158999999999999
Grand common divisors of 400,000 pairs (recursion)__ (sec): 0.162
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec): 0.0300000000000011
Escoufier's method on a 45x45 matrix (mixed)________ (sec): 0.203000000000003
--------------------------------------------
Trimmed geom. mean (2 extremes eliminated): 0.155627067761755
Total time for all 15 tests_________________________ (sec): 5.71766666666667
Overall mean (sum of I, II and III trimmed means/3)_ (sec): 0.275886461182068
--- End of test ---
110.12user 228.86system 0:35.02elapsed 967%CPU (0avgtext+0avgdata 890968maxresident)k
79000inputs+1824outputs (468major+158333minor)pagefaults 0swaps
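To see which tests account for the difference, the per-test lines of the two runs can be lined up. A sketch, assuming each run's output has been saved to a file; `compare_bench` is a hypothetical helper, and the demo feeds it three lines copied from the results above:

```shell
# compare_bench (hypothetical helper): for every "... (sec): value" line that
# appears in both logs, print how many times slower the second run was.
compare_bench() {
  # $1: log of the run without NVBLAS, $2: log of the run with NVBLAS
  awk -F' *[(]sec[)]: ' '
    NF != 2 { next }                      # keep only "... (sec): value" lines
    { name = $1; sub(/_+$/, "", name) }   # drop the underscore padding
    FNR == NR { base[name] = $2 + 0; next }
    # skip tests whose baseline is not positive (e.g. the negative
    # determinant timing in the run without NVBLAS)
    (name in base) && base[name] > 0 {
      printf "%.1fx  %s\n", ($2 + 0) / base[name], name
    }
  ' "$1" "$2"
}

# Demo with three lines from each run in this post
without_log="$(mktemp)"
with_log="$(mktemp)"
cat > "$without_log" <<'EOF'
2800x2800 cross-product matrix (b = a' * a)_________ (sec): 0.0686666666666665
Determinant of a 2500x2500 random matrix____________ (sec): -0.0256666666666672
Inverse of a 1600x1600 random matrix________________ (sec): 0.160333333333333
EOF
cat > "$with_log" <<'EOF'
2800x2800 cross-product matrix (b = a' * a)_________ (sec): 0.709999999999999
Determinant of a 2500x2500 random matrix____________ (sec): 0.197666666666665
Inverse of a 1600x1600 random matrix________________ (sec): 1.24633333333333
EOF
compare_bench "$without_log" "$with_log"
rm -f "$without_log" "$with_log"
```

On these lines the cross-product test comes out roughly 10 times slower and the inverse roughly 8 times slower with NVBLAS, while sections I (first three tests) and III are essentially unchanged between the two runs.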
R session info
> sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: Asia/Tokyo
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.4.1
Appendix: The steps I took to build a WSL2 environment with CUDA 12.1 and OpenBLAS
- Installed Ubuntu 24.04 on WSL2.
- Installed the latest NVIDIA Driver from NVIDIA’s website on the native Windows system.
- Verified that `libcuda.so` is only located in `/usr/lib/wsl/lib/libcuda.so` using `find /usr/ -name libcuda.so` at this step.
- Followed the CUDA on WSL guide:
  - Removed the existing key: `sudo apt-key del 7fa2af80`.
  - Executed the following commands as per the CUDA 12.1.1 installation guide:
        wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
        sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
        wget https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda-repo-wsl-ubuntu-12-1-local_12.1.1-1_amd64.deb
        sudo dpkg -i cuda-repo-wsl-ubuntu-12-1-local_12.1.1-1_amd64.deb
        sudo cp /var/cuda-repo-wsl-ubuntu-12-1-local/cuda-*-keyring.gpg /usr/share/keyrings/
        sudo apt-get update
        sudo apt-get -y install cuda
- Set the PATH:
    echo 'export PATH=/usr/local/cuda-12.1/bin${PATH:+:${PATH}}' >> ~/.bashrc
    echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64:${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}' >> ~/.bashrc
    source ~/.bashrc
- Configured WSL not to inherit the Windows PATH:
  - Edited `/etc/wsl.conf` to add:
        [interop]
        appendWindowsPath = false
  - Executed the following in Windows PowerShell: `wsl.exe --shutdown`
  - Rebooted Ubuntu 24.04.
- Verified that `libcuda.so` exists in `/usr/lib/wsl/lib/libcuda.so` and `/usr/local/cuda-12.1/targets/x86_64-linux/lib/stubs/libcuda.so`.
  - This may conflict with the CUDA on WSL documentation.
- Installed essential build tools:
sudo apt -y install build-essential gcc g++ make libtool texinfo dpkg-dev pkg-config gfortran
- Installed OpenBLAS following the OpenBLAS Wiki because I also wanted to install it:
    sudo apt update
    sudo apt install libopenblas-dev
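A note on the `${VAR:+:${VAR}}` idiom used in the "Set the PATH" step above: it appends `:$VAR` only when the variable is already set and non-empty. The `LD_LIBRARY_PATH` line in that step has an additional literal `:` in front of the expansion, which leaves an empty entry in the result. The sketch below shows both forms side by side; the two function names are hypothetical and exist only for this illustration:

```shell
# prepend_dir (hypothetical helper): the idiom as used for PATH above.
prepend_dir() {
  # $1: directory to prepend, $2: current value of the variable (may be empty)
  local dir="$1"
  local cur="$2"
  echo "${dir}${cur:+:${cur}}"
}

# prepend_dir_extra_colon (hypothetical helper): the same idiom with a
# literal ":" before the expansion, as in the LD_LIBRARY_PATH line above.
prepend_dir_extra_colon() {
  local dir="$1"
  local cur="$2"
  echo "${dir}:${cur:+:${cur}}"
}

prepend_dir /usr/local/cuda-12.1/lib64 ""
# /usr/local/cuda-12.1/lib64
prepend_dir /usr/local/cuda-12.1/lib64 /opt/lib
# /usr/local/cuda-12.1/lib64:/opt/lib
prepend_dir_extra_colon /usr/local/cuda-12.1/lib64 ""
# /usr/local/cuda-12.1/lib64:
prepend_dir_extra_colon /usr/local/cuda-12.1/lib64 /opt/lib
# /usr/local/cuda-12.1/lib64::/opt/lib
```

The dynamic loader treats an empty `LD_LIBRARY_PATH` entry as the current working directory, so the extra colon is probably unrelated to the slowdown, but it may be worth removing.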