I'm trying to SSH into a high-performance computing cluster from VS Code. I can connect and allocate compute nodes for myself without problems.
I connect using this command in the terminal:
srun --partition a100_dev --gres gpu:a100:2 --cpus-per-task 24 --mem 240G --ntasks-per-node 1 --nodes 1 --time 1:00:00 --pty bash
After doing so, running nvidia-smi in the terminal produces the output in the attached screenshot (correct, as far as I know).
[Screenshot: nvidia-smi output]
However, when I run !nvidia-smi in a kernel, I get an error: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
It seems to me like they are running in the same environment, so I am unsure as to what is causing this error.
How can I fix this issue?
I've tried restarting the environment, manually setting environment variables, and ensuring that they are all the same between the terminal environment and the kernel environment.
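A comparison like this could be sketched as follows. This is only a minimal illustration, not a definitive check: the helper names environment_snapshot and diff_snapshots are hypothetical, and the list of variables is an assumption rather than a complete set. The idea is to run the same snippet once with python in the srun terminal and once in a kernel cell, then compare the printed lines. If the hostnames differ, the kernel may not be running on the allocated node, although that is only one possible explanation.

```python
import os
import socket

# Variables chosen for illustration only (an assumption, not a complete list).
KEYS = (
    "SLURM_JOB_ID",
    "SLURM_JOB_NODELIST",
    "CUDA_VISIBLE_DEVICES",
    "PATH",
    "LD_LIBRARY_PATH",
)


def environment_snapshot(keys=KEYS):
    """Return the hostname plus selected environment variables of this process."""
    snapshot = {"hostname": socket.gethostname()}
    for key in keys:
        snapshot[key] = os.environ.get(key, "<unset>")
    return snapshot


def diff_snapshots(a, b):
    """Return {key: (a_value, b_value)} for every entry that differs."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


if __name__ == "__main__":
    # Print one line per entry so the two runs can be compared by eye.
    for key, value in environment_snapshot().items():
        print(f"{key}={value}")
```

If both snapshots are saved (for example as dictionaries pasted into one session), diff_snapshots(terminal_snapshot, kernel_snapshot) lists only the entries that differ.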
I have also been told by other people that we are not to install any NVIDIA drivers on the HPC cluster; they are already there and should not be modified.
Also, when I run nvcc --version in both the kernel and the terminal, I get the same correct output:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0