I have two Kubernetes clusters that use GPUs. Both clusters were working correctly until recently, when pods in one of them began getting stuck in the “starting” state during deployment.
When describing the pods, I receive the following error message:
failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: requirement error: unsatisfied condition: cuda>=12.3, please update your driver to a newer version, or use an earlier cuda container: unknown
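In case it is relevant: as far as I understand, the cuda>=12.3 requirement comes from the container image itself, via the NVIDIA_REQUIRE_CUDA environment variable that the NVIDIA container toolkit checks at container start. A check along these lines (the image name is just a placeholder for mine), run from any machine that can pull the image, should confirm which requirement the image declares:

# Print the image's environment and look for the CUDA constraint the hook enforces.
docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' my-registry/my-gpu-image:latest | grep NVIDIA_REQUIRE_CUDA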
I verified the GPU driver and CUDA versions using the nvidia-smi command, which shows:
NVIDIA-SMI 510.54 Driver Version: 510.54 CUDA Version: 11.6
Interestingly, my other Kubernetes cluster runs without issues on the same NVIDIA driver and CUDA version.
Here are my questions:
1. Why would this error suddenly appear when the setup was previously working with the same versions?
2. How can I resolve the requirement for CUDA >= 12.3, given that my driver and CUDA versions have not changed recently?
3. Is there a way to force the pod to run with the existing CUDA version without upgrading, considering the other cluster works fine under the same conditions? (A rough idea of what I mean is sketched below.)
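For question 3, what I have in mind is something like disabling the requirement check for that workload through the NVIDIA_DISABLE_REQUIRE environment variable, though I am not sure whether that is safe or just hides a real image/driver mismatch. A rough sketch of the idea (the deployment name is a placeholder):

# Hypothetical workaround: skip the CUDA requirement check for one deployment.
# my-gpu-deployment is a placeholder for the actual Deployment name.
kubectl set env deployment/my-gpu-deployment NVIDIA_DISABLE_REQUIRE=true

My concern is that the image presumably targets CUDA 12.3, so skipping the check might only move the failure into the application itself.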