I’m trying to use GPU time-sharing by following the instructions here, but my workloads will not run on the time-sharing enabled nodes.
I have a node pool with a GPU configuration, with GPU sharing enabled, Time-sharing as the strategy, and “Max shared clients per GPU” set to 48. The nodes run fine, but I’m unable to run workloads on them using the documented nodeSelector config for my workload, e.g.
nodeSelector:
  cloud.google.com/gke-accelerator: "nvidia-tesla-t4"
  cloud.google.com/gke-max-shared-clients-per-gpu: "48"
  cloud.google.com/gke-gpu-sharing-strategy: time-sharing
With this, my pods get stuck in Pending status with the message "x nodes didn't match Pod's node affinity/selector". If I remove the gke-max-shared-clients-per-gpu and gke-gpu-sharing-strategy key pairs, the pod schedules and runs fine.
When I check the Kubernetes labels on the nodes in the GPU time-sharing node pool, they do NOT include these two labels, and I can’t add them manually because GCP prevents it.
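
As far as I understand it, this matches how nodeSelector works: a pod can only be scheduled on a node whose labels contain every key/value pair in the selector. Below is a minimal Python sketch of that rule, not the scheduler's actual code. The node label set is hypothetical; the only thing taken from my cluster is that the two sharing labels are absent. The helper names selector_matches and missing_keys are made up for illustration.

```python
def selector_matches(node_labels, node_selector):
    """Return True only if every selector key/value pair is present on the node."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


def missing_keys(node_labels, node_selector):
    """List the selector keys the node does not satisfy, sorted for readability."""
    return sorted(
        key for key, value in node_selector.items() if node_labels.get(key) != value
    )


# Hypothetical labels: the accelerator label is present, the sharing labels are not.
node_labels = {"cloud.google.com/gke-accelerator": "nvidia-tesla-t4"}

# The documented selector from my workload.
full_selector = {
    "cloud.google.com/gke-accelerator": "nvidia-tesla-t4",
    "cloud.google.com/gke-max-shared-clients-per-gpu": "48",
    "cloud.google.com/gke-gpu-sharing-strategy": "time-sharing",
}

# The selector after removing the two sharing keys.
reduced_selector = {"cloud.google.com/gke-accelerator": "nvidia-tesla-t4"}

print(selector_matches(node_labels, full_selector))     # pod stays Pending
print(selector_matches(node_labels, reduced_selector))  # pod schedules
print(missing_keys(node_labels, full_selector))
```

So the scheduling failure itself looks expected given the labels; the open question is why the node pool does not carry the sharing labels in the first place.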
Any suggestions are appreciated!