When training my Mamba model on 4 GPUs with DistributedDataParallel, I ran validation after the first training epoch. The validation in the cuda:3 process always produced NaN values, and NaNs then appeared in the subsequent training steps as well. It is worth noting that training runs in four processes, and one of them is slower than the others: before that process had finished the first epoch, the other processes had already started validating the model. I don't know whether this desynchronization is the cause of the behavior above.
I checked the model's outputs: there were no NaN values anywhere in the first training epoch until validation began.
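For reference, this is roughly how my epoch loop is structured, with the `dist.barrier()` call I am considering adding so that no rank starts validation before every rank has finished the epoch. Function and variable names here are illustrative, not my exact code; the barrier is a guess at a fix, not something I have confirmed solves the NaN issue:

```python
import torch
import torch.nn as nn
import torch.distributed as dist

def train_and_validate(model, train_batches, val_batches, optimizer, loss_fn):
    # One epoch of training
    model.train()
    for x, y in train_batches:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

    # Candidate fix: make every rank wait here so the slow rank finishes
    # its epoch before any rank enters validation. This is a no-op when
    # running single-process (no process group initialized).
    if dist.is_available() and dist.is_initialized():
        dist.barrier()

    # Validation
    model.eval()
    val_losses = []
    with torch.no_grad():
        for x, y in val_batches:
            val_losses.append(loss_fn(model(x), y).item())
    return val_losses
```

When launched single-process the barrier is skipped, so the same function can be used to sanity-check the loop locally before running it under DDP.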