Loss function shows NaN when running across multiple GPUs in PyTorch Lightning
I'm trying to port a model I wrote for solving the Allen-Cahn diffusion equation to PyTorch Lightning. When I run the code on one GPU, the program runs as expected, but when I use DDP to run it across more than one GPU (I have 8 available), the loss comes out as NaN, though only every few iterations. Separately, if I run the code for more than 1500 epochs, VS Code crashes. Any suggestions on what I can do to fix either problem?
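
To narrow down which iterations go bad, a check along these lines might help. This is only a minimal sketch, not my actual model code: `find_non_finite_steps` is a hypothetical helper, and the loss values below are made-up placeholders rather than real output from my run.

```python
import math


def find_non_finite_steps(losses):
    """Return the indices of steps whose loss is NaN or +/-inf."""
    bad_steps = []
    for step, loss in enumerate(losses):
        # float() also accepts a 0-dim tensor, so per-step loss values
        # (e.g. collected with loss.item()) can be passed in directly.
        if not math.isfinite(float(loss)):
            bad_steps.append(step)
    return bad_steps


# Placeholder values for illustration only (not real training output).
logged = [0.52, 0.47, float("nan"), 0.44, float("inf"), 0.41]
bad = find_non_finite_steps(logged)
print(bad)
```

Running this separately on the losses logged by each rank should show whether the NaN values come from one GPU in particular or from all of them.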