I'm trying to integrate PyTorch Lightning into a model I have written to solve the Allen-Cahn diffusion equation. When I run the code on one GPU, the program runs as expected, but when I use DDP to run it across more than one GPU (I have 8 accessible to me), the loss becomes NaN, but only every few iterations. Also, if I run my code for more than 1500 epochs, VS Code crashes. Any suggestions on what I can do to fix it?
I have tried changing the learning rate and checking values at every step, and it seems that the GPUs are out of sync with each other: at this step,

```python
# initial condition loss
mask = (t == 0.0)
u_int = x[mask]**2 * torch.cos(torch.pi * x[mask])
print(u[mask])
IC_loss = loss_fun(u[mask], u_int)
```

`u[mask]` is empty.
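
For reference, a stripped-down sketch of how this loss sits inside my Lightning training step looks roughly like the code below. The network, `loss_fun` (shown here as MSE), and the `(x, t)` batch layout are simplified stand-ins, not the exact code:

```python
import torch
import torch.nn as nn
import pytorch_lightning as pl

class AllenCahnPINN(pl.LightningModule):
    """Simplified placeholder for the real model; only the IC-loss step matters here."""

    def __init__(self):
        super().__init__()
        # placeholder network mapping (x, t) -> u; the real architecture differs
        self.net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))
        self.loss_fun = nn.MSELoss()

    def training_step(self, batch, batch_idx):
        x, t = batch                            # each of shape (N, 1)
        u = self.net(torch.cat([x, t], dim=1))  # predicted solution u(x, t)

        # initial condition loss (the step from the question)
        mask = (t == 0.0)
        u_int = x[mask]**2 * torch.cos(torch.pi * x[mask])
        print(u[mask])  # comes back empty when this rank's shard has no t == 0 points
        IC_loss = self.loss_fun(u[mask], u_int)

        self.log("IC_loss", IC_loss)
        return IC_loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```

The trainer is launched with something like `pl.Trainer(accelerator="gpu", devices=8, strategy="ddp", max_epochs=1500)`.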