Out-of-memory problem when using dist.all_gather
I’m writing code for multi-GPU training, and I need to gather embeddings from the different GPUs to calculate the loss and then propagate the gradients back to each GPU. However, when the program reaches optimizer.step(), memory usage increases dramatically and results in an out-of-memory error. The code is below, thanks!