Distributed Training using PyTorch
I am using PyTorch’s multiprocessing framework to distribute my training across multiple GPUs. I’m splitting the work over the batch dimension, so each GPU computes gradients on its own independent mini-batch. I then average the gradients across the GPUs using PyTorch’s all_reduce function.
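A simplified sketch of what my setup looks like (the model, data, batch size, and learning rate here are just placeholders standing in for my real training loop):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn

def train(rank, world_size):
    # One process per GPU; NCCL backend for GPU collectives.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = nn.Linear(1024, 10).cuda(rank)  # placeholder model
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(100):
        # Each rank draws its own independent mini-batch (random data here).
        x = torch.randn(32, 1024, device=f"cuda:{rank}")
        y = torch.randint(0, 10, (32,), device=f"cuda:{rank}")

        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(x), y)
        loss.backward()

        # Manually average gradients across GPUs with all_reduce.
        for p in model.parameters():
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size)
```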
However, the backward passes slow down significantly compared to single-GPU training.