DeepSpeed GPU memory not balanced

I am fine-tuning a microsoft/deberta-v3-large classification model using DeepSpeed on Linux with 8 NVIDIA V100 GPUs, with the maximum sequence length set to 1024. Here is my DeepSpeed JSON config. During training, GPU memory usage is not balanced: as you can see below, the GPU at rank 5 uses more memory than the others, so I cannot use a larger batch size. I want to know: