I was fine-tuning a microsoft/deberta-v3-large classification model with DeepSpeed on Linux, using 8× NVIDIA V100 GPUs with the max sequence length set to 1024; my DeepSpeed JSON config is below. During training the GPU memory usage is not balanced: the rank 5 GPU uses more memory than the others, and this stops me from using a larger batch size. I want to know:

- Why is the GPU memory not balanced? DeepSpeed ZeRO stage 2 is supposed to partition the optimizer states and gradients across ranks (see the logging sketch after this list for how I measure per-rank usage).
- Why rank 5 and not another GPU? I have tried several times, and it is always the rank 5 GPU that uses more memory.
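For reference, this is roughly how I log per-rank memory to see the imbalance; a minimal sketch, assuming torch.distributed is already initialized by the DeepSpeed launcher (the helper name `log_gpu_memory` and the call site are mine):

```python
import torch
import torch.distributed as dist

def log_gpu_memory(tag: str) -> None:
    """Print current and peak CUDA memory for this rank, so per-rank imbalance shows up."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    alloc = torch.cuda.memory_allocated() / 2**30      # tensors currently allocated, GiB
    peak = torch.cuda.max_memory_allocated() / 2**30   # peak since process start, GiB
    print(f"[rank {rank}] {tag}: allocated={alloc:.2f} GiB, peak={peak:.2f} GiB")

# Called once per logging interval inside the training loop, e.g.:
# log_gpu_memory("after optimizer.step()")
```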
"bfloat16": {
"enabled": false
},
"fp16": {
"enabled": true,
"auto_cast": false,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1,
"consecutive_hysteresis": false
},
"optimizer": {
"type": "AdamW",
"torch_adam": true,
"params": {
"lr": 5e-5
}
},
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "none",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true
},
"gradient_accumulation_steps": 1,
"gradient_clipping": 1.0,
"train_micro_batch_size_per_gpu": 2,
"steps_per_print": 1e5,
"wandb": {
"enabled": false,
"team": "deepspeed",
"group": "competitions",
"project": "aes"
}
}
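For completeness, I launch with the standard launcher, `deepspeed --num_gpus=8 train.py` (`train.py` is a placeholder name), and the engine creation looks roughly like this; a minimal sketch, assuming the config is saved as `ds_config.json` and `num_labels=2` stands in for my real label count:

```python
import deepspeed
from transformers import AutoModelForSequenceClassification

# The model being fine-tuned; num_labels=2 is a placeholder for my actual label count.
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-large", num_labels=2
)

# DeepSpeed builds the AdamW optimizer and the ZeRO-2 partitioning from the JSON config.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",  # the config shown above
)
```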
I have also tried float32 instead of float16, but then I get a CUDA OOM error.
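For context on that OOM, my back-of-envelope math for fp32 training under ZeRO-2 (approximate; ~435M parameters is the figure from the DeBERTa-v3-large model card, and activation memory is not included):

```python
# Rough per-GPU memory for fp32 training with ZeRO-2 across 8 ranks.
# DeBERTa-v3-large has ~435M parameters (~304M backbone + ~131M embeddings).
params = 435e6
weights_fp32 = params * 4        # full fp32 weights replicated on every rank
grads_fp32 = params * 4 / 8      # ZeRO-2 partitions gradients across the 8 ranks
adam_fp32 = params * 8 / 8       # ...and the Adam m/v states (8 bytes per param)

total_gib = (weights_fp32 + grads_fp32 + adam_fp32) / 2**30
print(f"~{total_gib:.1f} GiB per GPU before activations")  # ~2.2 GiB
# Activations at batch 2 x seq length 1024 then come on top; in full fp32 they are
# roughly twice the fp16 footprint, which is presumably what triggers the OOM.
```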