I need to run multiple TensorFlow training tasks for molecular dynamics. I find that a single training task usually takes only about 8 GB of VRAM during training, which is only ~10% of an A100's 80 GB.
So I thought it would be more efficient to run multiple tasks at the same time, but it did not work out as I expected: the total time for running the tasks sequentially and in parallel shows no significant difference.
For example, the sequential script is
export CUDA_VISIBLE_DEVICES=0
dp train input.1.json
dp train input.2.json
dp train input.3.json
dp train input.4.json
and the parallel version is
export CUDA_VISIBLE_DEVICES=0
dp train input.1.json &
dp train input.2.json &
dp train input.3.json &
dp train input.4.json &
wait
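To confirm that all four dp train jobs really land on the same physical GPU and to see how much memory each one holds while the parallel version runs, a query along these lines should work (the field names are my reading of nvidia-smi --help-query-compute-apps, so adjust if I have them slightly wrong):
nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv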
I found that NVIDIA has the MIG feature, which can split a single GPU into several virtual ones, but it requires root permission, so it is not something I can use on the HPC cluster.
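For reference, my understanding from the NVIDIA docs is that setting up MIG would look roughly like the commands below (the 1g.10gb profile name is my assumption for an 80 GB A100), and these are exactly the commands I cannot run without admin rights:
sudo nvidia-smi -i 0 -mig 1
sudo nvidia-smi mig -i 0 -cgi 1g.10gb,1g.10gb,1g.10gb,1g.10gb -C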
Is there anything I can do to make these tasks run faster on a single A100?