HuggingFace accelerate device error when running evaluation

I am running some experiments on a multi-GPU cluster using accelerate. I'm trying to calculate some metrics after every batch iteration of the training dataloader. The training code itself seems to work fine with accelerate (it uses multiple GPUs), but I run into an error when calculating these metrics. It seems that, after a forward pass during evaluation, the output tensors end up on a different device than the input tensors. The code that raises the error is the following: