Getting NaN training and validation loss when training a BERT model in PyTorch
I am fine-tuning a pretrained BERT model for a NER task. When I set the device to cuda, the loss and gradients become NaN during backpropagation. This does not happen when the device is set to cpu or mps (I am on a Mac M1 chip). I am not sure what in my code could be causing this. Can anyone point me in the right direction?
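
Below is a minimal sketch of the kind of NaN check I can add to the loop to narrow down where the values first blow up (assuming a standard Hugging Face-style training loop; `model`, `optimizer`, and `train_loader` are placeholders for my real objects, not my actual code):

```python
import torch

def debug_nan_steps(model, optimizer, train_loader, device):
    """Run training steps and report where NaNs first appear.

    Assumes a Hugging Face-style model that returns `.loss` when the
    batch contains labels; `model`, `optimizer`, and `train_loader`
    are placeholders for the real objects in the training script.
    """
    # Make autograd raise an error at the exact op that produced
    # NaN/inf in the backward pass (slow, so debugging only).
    torch.autograd.set_detect_anomaly(True)

    model.to(device)
    model.train()

    for step, batch in enumerate(train_loader):
        batch = {k: v.to(device) for k, v in batch.items()}
        optimizer.zero_grad()

        loss = model(**batch).loss
        if torch.isnan(loss):
            print(f"NaN loss at step {step}")
            break

        loss.backward()

        # Inspect gradients before the optimizer step.
        for name, param in model.named_parameters():
            if param.grad is not None and torch.isnan(param.grad).any():
                print(f"NaN gradient in {name} at step {step}")

        optimizer.step()
```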