I am currently training a DeepLabv3 variant with an EfficientNet backbone for semantic segmentation, on TF 2.16.1 with an RTX 4090. On some training runs, at a single random epoch (never the same one, but often around the 40th), I get a large spike in the training loss (SparseCategoricalCrossentropy), which creates an even bigger spike in the validation loss. It disappears a few epochs later without seemingly affecting the rest of the training much.
See the example plots: training loss, training IoU, validation loss, and validation IoU.
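For context, the setup is roughly equivalent to the sketch below; the B3 variant, input size, simplified decoder head, and optimizer are placeholders rather than my exact code, but the loss and metric match what I'm using:

```python
import tensorflow as tf
from tensorflow import keras

NUM_CLASSES = 21              # placeholder number of classes
INPUT_SHAPE = (512, 512, 3)   # placeholder input resolution

def build_model():
    # EfficientNet encoder (ImageNet weights) used as the DeepLabv3 backbone
    backbone = keras.applications.EfficientNetB3(
        include_top=False, weights="imagenet", input_shape=INPUT_SHAPE
    )
    x = backbone.output
    # Simplified stand-in for the ASPP + decoder: 1x1 projection, then
    # bilinear upsampling back to the input resolution (output stride 32)
    x = keras.layers.Conv2D(256, 1, padding="same", activation="relu")(x)
    x = keras.layers.UpSampling2D(size=32, interpolation="bilinear")(x)
    logits = keras.layers.Conv2D(NUM_CLASSES, 1, padding="same")(x)
    return keras.Model(backbone.input, logits)

model = build_model()
model.compile(
    optimizer=keras.optimizers.Adam(1e-4),
    # integer masks as labels, per-pixel logits as predictions
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[keras.metrics.MeanIoU(num_classes=NUM_CLASSES, sparse_y_pred=False)],
)
```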
I get these spikes about 80% of the time I train the model locally on the 4090. Does anyone have an idea where this could come from?
I tried different backbones for the encoder (EfficientNet B1 through B7, ResNet) and different batch sizes from 8 to 32, but the spikes happen almost every time on the 4090.
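The backbone sweep only swaps the encoder; the DeepLabv3 head, loss, and optimizer stay the same. Roughly like this (the names and input size are illustrative):

```python
from tensorflow import keras

INPUT_SHAPE = (512, 512, 3)  # assumed input size

# Illustrative factory for the backbone sweep
BACKBONES = {
    "efficientnet-b1": keras.applications.EfficientNetB1,
    "efficientnet-b4": keras.applications.EfficientNetB4,
    "efficientnet-b7": keras.applications.EfficientNetB7,
    "resnet50": keras.applications.ResNet50,
}

def build_encoder(name: str) -> keras.Model:
    return BACKBONES[name](
        include_top=False, weights="imagenet", input_shape=INPUT_SHAPE
    )

encoder = build_encoder("efficientnet-b4")
```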
I also trained on Colab with a T4 GPU a few times and never got any spikes, yet the end results were very similar, so the spikes don't seem to hurt the final performance.