Currently, when I’m training a net it will find a region where its learning step is too large, resulting in losses that are either NaN or Inf. Sometimes I can train for hundreds of epochs without this ever happening; sometimes it happens after 50-100 epochs.
My solution was to checkpoint the last few versions of my model and, as soon as I got an Inf or NaN loss, revert to an earlier model, set my learning_rate to half its current value, and continue.
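The trigger itself is just a finiteness check on the loss each epoch, something like this (the helper name is only for illustration, not my exact code):

import tensorflow as tf

def loss_went_bad(loss):
    # True if the loss has become NaN or Inf and training should roll back
    return not bool(tf.math.is_finite(loss))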
However, when I do so, I run out of memory. I’m trying to reload checkpointed models in this way:
# Initialize checkpointing
checkpoint = tf.train.Checkpoint(model=self.my_model)  # Track the model's variables
checkpoint_manager = tf.train.CheckpointManager(checkpoint,
                                                directory='./checkpoints',
                                                max_to_keep=10)
I then try to restore an earlier model as follows:
# Restore the model to an earlier checkpoint
all_checkpoints = checkpoint_manager.checkpoints  # paths of the retained checkpoints, oldest first
checkpoint_to_restore = all_checkpoints[-3]       # third-most-recent checkpoint
checkpoint.restore(checkpoint_to_restore)
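Roughly, the restore is meant to slot into my training loop like this (train_one_epoch, optimizer, and num_epochs are stand-ins for my actual code, not the exact implementation):

for epoch in range(num_epochs):
    loss = train_one_epoch(self.my_model, optimizer)  # stand-in for my real epoch loop
    if tf.math.is_finite(loss):
        checkpoint_manager.save()  # keep a rolling window of recent good models
    else:
        # Revert to an earlier model and halve the learning rate, as described above
        all_checkpoints = checkpoint_manager.checkpoints
        checkpoint.restore(all_checkpoints[-3])
        # Assumes a tf.keras optimizer whose learning_rate is backed by a variable
        optimizer.learning_rate.assign(optimizer.learning_rate * 0.5)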
The restore code seems pretty similar to the example given at Colab Checkpoint Example. I don’t know what else to try, other than, as a last resort, stopping my process and restarting it before loading the checkpointed net. Is there something else I can/should do?