I was training a language model with Keras, in batches, on 1.5 million words. The job runs on a cluster, but I am working from a Jupyter notebook.
# Train the model using the data generator
history = model.fit(
    train_data_gen,
    epochs=epochs,
    steps_per_epoch=train_steps_per_epoch,
    validation_data=val_data_gen,
    validation_steps=val_steps_per_epoch,
    callbacks=[checkpoint_callback, csv_logger, backup_restore, early_stopping, reduce_lr],
    verbose=1,
)
After about 6 hours, training suddenly got stuck in the middle of an epoch. I had already run into this error earlier during those 6 hours, but each time training simply continued (the same has happened on other runs, with this file and with others). I still see the “*” indicator, but I think I will have to restart from my intermediate save. I also still need to figure out how to do that and how to keep the logged information, because I want a graph showing the training and validation loss over time.
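For the loss graph, here is a minimal sketch of what I have in mind. It assumes that `csv_logger` is a Keras `CSVLogger` writing its usual `epoch`, `loss` and `val_loss` columns. The file name `training_log.csv` and both helper names are placeholders of mine, not anything from my actual setup. matplotlib is only imported when the plot is drawn, so reading the log works without it.

```python
import csv


def read_loss_history(log_path):
    """Read per-epoch losses from a CSV log with 'epoch', 'loss' and 'val_loss' columns."""
    epochs, train_loss, val_loss = [], [], []
    with open(log_path, newline="") as f:
        # DictReader ignores any extra columns (e.g. a learning-rate column).
        for row in csv.DictReader(f):
            epochs.append(int(row["epoch"]))
            train_loss.append(float(row["loss"]))
            val_loss.append(float(row["val_loss"]))
    return epochs, train_loss, val_loss


def plot_loss_history(log_path, out_path="loss_curve.png"):
    """Plot training and validation loss per epoch and save the figure to out_path."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    epochs, train_loss, val_loss = read_loss_history(log_path)
    fig, ax = plt.subplots()
    ax.plot(epochs, train_loss, label="training loss")
    ax.plot(epochs, val_loss, label="validation loss")
    ax.set_xlabel("epoch")
    ax.set_ylabel("loss")
    ax.legend()
    fig.savefig(out_path)
    return out_path
```

If the logger is created with `append=True`, rows from before and after a restart should end up in the same file, so one call to `plot_loss_history("training_log.csv")` would cover the whole run.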
In any case, does anyone have ideas on how to fix the error? I am wondering whether my Wi-Fi connection might also be the cause.
I waited for about 20 minutes, but nothing happened, so I think I will try to resume from an intermediate save. I need the process to run smoothly, though, because I am planning to train a model on 10x the current data size and I do not have much time.
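To make the restart less manual, this is a rough sketch of how I could resume. It assumes the model has already been reloaded from the intermediate save, and that `csv_logger` is a Keras `CSVLogger` created with `append=True`, so the log from before the restart is not overwritten. The log file name and the helpers `next_epoch_from_log` and `resume_fit` are hypothetical names of mine. It also assumes the saved weights correspond to the last epoch in the log, which may not hold if the checkpoint only keeps the best model.

```python
import csv
import os


def next_epoch_from_log(log_path):
    """Return the epoch index to resume at: last logged epoch + 1, or 0 if there is no log yet."""
    if not os.path.exists(log_path):
        return 0
    with open(log_path, newline="") as f:
        logged = [int(row["epoch"]) for row in csv.DictReader(f)]
    return max(logged) + 1 if logged else 0


def resume_fit(model, log_path, epochs, **fit_kwargs):
    """Call model.fit starting at the first epoch that is not in the log yet.

    With initial_epoch set, Keras treats `epochs` as the index of the final
    epoch, not as the number of additional epochs to run.
    """
    return model.fit(
        initial_epoch=next_epoch_from_log(log_path),
        epochs=epochs,
        **fit_kwargs,
    )
```

The call would then look like `resume_fit(model, "training_log.csv", epochs, x=train_data_gen, steps_per_epoch=train_steps_per_epoch, validation_data=val_data_gen, validation_steps=val_steps_per_epoch, callbacks=[...], verbose=1)`. As far as I understand, the `backup_restore` callback already restores the weights and epoch number on its own when the same `fit` call is run again after an interruption, so this manual route would only matter when starting from the checkpoint file instead.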