Issue when saving/loading custom tf.keras.Model
I am building an autoencoder implemented as a custom tf.keras.Model. The model performs well after training, but I haven’t been able to save and reload it properly. I have tried both model.save() and save_weights(), but in both cases the reloaded model completely fails to perform its task.
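A minimal sketch of the save_weights() round trip in question, assuming a toy model: `TinyAutoencoder`, its layer sizes, and the helper `roundtrip_weights` are hypothetical stand-ins, not the actual autoencoder. In this sketch the fresh instance is called once on sample input so that its variables exist before the HDF5 weights are loaded, and the file uses the `.weights.h5` suffix that recent Keras releases expect.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf


class TinyAutoencoder(tf.keras.Model):
    """Hypothetical stand-in for the real autoencoder (8 -> 2 -> 8)."""

    def __init__(self, latent_dim=2, **kwargs):
        super().__init__(**kwargs)
        self.encoder = tf.keras.layers.Dense(latent_dim, activation="relu")
        self.decoder = tf.keras.layers.Dense(8)

    def call(self, inputs):
        return self.decoder(self.encoder(inputs))


def roundtrip_weights(x):
    """Train briefly, save weights, reload into a new instance, compare outputs."""
    trained = TinyAutoencoder()
    trained.compile(optimizer="adam", loss="mse")
    trained.fit(x, x, epochs=1, verbose=0)
    before = trained.predict(x, verbose=0)

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "autoencoder.weights.h5")
        trained.save_weights(path)

        restored = TinyAutoencoder()
        # Call the new model once so its variables are created before loading.
        restored(tf.constant(x[:1]))
        restored.load_weights(path)

    after = restored.predict(x, verbose=0)
    return before, after


if __name__ == "__main__":
    data = np.random.default_rng(0).normal(size=(16, 8)).astype("float32")
    out_before, out_after = roundtrip_weights(data)
    print("max abs difference:", float(np.max(np.abs(out_before - out_after))))
```

If the two outputs differ noticeably in a check like this, the weights were probably not restored into the layers that produced the original predictions.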
First epoch processes the dataset twice, but other epochs do not, when doing distributed training with TensorFlow & model.fit
At full production scale, this seems to be the root cause of a GPU OOM error that happens during the first epoch.
See the fully working code and logs below. Why is Epoch 1/3 going through the dataset twice and not behaving like epochs 2 & 3?