I have ~100 TensorFlow models to train, and for each one I run keras-tuner to find the best hyperparameters.
To save time, I would like to train one model per CPU core.
However, I think the parallel trainings are overwriting each other's checkpoints, because when I call get_best_models()
I get the following traceback:
1720 tuner.search(X_train, y_train, epochs=config.DIRECTOR_EPOCHS,
1721 validation_split=0.15, callbacks=[callbacks],
1722 verbose=0)
1724 # Let's extract the director model from the best fitted model
-> 1725 best_model = tuner.get_best_models()[0]
1726 director = keras.Model(inputs=best_model.get_layer('_').input,
1727 outputs=[
1728 best_model.get_layer('__').output,
1729 best_model.get_layer('___').output
1730 ],
1731 name=f'director_h{h}')
1732 self.directors[f'director_h{h}'] = director
File ~\PycharmProjects\Orchestra\omnienergy-orchestra\venv\lib\site-packages\keras_tuner\engine\tuner.py:366, in Tuner.get_best_models(self, num_models)
348 """Returns the best model(s), as determined by the tuner's objective.
349
350 The models are loaded with the weights corresponding to
(...)
363 List of trained model instances sorted from the best to the worst.
364 """
365 # Method only exists in this class for the docstring override.
--> 366 return super().get_best_models(num_models)
File ~\PycharmProjects\Orchestra\omnienergy-orchestra\venv\lib\site-packages\keras_tuner\engine\base_tuner.py:364, in BaseTuner.get_best_models(self, num_models)
349 """Returns the best model(s), as determined by the objective.
350
351 This method is for querying the models trained during the search.
(...)
361 List of trained models sorted from the best to the worst.
362 """
363 best_trials = self.oracle.get_best_trials(num_models)
--> 364 models = [self.load_model(trial) for trial in best_trials]
365 return models
File ~\PycharmProjects\Orchestra\omnienergy-orchestra\venv\lib\site-packages\keras_tuner\engine\base_tuner.py:364, in <listcomp>(.0)
349 """Returns the best model(s), as determined by the objective.
350
351 This method is for querying the models trained during the search.
(...)
361 List of trained models sorted from the best to the worst.
362 """
363 best_trials = self.oracle.get_best_trials(num_models)
--> 364 models = [self.load_model(trial) for trial in best_trials]
365 return models
File ~\PycharmProjects\Orchestra\omnienergy-orchestra\venv\lib\site-packages\keras_tuner\engine\tuner.py:297, in Tuner.load_model(self, trial)
294 # Reload best checkpoint.
295 # Only load weights to avoid loading `custom_objects`.
296 with maybe_distribute(self.distribution_strategy):
--> 297 model.load_weights(self._get_checkpoint_fname(trial.trial_id))
298 return model
File ~\PycharmProjects\Orchestra\omnienergy-orchestra\venv\lib\site-packages\keras\utils\traceback_utils.py:70, in filter_traceback.<locals>.error_handler(*args, **kwargs)
67 filtered_tb = _process_traceback_frames(e.__traceback__)
68 # To get the full stack trace, call:
69 # `tf.debugging.disable_traceback_filtering()`
---> 70 raise e.with_traceback(filtered_tb) from None
71 finally:
72 del filtered_tb
File ~\PycharmProjects\Orchestra\omnienergy-orchestra\venv\lib\site-packages\tensorflow\python\training\py_checkpoint_reader.py:31, in error_translator(e)
27 error_message = str(e)
28 if 'not found in checkpoint' in error_message or (
29 'Failed to find any '
30 'matching files for') in error_message:
---> 31 raise errors_impl.NotFoundError(None, None, error_message)
32 elif 'Sliced checkpoints are not supported' in error_message or (
33 'Data type '
34 'not '
35 'supported') in error_message:
36 raise errors_impl.UnimplementedError(None, None, error_message)
NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for tmp\untitled_project\trial_09\checkpoint
Interestingly, if I run two of the trainings on different hard disks, the error does not show up.
I tried looking for a way to rename the checkpoint files per model, but I couldn't find one.
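For context, the `untitled_project` in the error path is keras-tuner's default `project_name`, so every tuner launched with the defaults reads and writes the same trial/checkpoint folder. A minimal sketch of what I believe would avoid the collision, giving each parallel worker its own `directory`/`project_name` pair (the helper name and the `RandomSearch` usage are illustrative, not my actual code):

```python
def tuner_workspace(model_id, base_dir="tmp"):
    """Return a (directory, project_name) pair unique to one model,
    so parallel tuners never touch the same checkpoint files."""
    return base_dir, f"model_{model_id}"

# Inside each worker process, something like (any keras-tuner Tuner
# subclass accepts the same two keyword arguments):
#
# directory, project_name = tuner_workspace(model_id)
# tuner = keras_tuner.RandomSearch(
#     build_model,
#     objective="val_loss",
#     max_trials=10,
#     directory=directory,
#     project_name=project_name,  # unique per model -> no overwrites
# )
```

Since the checkpoints would then live under `tmp\model_<id>\` instead of a shared `tmp\untitled_project\`, this should also explain why using two different disks made the error disappear: the two tuners simply stopped sharing a path.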