I have ~100 TensorFlow models to train, and for each one I run keras-tuner to find the best hyperparameters.
To save time, I would like to train one model per CPU core.
However, I think the parallel trainings are overwriting each other's checkpoints, because when I call get_best_models()
I get the following traceback:
1720 tuner.search(X_train, y_train, epochs=config.DIRECTOR_EPOCHS,
1721 validation_split=0.15, callbacks=[callbacks],
1722 verbose=0)
1724 # Let's extract the director model from the best fitted model
-> 1725 best_model = tuner.get_best_models()[0]
1726 director = keras.Model(inputs=best_model.get_layer('_').input,
1727 outputs=[
1728 best_model.get_layer('__').output,
1729 best_model.get_layer('___').output
1730 ],
1731 name=f'director_h{h}')
1732 self.directors[f'director_h{h}'] = director
File ~\PycharmProjects\Orchestra\omnienergy-orchestra\venv\lib\site-packages\keras_tuner\engine\tuner.py:366, in Tuner.get_best_models(self, num_models)
348 """Returns the best model(s), as determined by the tuner's objective.
349
350 The models are loaded with the weights corresponding to
(...)
363 List of trained model instances sorted from the best to the worst.
364 """
365 # Method only exists in this class for the docstring override.
--> 366 return super().get_best_models(num_models)
File ~\PycharmProjects\Orchestra\omnienergy-orchestra\venv\lib\site-packages\keras_tuner\engine\base_tuner.py:364, in BaseTuner.get_best_models(self, num_models)
349 """Returns the best model(s), as determined by the objective.
350
351 This method is for querying the models trained during the search.
(...)
361 List of trained models sorted from the best to the worst.
362 """
363 best_trials = self.oracle.get_best_trials(num_models)
--> 364 models = [self.load_model(trial) for trial in best_trials]
365 return models
File ~\PycharmProjects\Orchestra\omnienergy-orchestra\venv\lib\site-packages\keras_tuner\engine\base_tuner.py:364, in <listcomp>(.0)
349 """Returns the best model(s), as determined by the objective.
350
351 This method is for querying the models trained during the search.
(...)
361 List of trained models sorted from the best to the worst.
362 """
363 best_trials = self.oracle.get_best_trials(num_models)
--> 364 models = [self.load_model(trial) for trial in best_trials]
365 return models
File ~\PycharmProjects\Orchestra\omnienergy-orchestra\venv\lib\site-packages\keras_tuner\engine\tuner.py:297, in Tuner.load_model(self, trial)
294 # Reload best checkpoint.
295 # Only load weights to avoid loading `custom_objects`.
296 with maybe_distribute(self.distribution_strategy):
--> 297 model.load_weights(self._get_checkpoint_fname(trial.trial_id))
298 return model
File ~\PycharmProjects\Orchestra\omnienergy-orchestra\venv\lib\site-packages\keras\utils\traceback_utils.py:70, in filter_traceback.<locals>.error_handler(*args, **kwargs)
67 filtered_tb = _process_traceback_frames(e.__traceback__)
68 # To get the full stack trace, call:
69 # `tf.debugging.disable_traceback_filtering()`
---> 70 raise e.with_traceback(filtered_tb) from None
71 finally:
72 del filtered_tb
File ~\PycharmProjects\Orchestra\omnienergy-orchestra\venv\lib\site-packages\tensorflow\python\training\py_checkpoint_reader.py:31, in error_translator(e)
27 error_message = str(e)
28 if 'not found in checkpoint' in error_message or (
29 'Failed to find any '
30 'matching files for') in error_message:
---> 31 raise errors_impl.NotFoundError(None, None, error_message)
32 elif 'Sliced checkpoints are not supported' in error_message or (
33 'Data type '
34 'not '
35 'supported') in error_message:
36 raise errors_impl.UnimplementedError(None, None, error_message)
NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for tmp\untitled_project\trial_09\checkpoint
Interestingly, if I run two of the trainings on different hard disks, the error does not show up.
I tried looking for a way to rename the checkpoint files per model, but I couldn't find one.
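For context, the `untitled_project` in the error path is keras-tuner's default `project_name`, so every tuner launched with the defaults reads and writes the same trial/checkpoint folder. A minimal sketch of what I believe would avoid the collision, giving each parallel worker its own `directory`/`project_name` pair (the helper name and the `RandomSearch` usage are illustrative, not my actual code):

```python
def tuner_workspace(model_id, base_dir="tmp"):
    """Return a (directory, project_name) pair unique to one model,
    so parallel tuners never touch the same checkpoint files."""
    return base_dir, f"model_{model_id}"

# Inside each worker process, something like (any keras-tuner Tuner
# subclass accepts the same two keyword arguments):
#
# directory, project_name = tuner_workspace(model_id)
# tuner = keras_tuner.RandomSearch(
#     build_model,
#     objective="val_loss",
#     max_trials=10,
#     directory=directory,
#     project_name=project_name,  # unique per model -> no overwrites
# )
```

Since the checkpoints would then live under `tmp\model_<id>\` instead of a shared `tmp\untitled_project\`, this should also explain why using two different disks made the error disappear: the two tuners simply stopped sharing a path.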