I want to use multiple GPUs to train a model, because there is not enough memory in a single GPU card for my training.
To do this, I tried following these instructions: Multi-GPU distributed training with TensorFlow
I am trying to do this using segmentation_models, and my training function is as follows:
import tensorflow as tf
import segmentation_models as sm

# weighted_categorical_crossentropy is my own weighted cross-entropy loss, defined elsewhere.

def run_training(BACKBONE, n_classes, activation, metrics, EPOCHS,
                 X_train, X_test, y_train_cat, y_test_cat):
    # Create a MirroredStrategy.
    strategy = tf.distribute.MirroredStrategy()
    print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

    # Open a strategy scope and create/restore the model.
    with strategy.scope():
        model = sm.Unet(BACKBONE, encoder_weights='imagenet',
                        classes=n_classes, activation=activation)
        weights = [0, .1, .45, .45]
        loss = weighted_categorical_crossentropy(weights)
        model.compile(optimizer='adam', loss=loss, metrics=metrics)
        # model.compile(tf.keras.optimizers.legacy.Adam(), loss=loss, metrics=metrics)

    # Train outside the scope, as in the Keras multi-GPU guide.
    history = model.fit(X_train,
                        y_train_cat,
                        batch_size=4,
                        epochs=EPOCHS,
                        verbose=1,
                        validation_data=(X_test, y_test_cat))
    return history, model
Where BACKBONE is one of the backbones supported by segmentation_models.
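For reference, this is roughly how I call the function. The data shapes, backbone, and metric below are placeholders standing in for my real preprocessing pipeline, just so the question is self-contained:

import numpy as np
import tensorflow as tf
import segmentation_models as sm

# Illustrative data only -- my real arrays come from preprocessed images and masks.
n_classes = 4
X_train = np.random.rand(100, 256, 256, 3).astype('float32')
y_train_cat = tf.keras.utils.to_categorical(
    np.random.randint(0, n_classes, (100, 256, 256)), n_classes)
X_test = np.random.rand(20, 256, 256, 3).astype('float32')
y_test_cat = tf.keras.utils.to_categorical(
    np.random.randint(0, n_classes, (20, 256, 256)), n_classes)

# IOU metric from segmentation_models.
metrics = [sm.metrics.IOUScore(threshold=0.5)]

history, model = run_training(
    BACKBONE='resnet34',          # one of the supported backbones
    n_classes=n_classes,
    activation='softmax',
    metrics=metrics,
    EPOCHS=50,
    X_train=X_train,
    X_test=X_test,
    y_train_cat=y_train_cat,
    y_test_cat=y_test_cat,
)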
This code worked when I was not using the distributed processing strategy. However, when I run it with the distributed strategy, it runs and goes through the training, but I get invalid results.
Here is an example of the IOU vs epochs output.
When not using distributed processing, the validation IOU rises to around 0.5 or 0.6.
In other cases, I saw the training and validation IOU go to 0.5 or 0.6667 after a few epochs and stay constant.
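For comparison, the version that gave sensible results (single GPU, no distribution) is essentially the same function with the strategy scope removed. Sketched here, trimmed to the relevant part:

def run_training_single_gpu(BACKBONE, n_classes, activation, metrics, EPOCHS,
                            X_train, X_test, y_train_cat, y_test_cat):
    # Same model/compile/fit as above, just without tf.distribute.MirroredStrategy.
    model = sm.Unet(BACKBONE, encoder_weights='imagenet',
                    classes=n_classes, activation=activation)
    weights = [0, .1, .45, .45]
    loss = weighted_categorical_crossentropy(weights)
    model.compile(optimizer='adam', loss=loss, metrics=metrics)
    history = model.fit(X_train, y_train_cat,
                        batch_size=4,
                        epochs=EPOCHS,
                        verbose=1,
                        validation_data=(X_test, y_test_cat))
    return history, model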
What am I doing wrong, and can I use the segmentation_models package with multiple GPUs?