I am fine-tuning a Hugging Face transformer model on a Google Colab T4 GPU, and my code is below. I use 5 cross-validation folds, and each fold trains for 4 epochs (fold 1: epochs 1, 2, 3, 4; then fold 2: epochs 1, 2, 3, 4; and so on). As the code shows, a checkpoint is saved at the end of every epoch.
def run_cross_validation(model_name='DistilRoBerta',
                         X=X,
                         y=y,
                         splits=5,
                         epoch=4,
                         checkpoint=False):
    kfold = StratifiedShuffleSplit(n_splits=splits, test_size=0.2, random_state=6617)
    # kfold = StratifiedKFold(n_splits=splits, shuffle=True, random_state=1127)
    n_fold = 1
    print("Developing Model with Cross validation for: " + model_name)
    for train, test in tqdm(kfold.split(X, y)):
        print("Running for Fold: ", n_fold)
        train_index = list(train)
        test_index = list(test)
        X_train = [X[i] for i in train_index]
        y_train = [y[i] for i in train_index]
        X_val = [X[i] for i in test_index]
        y_val = [y[i] for i in test_index]

        # Tokenize
        X_train_tokenized = tokenizer(X_train, padding=True, truncation=True, max_length=512)
        X_val_tokenized = tokenizer(X_val, padding=True, truncation=True, max_length=512)

        # Create torch dataset
        train_dataset = Dataset(X_train_tokenized, y_train)
        val_dataset = Dataset(X_val_tokenized, y_val)

        # Fine Tune Transformer
        # Define Trainer
        args = TrainingArguments(
            output_dir="/content/drive/My Drive/output_" + model_name + "/fold" + str(n_fold),
            evaluation_strategy="epoch",
            save_strategy="epoch",
            #eval_steps=500,
            #per_device_train_batch_size=1,
            #per_device_eval_batch_size=1,
            num_train_epochs=epoch,  #1 was okay
            seed=1127,
            load_best_model_at_end=True,
        )
        trainer = Trainer(
            # model_init=model_init,
            model=model,
            args=args,
            train_dataset=train_dataset,
            eval_dataset=val_dataset,
            compute_metrics=compute_metrics,
            callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
        )

        trainer.train(resume_from_checkpoint=checkpoint)
        print("Complete for fold", n_fold)
        n_fold = n_fold + 1
However, after the first fold had saved the checkpoints for all 4 epochs to my Google Drive, the session crashed partway through epoch 1 of the second fold. I decided to resume from the epoch 4 checkpoint of the first fold:
run_cross_validation(model_name='DistilRoBerta',X=X,y=y,splits=5,epoch=4,checkpoint=True)
I expected it to check that epoch 4 of the first fold had completed and then run epoch 1 of the second fold. Instead, it looks for a checkpoint in the second fold's directory and fails because no valid checkpoint is found there (naturally, since epoch 1 of that fold never finished), rather than starting epoch 1 from scratch. Why does this happen? The same problem occurs when I change checkpoint=True to checkpoint="/content/drive/My Drive/output_DistilRoBerta/fold1/checkpoint-32952".
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-10-448ffae712de> in <cell line: 1>()
1 run_cross_validation(model_name='DistilRoBerta',
2 X=X,
3 y=y,
4 splits=5,
5 epoch=4,
1 frames
<ipython-input-6-3c7143db8bb6> in run_cross_validation(model_name, X, y, splits, epoch, checkpoint)
199 )
200
201 trainer.train(resume_from_checkpoint=checkpoint)
202 print("Complete for fold", n_fold)
203 n_fold= n_fold + 1
/usr/local/lib/python3.10/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1746 resume_from_checkpoint = get_last_checkpoint(args.output_dir)
1747 if resume_from_checkpoint is None:
1748 raise ValueError(f"No valid checkpoint found in output directory ({args.output_dir})")
1749
1750 if resume_from_checkpoint is not None:
ValueError: No valid checkpoint found in output directory (/content/drive/My Drive/output_DistilRoBerta/fold2)
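Reading the traceback, resume_from_checkpoint=True appears to make the Trainer call get_last_checkpoint(args.output_dir) on the output directory of the current fold and raise when nothing is found there. A minimal, standard-library-only sketch of a per-fold lookup is below. It assumes checkpoints are folders named checkpoint-<step> inside each fold directory; find_last_checkpoint is a hypothetical helper, not part of transformers, and every step number in the demo other than 32952 is made up for illustration.

```python
import os
import re
import tempfile

# Assumption: checkpoints are saved as "checkpoint-<global_step>" folders.
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir):
    """Hypothetical helper: newest checkpoint folder in output_dir, or None."""
    if not os.path.isdir(output_dir):
        return None
    steps = []
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            steps.append(int(match.group(1)))
    if not steps:
        return None
    # Pick the highest step number, not the alphabetically last name.
    return os.path.join(output_dir, f"checkpoint-{max(steps)}")


# Demo on a throwaway layout: fold1 has four checkpoints, fold2 has none.
root = tempfile.mkdtemp()
for step in (8238, 16476, 24714, 32952):  # illustrative step numbers
    os.makedirs(os.path.join(root, "fold1", f"checkpoint-{step}"))
os.makedirs(os.path.join(root, "fold2"))

for fold in ("fold1", "fold2"):
    fold_dir = os.path.join(root, fold)
    # A path means "resume from here"; None means "start this fold fresh".
    print(fold, "->", find_last_checkpoint(fold_dir))
```

Inside the loop, the result could be passed per fold, for example trainer.train(resume_from_checkpoint=find_last_checkpoint(args.output_dir)), so that a fold with no saved checkpoint starts from epoch 1 instead of raising. I have not verified how the Trainer behaves when a fold that already finished is resumed from its final checkpoint, so skipping completed folds may need a separate check.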
[Screenshot: my epoch checkpoints for the first cross-validation fold in Google Drive]
I appreciate any help, thanks in advance!