
Question

I am running a Hugging Face transformer model on a Google Colab T4 GPU; my code is below. I use 5-fold cross-validation, and each fold trains for 4 epochs (first fold: epochs 1, 2, 3, 4; then second fold: epochs 1, 2, 3, 4; and so on). As you can see in the code, a checkpoint is saved at the end of each epoch.

def run_cross_validation(model_name='DistilRoBerta',
                         X=X,
                         y=y,
                         splits=5,
                         epoch=4,
                         checkpoint=False):

    kfold = StratifiedShuffleSplit(n_splits=splits, test_size=0.2, random_state=6617)
    # kfold = StratifiedKFold(n_splits=splits, shuffle=True, random_state=1127)
    n_fold = 1
    
    print("Developing Model with Cross validation for: " + model_name)
    for train, test in tqdm(kfold.split(X, y)):
    
        print("Running for Fold: ",n_fold)
        train_index = list(train)
        test_index = list(test)
    
        X_train = [X[i] for i in train_index]
        y_train = [y[i] for i in train_index]
        X_val = [X[i] for i in test_index]
        y_val = [y[i] for i in test_index]
    
        # Tokenize
        X_train_tokenized = tokenizer(X_train, padding=True, truncation=True, max_length=512)
        X_val_tokenized = tokenizer(X_val, padding=True, truncation=True, max_length=512)
    
        # Create torch dataset
        train_dataset = Dataset(X_train_tokenized, y_train)
        val_dataset = Dataset(X_val_tokenized, y_val)
    
        # Fine Tune Transformer
        # Define Trainer
        args = TrainingArguments(
            output_dir="/content/drive/My Drive/output_" + model_name + "/fold"+str(n_fold),
            evaluation_strategy="epoch",
            save_strategy="epoch",
            #eval_steps=500,
            #per_device_train_batch_size=1,
            #per_device_eval_batch_size=1,
            num_train_epochs=epoch, #1 was okay
            seed=1127,
            load_best_model_at_end=True,
        )
    
        trainer = Trainer(
            # model_init=model_init,
            model=model,
            args=args,
            train_dataset=train_dataset,
            eval_dataset=val_dataset,
            compute_metrics=compute_metrics,
            callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
        )
    
        trainer.train(resume_from_checkpoint=checkpoint)
        print("Complete for fold", n_fold)
        n_fold= n_fold + 1

However, after the first fold had saved the checkpoints for all 4 epochs to my Google Drive, the session crashed in the middle of epoch 1 of the second fold. I decided to continue from the 4th-epoch checkpoint of the first fold:

run_cross_validation(model_name='DistilRoBerta',X=X,y=y,splits=5,epoch=4,checkpoint=True)

I expected it to detect that the 4th epoch of the first fold had completed and then run epoch 1 of the second fold. Instead, it tries to resume from a checkpoint in the second fold and reports that no checkpoint was found (naturally, since epoch 1 of that fold has not finished yet), rather than starting epoch 1 from scratch. Why? The same problem occurs when I change checkpoint=True to checkpoint="/content/drive/My Drive/output_DistilRoBerta/fold1/checkpoint-32952".

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-448ffae712de> in <cell line: 1>()
1 run_cross_validation(model_name='DistilRoBerta',
2                          X=X,
3                          y=y,
4                          splits=5,
5                          epoch=4,

1 frames
<ipython-input-6-3c7143db8bb6> in run_cross_validation(model_name, X, y, splits, epoch, checkpoint)
199         )
200 
201         trainer.train(resume_from_checkpoint=checkpoint)
202         print("Complete for fold", n_fold)
203         n_fold= n_fold + 1

/usr/local/lib/python3.10/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1746             resume_from_checkpoint = get_last_checkpoint(args.output_dir)
1747             if resume_from_checkpoint is None:
1748                 raise ValueError(f"No valid checkpoint found in output directory ({args.output_dir})")
1749 
1750         if resume_from_checkpoint is not None:

ValueError: No valid checkpoint found in output directory (/content/drive/My Drive/output_DistilRoBerta/fold2)
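
Reading the traceback, resume_from_checkpoint=True appears to make the Trainer look for the newest checkpoint under the current fold's output_dir only, so fold 2 fails because nothing has been saved there yet. One possible workaround is to decide per fold whether there is anything to resume from. The following is only a minimal sketch under that assumption: find_last_checkpoint and resume_arg_for_fold are hypothetical helper names that are not part of my code or of transformers, and the sketch relies on the checkpoint-<step> folder naming seen in my Drive.

```python
import os
import re
from typing import Optional

# Checkpoint folders are assumed to be named "checkpoint-<global step>".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir: str) -> Optional[str]:
    """Return the checkpoint-<step> subfolder with the highest step, or None."""
    if not os.path.isdir(output_dir):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        path = os.path.join(output_dir, name)
        # Compare steps numerically so checkpoint-32952 beats checkpoint-8238.
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


def resume_arg_for_fold(base_dir: str, n_fold: int) -> Optional[str]:
    """Hypothetical helper: value to pass as resume_from_checkpoint for one fold.

    Returns a checkpoint path if the fold's folder already holds one,
    otherwise None, which tells Trainer.train to start that fold from scratch.
    """
    return find_last_checkpoint(os.path.join(base_dir, "fold" + str(n_fold)))
```

Inside the loop, the call would then become trainer.train(resume_from_checkpoint=resume_arg_for_fold("/content/drive/My Drive/output_" + model_name, n_fold)). With this, fold 2 would start at epoch 1 instead of raising the ValueError. Fold 1 would be resumed from its last saved checkpoint rather than skipped, so whether it finishes immediately would still need to be verified.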

View of my epoch checkpoints of first fold cross validation in GDrive

I appreciate any help, thanks in advance!
