I’m working on a project that uses a pre-trained SBERT model (specifically MiniLM) for a text classification task with 995 classes. I am following the steps laid out here for the most part, and everything seems to run.
My issue occurs when actually training the model. No matter what values I set in the training arguments, training always ends early and never completes all of the batches. For example, if I set num_train_epochs=1, it only gets up to 0.49 epochs; if num_train_epochs=4, it always ends at 3.49 epochs.
Here is my code:
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    SentenceTransformerModelCardData,
)
from sentence_transformers.losses import BatchAllTripletLoss
from sentence_transformers.training_args import BatchSamplers
from sentence_transformers.evaluation import TripletEvaluator

model = SentenceTransformer(
    "nreimers/MiniLM-L6-H384-uncased",
    model_card_data=SentenceTransformerModelCardData(
        language="en",
        license="apache-2.0",
        model_name="all-MiniLM-L6-v2",
    ),
)

loss = BatchAllTripletLoss(model)
# Loss overview: https://www.sbert.net/docs/sentence_transformer/loss_overview.html
# This particular loss method: https://www.sbert.net/docs/package_reference/sentence_transformer/losses.html#batchalltripletloss
# Training args: https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments

args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir="finetune/model20240924",
    # Optional training parameters:
    num_train_epochs=1,
    max_steps=-1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=1e-5,
    warmup_ratio=0.1,
    fp16=True,  # Set to False if you get an error that your GPU can't run on FP16
    bf16=False,  # Set to True if you have a GPU that supports BF16
    batch_sampler=BatchSamplers.GROUP_BY_LABEL,
    # Optional tracking/debugging parameters:
    eval_strategy="no",
    eval_steps=100,
    save_strategy="epoch",
    # save_steps=100,
    save_total_limit=2,
    logging_steps=100,
    run_name="miniLm-triplet",  # Will be used in W&B if `wandb` is installed
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=trainDataset,
    eval_dataset=devDataset,
    loss=loss,
    # evaluator=dev_evaluator,
)
trainer.train()
Note that I am not using an evaluator because we are building the model and testing it afterwards with a dedicated test set. My training dataset is structured as:
Dataset({
    features: ['Title', 'Body', 'label'],
    num_rows: 23961
})
with the dev dataset having the same structure, only with fewer rows. This gives the following output:
[1473/2996 57:06 < 59:07, 0.43 it/s, Epoch 0/1]
Step    Training Loss
100     1.265600
200     0.702700
300     0.633900
400     0.505200
500     0.481900
600     0.306800
700     0.535600
800     0.369800
900     0.265400
1000    0.345300
1100    0.516700
1200    0.372600
1300    0.392300
1400    0.421900
TrainOutput(global_step=1473, training_loss=0.5003972503496366, metrics={'train_runtime': 3427.9198, 'train_samples_per_second': 6.99, 'train_steps_per_second': 0.874, 'total_flos': 0.0, 'train_loss': 0.5003972503496366, 'epoch': 0.4916555407209613})
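The step counts line up with the fractional epoch (a quick sanity check, assuming the trainer estimates the steps per epoch from the dataset length and the batch size):
import math

# Assumption: steps per epoch is roughly len(train_dataset) / per_device_train_batch_size
num_rows = 23961                                  # rows in trainDataset
batch_size = 8                                    # per_device_train_batch_size
steps_per_epoch = math.ceil(num_rows / batch_size)
print(steps_per_epoch)                            # 2996, matching the progress bar total
print(1473 / steps_per_epoch)                     # ~0.49, matching the reported epoch
So the run stops because the dataloader runs out of batches well before the estimated step count is reached.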
No matter how much I adjust the values, I cannot get it to complete all of the batches. How can I resolve this issue?
I changed the batch_sampler training argument from
batch_sampler=BatchSamplers.GROUP_BY_LABEL
to
batch_sampler=BatchSamplers.NO_DUPLICATES
and the problem was solved. I had originally chosen GROUP_BY_LABEL because the documentation for this loss recommends it, but switching samplers cleared the issue up, presumably because GROUP_BY_LABEL was yielding fewer batches per epoch than the trainer expected from the dataset size.
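A minimal sketch of the fix (the remaining arguments stay as in the question; some are omitted here for brevity):
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="finetune/model20240924",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=1e-5,
    warmup_ratio=0.1,
    fp16=True,
    # The only change from the original configuration:
    batch_sampler=BatchSamplers.NO_DUPLICATES,
    save_strategy="epoch",
    save_total_limit=2,
    logging_steps=100,
    run_name="miniLm-triplet",
)
With this sampler the progress bar reaches the full estimated step count and the run completes every epoch.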