I am currently fine-tuning a distilled GPT-2 model with approximately 130M parameters on eight years' worth of my WhatsApp chats. I have prepared a labeled dataset of around 1 million sequences of 512 tokens each. Below is the portion of code I use for training (a sketch of how the dataset tensors are built follows the code):
import torch
from torch import autocast
from torch.cuda.amp import GradScaler
from torch.optim import AdamW
from transformers import GPT2LMHeadModel, AutoTokenizer, get_linear_schedule_with_warmup

def train_epoch(model, inputs, labels, batch_size, optimizer, scheduler, device):
    model.train()
    total_loss = 0
    num_batches = len(inputs) // batch_size  # renamed from `iter`, which shadows the built-in
    scaler = GradScaler()
    for i in range(num_batches):
        print(f"Iter {i + 1} of {num_batches}")
        # Move only the current batch to the GPU instead of the whole dataset
        inputs_batch = inputs[i * batch_size:(i + 1) * batch_size].to(device)
        labels_batch = labels[i * batch_size:(i + 1) * batch_size].to(device)
        optimizer.zero_grad()
        # Mixed-precision forward pass
        with autocast("cuda"):
            outputs = model(inputs_batch, labels=labels_batch)
            loss = outputs.loss
        scaler.scale(loss).backward()
        scaler.step(optimizer)  # scaler.step() already calls optimizer.step() internally
        scaler.update()
        scheduler.step()
        total_loss += loss.item()
    avg_loss = total_loss / num_batches
    print(f"Epoch average loss: {avg_loss}")

learning_rate = 5e-5
batch_size = 16
epochs = 100

model_name = 'GroNLP/gpt2-small-italian'
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

optimizer = AdamW(model.parameters(), lr=learning_rate)
# Total scheduler steps = optimizer steps per epoch * epochs (not batch_size * epochs)
total_steps = (len(train_dataset[0]) // batch_size) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

inputs = train_dataset[0]
labels = train_dataset[1]

for epoch in range(epochs):
    print(f"Epoch {epoch + 1} of {epochs}")
    # Reshuffle the data at the start of every epoch
    indices = torch.randperm(len(inputs))
    inputs = inputs[indices]
    labels = labels[indices]
    train_epoch(model, inputs, labels, batch_size, optimizer, scheduler, device)
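For context, the train_dataset tensors referenced above are built from the exported chats roughly like this (a simplified sketch: the file name is a placeholder and the real preprocessing differs in detail):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GroNLP/gpt2-small-italian")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no padding token by default

# "chats.txt" is a placeholder for the exported WhatsApp messages, one per line
with open("chats.txt", encoding="utf-8") as f:
    texts = [line.strip() for line in f if line.strip()]

encodings = tokenizer(
    texts,
    max_length=512,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)

# For causal LM fine-tuning the labels are simply the input ids
input_ids = encodings["input_ids"]
labels = input_ids.clone()
train_dataset = (input_ids, labels)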
The training process is taking an extraordinarily long time, hundreds of hours in total. I have tried various approaches to make sure the code actually uses my GPU, but without success. I have checked the compatibility of my PyTorch and TensorFlow versions with CUDA several times: CUDA is available, the versions match, and PyTorch recognizes my GPU. I can also see GPU memory being used during training, so the issue is probably in the code itself. Since this is my first project, the problem may be obvious to experienced users even though it is not to me.
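For reference, these are roughly the checks I ran to confirm that PyTorch can see the GPU (a minimal sketch of the verification, separate from the training script; the printed values are omitted here):

import torch

print(torch.__version__)                # installed PyTorch version
print(torch.version.cuda)               # CUDA version PyTorch was built against
print(torch.cuda.is_available())        # returns True on my machine
print(torch.cuda.get_device_name(0))    # reports my GPU
print(next(model.parameters()).device)  # confirms the model weights are on cuda:0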