While training a simple model, my processes crashed with RuntimeError: No grad accumulator for a saved leaf! raised inside loss.backward(), even though I made sure that all tensors needing gradients were placed on the GPUs.
def train_epoch(args, epoch, model, loss_fn, optim, dataloader, lr_scheduler=None, warmup_scheduler=None):
    model.train()
    dataloader.sampler.set_epoch(epoch)
    mae_m, loss_m = AverageMeter(), AverageMeter()
    calc_m, read_m = AverageMeter(), AverageMeter()
    timer = Timer()
    log_step = len(dataloader) // 11
    if args.local_rank == 0:
        args.writer.add_scalar('lr', optim.param_groups[0]['lr'], epoch)
    mae_list, pred_list = [], []
    for step, sample in enumerate(dataloader):
        # .requires_grad_() was added on GPT's advice (see below)
        data, label = sample['data'].cuda().requires_grad_(), sample['label'].cuda()
        read_m.add(timer.tiktok())
        optim.zero_grad()
        # (output, deep_output), attn = model(data)
        output = model(data)
        output = output.reshape(label.shape)
        # loss = loss_fn(output, label) + loss_fn(deep_output, label)
        loss = loss_fn(output, label)
        # retain_graph=True was added on GPT's advice (see below)
        loss.backward(retain_graph=True)
The training code is listed above; the error from one of the crashed processes is shown below:
[screenshot of the error traceback]
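For debugging, a check like the following can be dropped in right before loss.backward() to see what autograd makes of data; it only uses standard tensor attributes, so it is just a sanity-check sketch:

    # sanity check: where data lives, and whether autograd treats it as
    # a leaf tensor that tracks gradients (leaves with requires_grad=True
    # are the ones that get grad accumulators)
    print(f'rank={args.local_rank} step={step} device={data.device} '
          f'is_leaf={data.is_leaf} requires_grad={data.requires_grad}')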
I asked GPT for help and followed its advice: I added .requires_grad_() to data (which I think should be unnecessary) to make sure gradients get computed for it, and I added retain_graph=True to the loss.backward() call. It still didn't work.
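To be explicit, these are the only two lines that differ from my original loop; everything else in train_epoch is unchanged:

    # original version
    data, label = sample['data'].cuda(), sample['label'].cuda()
    loss.backward()

    # after GPT's suggestions
    data, label = sample['data'].cuda().requires_grad_(), sample['label'].cuda()
    loss.backward(retain_graph=True)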
How can I resolve this issue?