I’m trying to train a fairly standard machine translation transformer model using PyTorch. It’s based on the “Attention is All You Need” paper. When I ran it on my PC with standard hyperparameters and a batch size of 128 segments (pairs of source and target language sentences), it worked fine but was slow, as expected.
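For context, the model is essentially torch.nn.Transformer with the base hyperparameters from the paper; the sketch below is simplified, and the vocabulary sizes are placeholders rather than my real values:

```python
import torch.nn as nn

# Simplified sketch of my model setup ("base" hyperparameters from the paper).
# Positional encoding and embedding scaling are omitted for brevity.
SRC_VOCAB_SIZE = 32000   # placeholder
TGT_VOCAB_SIZE = 32000   # placeholder
D_MODEL = 512

src_embed = nn.Embedding(SRC_VOCAB_SIZE, D_MODEL)
tgt_embed = nn.Embedding(TGT_VOCAB_SIZE, D_MODEL)
transformer = nn.Transformer(
    d_model=D_MODEL,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048,
    dropout=0.1,
    batch_first=True,   # batches are (batch, seq_len, d_model)
)
generator = nn.Linear(D_MODEL, TGT_VOCAB_SIZE)  # projects decoder output to the target vocabulary
```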
Now I’m running it on an AWS p2.xlarge instance with a Tesla K80 GPU, and training crashes almost immediately with a GPU out-of-memory error. I’ve tried everything I can think of to free GPU memory, but I’ve still had to drop the batch size to 8, which is obviously inefficient for training.
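To be concrete about “everything”: between batches (and when catching an OOM) I’ve been doing roughly the cleanup below, without much effect. free_gpu_memory is just my own helper name; the calls are the standard torch.cuda utilities.

```python
import gc
import torch

def free_gpu_memory():
    # Cleanup I run after every batch and when an OOM is caught.
    gc.collect()                  # drop Python-side references first
    torch.cuda.empty_cache()      # return cached blocks to the CUDA driver
    print(torch.cuda.memory_summary(abbreviated=True))  # inspect what is still allocated
```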
Even with a batch size of 8, I occasionally get this error message:
```
  File "C:\Projects\MT004\.venv\Lib\site-packages\torch\autograd\graph.py", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.95 GiB. GPU
```
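The crash happens in the backward pass of an otherwise standard teacher-forced training step. This is a simplified sketch of that step (the train_step signature is just how I’ve factored it for this post, not my exact code); the OOM is raised in loss.backward():

```python
import torch.nn as nn

def train_step(transformer, src_embed, tgt_embed, generator,
               criterion, optimizer, src, tgt, device="cuda"):
    # src and tgt are padded LongTensors of shape (batch, seq_len).
    src, tgt = src.to(device), tgt.to(device)
    tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]     # teacher forcing: shift target by one
    tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_in.size(1)).to(device)
    out = transformer(src_embed(src), tgt_embed(tgt_in), tgt_mask=tgt_mask)
    logits = generator(out)                       # (batch, seq_len, tgt_vocab)
    loss = criterion(logits.reshape(-1, logits.size(-1)), tgt_out.reshape(-1))
    optimizer.zero_grad()
    loss.backward()   # <- this is where torch.cuda.OutOfMemoryError is raised
    optimizer.step()
    return loss.item()
```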
I’ve tried both spaCy’s tokenizer and the XLM-R tokenizer. With the XLM-R tokenizer, I can only use a batch size of 2, and even then it sometimes crashes.
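For reference, this is roughly how I load and use the two tokenizers (simplified; the “xlm-roberta-base” checkpoint name and the spaCy language code here are just examples):

```python
from transformers import AutoTokenizer
import spacy

xlmr_tok = AutoTokenizer.from_pretrained("xlm-roberta-base")  # example checkpoint for the XLM-R vocab
spacy_tok = spacy.blank("en")   # blank pipeline = tokenizer only; language code is an example

def encode_xlmr(sentences, max_len=128):
    # Pads to the longest sentence in the batch and truncates anything over max_len.
    return xlmr_tok(sentences, padding=True, truncation=True,
                    max_length=max_len, return_tensors="pt")

def tokenize_spacy(sentence):
    # With spaCy I only take the surface tokens and build my own vocabulary on top.
    return [tok.text for tok in spacy_tok(sentence)]
```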
Unfortunately, I can’t move to a larger instance type, since I don’t have enough EC2 quota.
Any idea what I might be doing wrong? Any suggestions on how to optimize things?