CUDA out of memory while using Llama3.1-8B for inference
I have written a simple Python script that uses the HuggingFace transformers library along with torch to run Llama3.1-8B-instruct purely for inference, after feeding in some long-ish bits of text (about 10k-20k tokens). It runs fine on my laptop, which has a GPU with 12 GB of VRAM but can apparently also access up to 28 GB total (I guess shared from main system RAM?).
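For context, here is a stripped-down sketch of the kind of script I mean, using the standard transformers API. The model id, file path, and generation settings below are stand-ins rather than my exact values:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in model id; assumes access to the gated Llama 3.1 weights on the Hub.
model_id = "meta-llama/Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half-precision weights (~16 GB for 8B params)
    device_map="auto",           # let accelerate place layers on the GPU (and spill to CPU)
)

# Placeholder for the long-ish document (10k-20k tokens in my real runs).
long_text = open("input.txt").read()

messages = [
    {"role": "user", "content": "Summarise the following text:\n\n" + long_text},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```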