I have a fine-tuned language model that I’m testing in a Streamlit application. While the model runs without issues in plain Google Colab, even with a batch size of 1, it fails with a CUDA Out of Memory error when I run it through Streamlit (also hosted on Colab). Here are the details:
The model was fine-tuned on a JSON file of texts so that it can analyse them:
import os
from unsloth import FastLanguageModel

max_seq_length = 2048   # example value; I use the same setting as during fine-tuning
dtype = None            # auto-detect
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = os.getenv("HF_TOKEN"),
)
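For completeness, this is roughly how I run inference in Colab (simplified; the real prompt and generation settings are longer, this just shows the shape of the call):

FastLanguageModel.for_inference(model)  # switch Unsloth to its faster inference mode

inputs = tokenizer(
    "Analyse the following text: ...",  # placeholder prompt
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 256)
print(tokenizer.decode(outputs[0], skip_special_tokens = True))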
Run directly in Colab this way, the model works and returns an answer. But when I use it through the Streamlit app (also running on Colab), the first page loads fine; as soon as I enter a text and press the analyse button, I get the CUDA Out of Memory error. What I have already tried:
1. Reducing the batch size to 1, which didn’t help.
2. Quantization: the model is already loaded in 4-bit.
3. Saving the model after training: the saved model works on Colab examples, yet gives the CUDA out of memory error when run through Streamlit.
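One thing I am not sure about is Streamlit’s rerun behaviour: the whole script re-executes on every interaction, so if the model load is not cached, pressing the button may load a second copy of the 8B model onto the GPU. A minimal sketch of the caching I’m considering (st.cache_resource exists in recent Streamlit versions; the other names are as above):

import os
import streamlit as st
from unsloth import FastLanguageModel

@st.cache_resource
def load_model():
    # Loaded once and cached across Streamlit reruns, so each button
    # press reuses the same model instead of loading it again.
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/Meta-Llama-3.1-8B-bnb-4bit",
        max_seq_length = 2048,   # example value
        dtype = None,
        load_in_4bit = True,
        token = os.getenv("HF_TOKEN"),
    )
    FastLanguageModel.for_inference(model)
    return model, tokenizer

model, tokenizer = load_model()

Would this kind of caching be the right direction, or is something else likely consuming the memory?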
I am looking for suggestions on how to resolve the CUDA memory issues when running this model in Streamlit on Colab. Any insights on memory management or configuration tweaks that could help mitigate this issue would be greatly appreciated.