I built a RAG app that answers user questions based on provided data. It works fine on a single GPU, but when I try to deploy it across multiple GPUs (4 T4s) I always get a CUDA out of memory error when the pipeline loads.
I also tried the "auto" keyword, but LangChain does not let me pass it as the device argument (see the second snippet below).
I use LangChain as the main framework; my code looks like this:
import torch
from langchain_huggingface import ChatHuggingFace, HuggingFacePipeline, HuggingFaceEmbeddings

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.3"

# LLM pipeline, currently pinned to a single GPU
pipe = HuggingFacePipeline.from_model_id(
    model_id=MODEL_NAME,
    device=0,
    model_kwargs={"torch_dtype": torch.float16},
    task="text-generation",
)
llm = ChatHuggingFace(llm=pipe)

# Embeddings on a second GPU
embedding = HuggingFaceEmbeddings(
    model_name=MODEL_NAME,
    model_kwargs={"device": "cuda:1"},
    multi_process=True,
)