The situation is as follows:
- I am reranking result sets using the CrossEncoder model `bge-reranker-large`.
- For each result set, the app loads the model into memory (into system RAM when the device type is `cpu`, or into GPU VRAM when the device type is `cuda`).
- So for every request, it loads the same model multiple times just to perform the reranking.
Now the question: is it possible to load the model into GPU VRAM just once, if it is not already loaded, and reuse it for the reranking in all subsequent requests? This could substantially improve my app's performance and free up GPU VRAM for other tasks.