My question might be a bit basic, but I’m new to all of this and eager to learn.
I have built an app with FastAPI. Previously I used asyncio to handle multiple requests to the LLM, but with each new request the responses became slower. So I decided to switch to vLLM, but now I have a problem: how do I serve parallel or concurrent requests to vLLM when I am dealing with a dozen or more users? Is there a way to run it in parallel for several inputs and receive a valid result for each input?
I have a basic setup where I initialize an LLM using vLLM with LangChain RAG and the Mixtral model (specifically, Mixtral 8x7B). Here’s what I do:
I define a system prompt and an instruction.
I create a RAG system.
I then run the RAG system, which works for a single input, and when I tested it with Locust it worked consistently. But I want it to run in parallel, or at least concurrently.
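Here is a trimmed-down sketch of what my setup roughly looks like (the real system prompt, documents, and retriever configuration are omitted and replaced with placeholders, so treat the names and parameter values as approximate):

```python
from fastapi import FastAPI
from pydantic import BaseModel

from langchain_community.llms import VLLM
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain.chains import RetrievalQA

# Mixtral 8x7B loaded in-process through vLLM
llm = VLLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    trust_remote_code=True,
    max_new_tokens=512,
    temperature=0.1,
)

# Placeholder documents/retriever; the real vector store is built from my own data
docs = [Document(page_content="example document text")]
embeddings = HuggingFaceEmbeddings()
retriever = FAISS.from_documents(docs, embeddings).as_retriever()

# The RAG chain that works fine for a single request
rag_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever, chain_type="stuff")

app = FastAPI()

class Query(BaseModel):
    question: str

@app.post("/ask")
def ask(query: Query):
    # Each call runs the whole chain; requests are effectively handled one at a time
    result = rag_chain.invoke({"query": query.question})
    return {"answer": result["result"]}
```

Running this for a single user works fine; my question is how to make the vLLM-backed chain handle many such requests concurrently instead of one after another.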