I have a chatbot web app with the following components:
- a frontend
- a FastAPI back-end that handles requests
- a vLLM api_server running a local Llama model on an H100.
The architecture is as follows:
- The front-end queries the back-end.
- The back-end does a bit of preprocessing, then queries the vLLM server with the stream option enabled.
- The back-end then listens to vLLM's streaming token responses and streams them back to the front-end using FastAPI's StreamingResponse (roughly as in the sketch after this list).
- The front-end displays tokens as they are generated.
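For reference, the back-end endpoint that streams to the front-end looks roughly like this (a sketch: the /chat path and the newline-delimited JSON framing are placeholders, and call_infer_llm is the async generator shown further down):

```python
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/chat")  # placeholder path
async def chat(request: LlmRequestModel):
    async def event_stream():
        # Re-serialize each token dict so the front-end can parse it as it arrives
        async for chunk in call_infer_llm(request):
            yield json.dumps(chunk) + "\n"
    return StreamingResponse(event_stream(), media_type="application/x-ndjson")
```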
Everything works fine with one request at a time. The problems start when there are multiple concurrent requests.
The FastAPI back-end is fully asynchronous, so it should be able to handle concurrent requests. The problem is that as soon as it starts listening to vLLM, the stream of tokens arrives so fast that it never gives resources back to handle other incoming requests. The second request is only sent to vLLM once the first one has finished, and that can take a long time.
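To make the symptom concrete, firing two streaming requests at the back-end at the same time shows that the second stream only starts arriving after the first one has finished. A minimal reproduction sketch (the /chat path, port, and payload fields are assumptions matching the sketch above):

```python
import asyncio
import httpx

async def stream_one(label: str, prompt: str):
    # Stream one chat request through the FastAPI back-end and log chunk arrivals.
    payload = {"messages": [{"role": "user", "content": prompt}],
               "temperature": 0.7, "top_p": 0.9}
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", "http://localhost:8000/chat", json=payload) as resp:
            async for chunk in resp.aiter_bytes():
                print(label, "received", len(chunk), "bytes")

async def main():
    # With the current back-end, "req-2" only starts printing once "req-1" is done.
    await asyncio.gather(stream_one("req-1", "Tell me a joke"),
                         stream_one("req-2", "Summarize this article"))

asyncio.run(main())
```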
This is how I query and listen to vLLM's streaming answer:
import asyncio
import json

import httpx

async def call_infer_llm(request: LlmRequestModel):
    # Build an OpenAI-compatible chat completion payload for the vLLM server
    data = {
        "model": "/usr/Workplace/models/llama3-8b-Instruct/",
        "messages": request.messages,
        "temperature": request.temperature,
        "top_p": request.top_p,
        "stream": True,
    }
    async with httpx.AsyncClient() as client:
        async with client.stream('POST', URL + '/v1/chat/completions', data=json.dumps(data)) as resp:
            # Forward each streamed SSE chunk to the caller as it arrives
            async for r in resp.aiter_bytes():
                text = r.decode('utf-8')
                # asyncio.sleep(0.05)  # attempted workaround, see below
                new_tokens = json.loads(text.split('data:')[1])
                yield {'tokens': new_tokens['choices'][0]['delta'].get('content', '')}
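(Side note: I know that splitting raw byte chunks on 'data:' is fragile if a chunk carries several SSE events or the final [DONE] sentinel. A more line-oriented variant would look roughly like the sketch below, but as far as I can tell that is orthogonal to the concurrency problem:)

```python
import json

import httpx

async def call_infer_llm_lines(request: LlmRequestModel):
    # Same request as above, but consuming the SSE stream line by line (sketch, not deployed).
    data = {
        "model": "/usr/Workplace/models/llama3-8b-Instruct/",
        "messages": request.messages,
        "temperature": request.temperature,
        "top_p": request.top_p,
        "stream": True,
    }
    async with httpx.AsyncClient() as client:
        async with client.stream('POST', URL + '/v1/chat/completions', json=data) as resp:
            async for line in resp.aiter_lines():
                if not line.startswith("data:"):
                    continue  # skip blank keep-alive lines between events
                payload = line[len("data:"):].strip()
                if payload == "[DONE]":  # end-of-stream sentinel from the OpenAI-style API
                    break
                event = json.loads(payload)
                yield {'tokens': event['choices'][0]['delta'].get('content', '')}
```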
I know this is not a vLLM issue because, while a request is running through the web app, I can easily send tens of requests to the vLLM endpoint with curl and get answers back quickly. So the bottleneck must come from FastAPI and its StreamingResponse.
I wonder how I could get around this bottleneck. Here is what I tried so far, without success:
- adding an asyncio.sleep(0.05) in the back-end listening loop to force it to give resources back to other requests (see the toy sketch below).
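For clarity, the effect the sleep was meant to have is that every iteration hands control back to the event loop so that two streams can interleave. A toy illustration of that intent (note that asyncio.sleep only does this when it is awaited):

```python
import asyncio

async def fake_stream(label: str, n_chunks: int):
    # Stand-in for the vLLM listening loop: the awaited sleep yields control
    # to the event loop between chunks, so other tasks get scheduled.
    for i in range(n_chunks):
        await asyncio.sleep(0.05)
        print(f"{label}: chunk {i}")

async def main():
    # The two "streams" interleave instead of running one after the other.
    await asyncio.gather(fake_stream("req-1", 5), fake_stream("req-2", 5))

asyncio.run(main())
```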
Options that I don't like but might end up trying:
- Modifying the vLLM api_server.py file so that it sends batches of generated tokens, to lower the streaming rate.
- Making the front-end query vLLM directly (and query the back-end before/after for preprocessing and postprocessing).