I am working on a RAG app where I use LLMs to analyze various documents. I'm looking to improve the UX by streaming responses in real time.
Here's a snippet of my code:
from vllm import LLM, SamplingParams

params = SamplingParams(temperature=TEMPERATURE,
                        min_tokens=128,
                        max_tokens=1024)
llm = LLM(MODEL_NAME,
          tensor_parallel_size=4,
          dtype="half",
          gpu_memory_utilization=0.5,
          max_model_len=27_000)

message = SYSTEM_PROMPT + "\n\n" + f"Question: {question}\n\nDocument: {document}"
response = llm.generate(message, params)
In its current form, the generate method waits until the entire response is generated. I'd like to change this so that responses are streamed and displayed incrementally to the user, enhancing interactivity.
I was using vllm==0.5.0.post1
when I first wrote that code.
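From what I can tell, the offline LLM.generate call is blocking by design, and streaming seems to require switching to the async engine. Below is a rough, untested sketch of what I think it might look like with AsyncLLMEngine, reusing the same engine settings as above; the exact arguments and behavior on 0.5.0.post1 may well be off, so treat it as a guess rather than working code.

import asyncio
import uuid

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

# Same settings as the offline LLM above, but via the async engine
# (assumption: tensor_parallel_size > 1 works the same way here).
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model=MODEL_NAME,
                    tensor_parallel_size=4,
                    dtype="half",
                    gpu_memory_utilization=0.5,
                    max_model_len=27_000))

async def stream_answer(message: str, params: SamplingParams) -> None:
    request_id = str(uuid.uuid4())
    printed = 0
    # generate() should yield RequestOutput objects as tokens arrive;
    # outputs[0].text is (I believe) the cumulative text so far,
    # so only the newly generated suffix is printed each time.
    async for request_output in engine.generate(message, params, request_id):
        text = request_output.outputs[0].text
        print(text[printed:], end="", flush=True)
        printed = len(text)

asyncio.run(stream_answer(message, params))

Is this the right direction, or is there a simpler way to get incremental output?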
Does anyone have experience with implementing streaming for LLMs? Any guidance or examples would be appreciated!