I have a Flask application that runs a RAG pipeline in the background and streams responses from vLLM. I’m currently using the function below to parse and yield the streamed JSON responses. However, the streamed output includes previously generated text rather than just the most recent content.
import json

def parse_json_stream(line):
    decoded_line = line.decode('utf-8')
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(decoded_line):
        try:
            # Decode one JSON object starting at pos; json_end is its length
            result, json_end = decoder.raw_decode(decoded_line[pos:])
            if "text" in result:
                print(result["text"])  # debugging
                if result["text"]:
                    yield result["text"][0].encode("utf-8")
            pos += json_end
        except json.JSONDecodeError:
            # If the JSON can't be decoded here, move on to the next character
            pos += 1
I want to ensure that only the most recent text from the LLM is yielded, excluding any previously generated text.
Here’s how I stream the response in Flask using this function:
import os

def generate():
    try:
        # We can use prompt_len to slice down the text, not using it right now
        prompt_len, response = process_and_respond(file_path, question)
        print("********* Generate Function **********")
        for line in response.iter_lines():
            if line:
                generator = parse_json_stream(line)
                for parsed_text in generator:
                    if parsed_text:
                        yield parsed_text
    finally:
        # Clean up the uploaded file once streaming ends
        print(f"Deleting File {file_path}")
        os.remove(file_path)
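
For reference, the behaviour I’m after could be sketched roughly as follows. This is only a sketch under assumptions that may not hold for every setup: it assumes each streamed line is a single JSON object, that result["text"][0] holds the cumulative text so far (prompt plus everything generated up to that point), and that any prefix to skip is measured in characters. The names iter_new_text, prefix_len and sent are hypothetical and not part of vLLM or Flask.

```python
import json

def iter_new_text(lines, prefix_len=0):
    """Yield only the text added since the previous chunk.

    Hypothetical helper. Assumes every chunk carries the cumulative
    text in payload["text"][0]; prefix_len is a character count to skip
    (for example, the prompt), which is an assumption about the input.
    """
    sent = prefix_len  # characters already emitted or deliberately skipped
    for line in lines:
        if not line:
            continue
        try:
            payload = json.loads(line.decode("utf-8"))
        except json.JSONDecodeError:
            continue  # ignore chunks that are not valid JSON
        texts = payload.get("text") if isinstance(payload, dict) else None
        if not texts:
            continue
        full = texts[0]
        delta = full[sent:]  # only the part not sent yet
        if delta:
            sent = len(full)
            yield delta.encode("utf-8")

# Example with made-up chunks that repeat the earlier text each time
chunks = [
    b'{"text": ["Q: hi A:"]}',
    b'{"text": ["Q: hi A: Hello"]}',
    b'{"text": ["Q: hi A: Hello there"]}',
]
for piece in iter_new_text(chunks, prefix_len=len("Q: hi A:")):
    print(piece)
```

Inside generate(), the inner loops could then be replaced by iterating over iter_new_text(response.iter_lines()), passing prefix_len only if the prompt length is known in characters rather than tokens.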