Issue with the KV cache in streaming generation (Hugging Face transformers)
I have encountered an issue while using model.generate() for llama3 conversations. When I set use_cache=True, the model seems to keep the cache from each invocation: even when I do not include the previous conversation context in the new chat template, it still remembers the earlier content, and eventually this leads to a memory overflow. How can I clear the data left over by the last generate() call using other transformers functions?
In short, I need to clear the KV cache that I no longer need between calls.
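For reference, here is a minimal sketch of my generation loop together with the kind of reset I am hoping for. The model name is a placeholder, and passing a fresh DynamicCache plus the manual cleanup calls are my guesses, assuming a recent transformers version where generate() accepts a past_key_values cache object:

```python
import gc
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

# Placeholder checkpoint; I am actually using a llama3 chat model.
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires accelerate
)

messages = [{"role": "user", "content": "Hello!"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Pass a fresh, empty cache object so nothing from a previous call
# can be reused by this invocation (my assumption about the fix).
outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    use_cache=True,
    past_key_values=DynamicCache(),
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Drop all references to the outputs and try to release GPU memory.
del outputs
gc.collect()
torch.cuda.empty_cache()
```

Is this the right way to ensure each generate() call starts from an empty cache, or is there a dedicated transformers function for clearing it?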