I’m currently running the Llama 3.1 8B model (llama3.1:8b) using the Ollama Docker container. My context window has the following structure (a rough request sketch follows the list):
- Bot Personality
- Bot Directives
- Conversation (an array of messages)
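For context, this is roughly how I map that structure onto a request. It’s a simplified sketch in Python; the helper names (`build_messages`, `chat`) are just illustrative, but the `/api/chat` endpoint and message format are Ollama’s standard chat API:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"  # port published by the docker run command below

def build_messages(personality: str, directives: str, conversation: list[dict]) -> list[dict]:
    # Personality and directives go into one fixed system message;
    # the conversation is the rolling list of {"role": ..., "content": ...} turns.
    return [{"role": "system", "content": f"{personality}\n\n{directives}"}] + conversation

def chat(personality: str, directives: str, conversation: list[dict]) -> str:
    payload = {
        "model": "llama3.1:8b",
        "messages": build_messages(personality, directives, conversation),
        "stream": False,
    }
    response = requests.post(OLLAMA_URL, json=payload, timeout=120)
    response.raise_for_status()
    return response.json()["message"]["content"]
```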
I have a procedure that checks the total number of characters/tokens and trims the conversation (starting from the oldest messages) to fit within the context window. From what I’ve found online, the context window for Llama 3.1 8B is 128k tokens, which at roughly 4 characters per token works out to around 512,000 characters: more than enough for my use case.
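Simplified, the trimming procedure looks roughly like this (a sketch only; `trim_conversation` and the 512,000-character budget reflect my 4-characters-per-token approximation):

```python
def trim_conversation(conversation: list[dict], personality: str, directives: str,
                      max_chars: int = 512_000) -> list[dict]:
    # Reserve room for the fixed personality/directives, then drop the oldest
    # conversation messages until what remains fits the character budget.
    budget = max_chars - len(personality) - len(directives)
    trimmed = list(conversation)
    while trimmed and sum(len(m["content"]) for m in trimmed) > budget:
        trimmed.pop(0)  # drop the oldest message first
    return trimmed
```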
However, I’m running into an issue where, after a certain number of messages, my bot starts losing its personality and directives. I suspect that Ollama is truncating part of the context from the top. When I asked the bot itself, it claimed a 2,048-character limit per message and said it trims from the top when that limit is exceeded (which would explain the problem).
Here are my questions:
- Isn’t Ollama stateless, and therefore supposed to receive the full context with every request?
- If I need to set Ollama to track the conversation state and just send the new messages, how do I do it?
- How can I ensure that the personality and directives stay fixed while only the conversation part gets trimmed?
- Am I missing something in my setup or approach?
For reference, here’s how I’m starting my Ollama instance:
docker run -it --rm --gpus=all -v /LLM/model/ollama:/root/.ollama:z -p 11434:11434 --name ollama ollama/ollama
docker exec -it ollama ollama pull llama3.1:8b