I am designing a system where each user has their own LLM. When a client logs into the system, an idle worker loads a model and answers all chat requests from that client for the duration of the session. During this back-and-forth with the client, the worker is pinned to that client and should not receive any other requests.
After the user logs out, the session ends; I want to unload the model and put that worker back in the general pool, where it waits until the next client logs in and the cycle repeats. Another consideration is scalability: whatever solution I use needs to scale to thousands of worker processes/nodes.
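To make the lifecycle concrete, here is a bare-bones sketch of the behaviour I want from a single worker, independent of the messaging layer (which is the part I am asking about). `load_model`/`unload_model` are placeholders for whatever LLM runtime I end up using:

```python
def load_model():
    """Placeholder: load the LLM into memory (e.g. onto a GPU)."""
    return object()

def unload_model(model) -> None:
    """Placeholder: release the model's memory."""

class Worker:
    """Idle until pinned to a session, then serves only that session."""

    def __init__(self) -> None:
        self.model = None
        self.session_id = None

    @property
    def idle(self) -> bool:
        return self.session_id is None

    def begin_session(self, session_id: str) -> None:
        self.session_id = session_id
        self.model = load_model()      # pinned for the whole session

    def chat(self, prompt: str) -> str:
        assert self.session_id is not None, "worker is not pinned"
        return f"reply to {prompt!r}"  # placeholder inference

    def end_session(self) -> None:
        unload_model(self.model)       # logout: free the model
        self.model = None
        self.session_id = None         # back to the general pool
```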
How do I accomplish this?
I am using Python 3.10, RabbitMQ, and Celery. I thought about giving each worker its own queue, but that seems heavy-handed to me. I will have a unique session_id for each user, so could I use it somehow as a routing key? Perhaps after a specific worker loads the LLM, it tags itself with the session_id? The issue I see with this approach is workers prefetching tasks from the queue.
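For reference, here is a rough sketch of what I am imagining, using Celery's remote control commands to attach/detach a per-session queue (queue names like `session-<id>` and the idea of passing in an idle worker's hostname are my assumptions; how to pick that idle worker is exactly what I haven't solved):

```python
from celery import Celery

app = Celery("llm_chat", broker="amqp://localhost")

# Limit prefetching so a pinned worker never hoards tasks: with a
# multiplier of 1 plus late acks, it holds at most one unacked task.
app.conf.worker_prefetch_multiplier = 1
app.conf.task_acks_late = True

@app.task
def chat(session_id: str, prompt: str) -> str:
    """Runs on the pinned worker, which keeps the model in process memory."""
    ...

def pin_worker(session_id: str, worker_hostname: str) -> str:
    """On login: tell exactly one worker to consume a session-named queue."""
    queue = f"session-{session_id}"
    app.control.add_consumer(queue, destination=[worker_hostname])
    return queue

def release_worker(session_id: str, worker_hostname: str) -> None:
    """On logout: detach the worker so it returns to the general pool."""
    app.control.cancel_consumer(f"session-{session_id}",
                                destination=[worker_hostname])

# During the session, every chat request is routed by session:
# chat.apply_async(args=[session_id, prompt], queue=f"session-{session_id}")
```

Is something along these lines the right direction, or is there a more standard pattern for pinning workers and returning them to a pool at this scale?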