I’m developing FastAPI endpoints that rely on calls to OpenAI’s chat completions API. Some endpoints use asyncio.gather to make concurrent requests to OpenAI, so I’m worried about rate limits. I’m currently using an asyncio.Semaphore to cap concurrency, which keeps requests per minute under control, but I’m not sure how to handle tokens per minute.
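
Here’s roughly what the concurrent path looks like today (a trimmed sketch: the model name, the `MAX_CONCURRENT` value, and the `bounded_completion`/`handle_batch` helpers are illustrative, and I’m assuming the v1 `openai` async client):

```python
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI()

# Cap the number of in-flight requests; the value is illustrative.
MAX_CONCURRENT = 5
semaphore = asyncio.Semaphore(MAX_CONCURRENT)


async def bounded_completion(prompt: str) -> str:
    # Each coroutine waits for a slot before hitting the API,
    # so at most MAX_CONCURRENT requests run at once.
    async with semaphore:
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content


async def handle_batch(prompts: list[str]) -> list[str]:
    # asyncio.gather fires all coroutines; the semaphore throttles them.
    return await asyncio.gather(*(bounded_completion(p) for p in prompts))
```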
I’ve thought about keeping a global variable that holds the remaining token capacity, but I don’t think that would work: the app runs multiple workers, so each process would have its own copy of the counter, and I’m also worried about race conditions. Are there any design patterns that could help here? (FYI, I’d rather not bring in a database just to hold the current token capacity.)
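
For reference, this is roughly the per-process token bucket I had in mind; `TPM_LIMIT` is a placeholder, the token cost would come from something like tiktoken on the prompt, and the whole thing is a sketch of the idea I’m rejecting, not a working solution:

```python
import asyncio
import time

# Placeholder; the real limit depends on your OpenAI tier.
TPM_LIMIT = 90_000  # tokens per minute


class TokenBucket:
    """Naive per-process token budget, refilled continuously."""

    def __init__(self, tokens_per_minute: int) -> None:
        self.rate = tokens_per_minute / 60.0  # tokens refilled per second
        self.capacity = float(tokens_per_minute)
        self.tokens = self.capacity
        self.updated = time.monotonic()
        self._lock = asyncio.Lock()  # guards the counter between coroutines

    async def acquire(self, cost: int) -> None:
        # Block until the bucket has enough tokens for this request.
        while True:
            async with self._lock:
                now = time.monotonic()
                self.tokens = min(
                    self.capacity,
                    self.tokens + (now - self.updated) * self.rate,
                )
                self.updated = now
                if self.tokens >= cost:
                    self.tokens -= cost
                    return
                # Seconds until enough tokens have refilled.
                shortfall = (cost - self.tokens) / self.rate
            await asyncio.sleep(shortfall)


bucket = TokenBucket(TPM_LIMIT)


async def guarded_call(estimated_tokens: int) -> None:
    # estimated_tokens would come from counting the prompt beforehand.
    await bucket.acquire(estimated_tokens)
    # ... make the OpenAI call here ...
```

Within one event loop this seems fine (the asyncio.Lock keeps coroutines from racing on the counter), but each uvicorn worker is a separate process with its own bucket, so N workers would effectively allow N times the intended token budget. That cross-worker coordination is the part I don’t know how to solve without a shared store.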