We have this architecture:
queue -> message processor (horizontal scaling) -> RDBMS
Sometimes external systems dump 10k messages onto the queue, and the message processor of course dutifully chews through them and dumps them into the db. But the db can't write as fast as the processor can read/transform, which drives CPU usage on the db up and leads to all sorts of problems.
We have a few options:
- Slow down the message processor. I don't like this because it sets a hard limit on how quickly the system can process messages.
- Scale up the db. Workable, but it's a fixed cost for what is an intermittent problem.
- Cap the horizontal scaling of the MP, e.g. you may scale, but never beyond 3 instances. This limits total throughput even when the db is under light load.
- Set the number of retries on messages to something very high, like 100 (currently 6).
I wonder if there's a 4th option: ask the db how swamped it is, and hold off on attempting the write until the db is less busy (or put the message back on the queue if a set time passes and the db is still busy). Ideally I'm looking for something that adapts to pressure on the db writer, instead of setting values ahead of time and hoping they work. With the less adaptive setup we end up with messages in the DLQ, and that requires manual intervention.
Assume for the sake of argument that we can't (and don't need to) change the indexes in the database: there's no optimization left to squeeze out of the queries and indexes, and scaling the db is not something that can be automated.
Has anyone done anything like this?