I am using Google's Pub/Sub to move processing of some tasks (represented by messages on a Pub/Sub topic) to the background. Some tasks are expected to fail periodically due to known transient errors from an internal service that enforces a measure/minute type limit. This measure is not known or correlated to tasks, otherwise it would be much more straightforward to manage. These transient errors are retry-able. Google's documentation seems to indicate pull subscribers have more flow control than push subscribers.
How do I process such a queue with maximum throughput while staying resilient to transient failures? For example, I do not want to create my own bottleneck by processing 1 task at a time instead of 100 at once, just because I know that 1 task at a time will for the most part never trigger a transient error.
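For reference, this is a minimal sketch of what that client-side flow control looks like with the `google-cloud-pubsub` Python streaming pull subscriber. The project/subscription names and `process_task()` are placeholders I made up; `max_messages` is the knob that bounds how many messages are being processed concurrently, so it can be set to 100 (or whatever the internal service tolerates) rather than 1.

```python
# Sketch: streaming pull with client-side flow control, assuming the
# google-cloud-pubsub Python client and placeholder project/subscription names.
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-tasks-sub")

def process_task(message):
    """Placeholder for the real task handler."""
    print(f"processing {message.message_id}")

def callback(message):
    process_task(message)
    message.ack()

# Allow up to 100 messages to be leased and processed at once.
flow_control = pubsub_v1.types.FlowControl(max_messages=100)

streaming_pull_future = subscriber.subscribe(
    subscription_path, callback=callback, flow_control=flow_control
)
with subscriber:
    try:
        streaming_pull_future.result()
    except KeyboardInterrupt:
        streaming_pull_future.cancel()
```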
This is what I’m currently trying:
- List known transient errors, let's say `{A, B}`.
- Pull the next task `T1` and attempt to process it.
- If task `T1` returns error `A`, I stop processing the queue of tasks since I know the others will fail with the same timeout-based error. If task `T1` returns an error ∉ `{A, B}`, send it somewhere else for manual triage. If task `T1` succeeds, pull the next task.
- Force a time-delay on task `T1`.
- Try to process `T1` after the time-delay and, if it works, continue processing the queue (see the code sketch after this list).
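Here is a rough sketch of that flow as a subscriber callback. The exception types `TransientErrorA`/`TransientErrorB` (standing in for the known errors `{A, B}`) and the `process_task()`/`send_to_triage()` helpers are placeholders. `nack()` returns the message to Pub/Sub for redelivery, which stands in for the "time-delay then retry" step, and a shared `threading.Event` is one way to pause the whole worker during the cool-down:

```python
import threading

class TransientErrorA(Exception): ...   # placeholder for known transient error A
class TransientErrorB(Exception): ...   # placeholder for known transient error B

processing_allowed = threading.Event()
processing_allowed.set()  # start in the "allowed" state

def process_task(message):
    """Placeholder for the real task handler; raises the errors above on failure."""

def send_to_triage(message):
    """Placeholder: e.g. publish the message to a separate triage topic."""

def callback(message):
    processing_allowed.wait()   # block new work while a transient outage is active
    try:
        process_task(message)
    except (TransientErrorA, TransientErrorB):
        # Known transient error: give the message back to Pub/Sub for redelivery
        # and pause this worker for a cool-down window (the "time-delay" step).
        message.nack()
        processing_allowed.clear()
        threading.Timer(60.0, processing_allowed.set).start()
    except Exception:
        # Unknown error: ack so it is not retried forever, and route to manual triage.
        send_to_triage(message)
        message.ack()
    else:
        message.ack()
```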
I think at step 2 I can optimize by pulling/dequeueing more tasks instead of 1 at a time. It should be possible to maintain a queue in the code and push tasks back onto it on failure, etc.
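One way to do that batching is the synchronous pull API: lease a batch of messages, process them, ack the successes, and "push back" the failures by setting their ack deadline to 0 so Pub/Sub redelivers them. A sketch under the same kind of placeholder names as above:

```python
from google.cloud import pubsub_v1

class TransientError(Exception): ...    # placeholder for the known errors {A, B}

def process_task(message):
    """Placeholder for the real task handler."""

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-tasks-sub")

# Lease up to 100 messages in one synchronous pull.
response = subscriber.pull(
    request={"subscription": subscription_path, "max_messages": 100}
)

done, retry = [], []
for received in response.received_messages:
    try:
        process_task(received.message)
        done.append(received.ack_id)
    except TransientError:
        retry.append(received.ack_id)

if done:
    subscriber.acknowledge(
        request={"subscription": subscription_path, "ack_ids": done}
    )
if retry:
    # Setting the ack deadline to 0 is an immediate nack: Pub/Sub will redeliver
    # these messages later, which is the "push back on failure" part.
    subscriber.modify_ack_deadline(
        request={
            "subscription": subscription_path,
            "ack_ids": retry,
            "ack_deadline_seconds": 0,
        }
    )
```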
I am also thinking about how multiple copies of such a microservice could maintain shared state: when copies `M1`, `M2`, `M3` have been spawned and are running, if `M1` receives a timeout-based error, `M2` and `M3` should also stop processing, since all three copies go through the same internal service that enforces the measure/minute limit and share a single access token.
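One common way to share that "stop processing" signal is a short-lived flag in an external store acting as a circuit breaker. The sketch below uses Redis and a key name I made up; any shared store (Cloud Memorystore, Firestore, etc.) could play the same role. Whichever copy hits the timeout-based error sets the flag with a TTL, and every copy checks it before pulling more work:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

PAUSE_KEY = "task-worker:paused"   # hypothetical key shared by M1, M2, M3

def trip_circuit(cooldown_seconds: int = 60) -> None:
    # Called by whichever copy sees the timeout-based error; the TTL makes the
    # pause expire automatically so no copy has to remember to clear it.
    r.set(PAUSE_KEY, "1", ex=cooldown_seconds)

def can_process() -> bool:
    # Every copy checks this before pulling or processing the next batch.
    return not r.exists(PAUSE_KEY)
```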