I have tasks that each take about 3 minutes to complete. Cloud Run offers 600 seconds as a maximum timeout, so that is not an issue. We decided to use Cloud Run + Pub/Sub with a push subscription.
Workers (I have 4 Cloud Run instances due to 3rd-party library rate limits) expose an endpoint that Pub/Sub calls through the push subscription. However, we are seeing issues when many messages are published at once. Imagine this scenario:
- 100 messages are published at once.
- The 4 workers start running with 4 of the messages.
The remaining 96 messages get 429/500 responses because there are no more instances to process them (“The request was aborted because there was no available instance. Additional troubleshooting documentation can be found at: https://cloud.google.com/run/docs/troubleshooting#abort-request”). Depending on the retry min/max backoff values, they will be pushed to the workers again. Let’s assume that:
- The retry interval is 4 minutes.
- Each message gets 5 delivery attempts before it is removed from the topic.
After 4 minutes, 92 messages fail their second delivery attempt. Depending on how many retries you set in the push subscription settings, the messages will eventually drop out of the topic. In the scenario I am describing, the 4 workers need 75 minutes (25 rounds of 3 minutes) to process all 100 messages; however, within 20 minutes (5 attempts × 4 minutes) there won’t be any messages left in the topic to process.
The maximum wait time before retrying is 10 minutes, and the maximum number of delivery attempts is 100. Even at those limits, in my scenario there is no way to process the messages on the topic if the backlog is more than 300.
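For reference, those two limits correspond to the retry policy and dead-letter policy on the subscription itself. A minimal sketch of where they live, assuming the Python client and placeholder project/topic/endpoint names:

```python
from google.cloud import pubsub_v1
from google.protobuf import duration_pb2

project_id = "my-project"                  # placeholder
topic_id = "work-topic"                    # placeholder
subscription_id = "work-push-sub"          # placeholder
dead_letter_topic_id = "work-dead-letter"  # placeholder
push_endpoint = "https://worker-xxxx-uc.a.run.app/pubsub"  # placeholder Cloud Run URL

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

with subscriber:
    subscriber.create_subscription(
        request={
            "name": subscriber.subscription_path(project_id, subscription_id),
            "topic": publisher.topic_path(project_id, topic_id),
            "push_config": pubsub_v1.types.PushConfig(push_endpoint=push_endpoint),
            # 600 s is the maximum acknowledgement deadline, i.e. the longest a
            # single push delivery can run before Pub/Sub considers it failed.
            "ack_deadline_seconds": 600,
            # Exponential backoff between these bounds; 600 s is the largest
            # maximum_backoff Pub/Sub accepts.
            "retry_policy": pubsub_v1.types.RetryPolicy(
                minimum_backoff=duration_pb2.Duration(seconds=10),
                maximum_backoff=duration_pb2.Duration(seconds=600),
            ),
            # Messages are forwarded to the dead-letter topic after this many
            # attempts; 100 is the cap.
            "dead_letter_policy": pubsub_v1.types.DeadLetterPolicy(
                dead_letter_topic=publisher.topic_path(project_id, dead_letter_topic_id),
                max_delivery_attempts=100,
            ),
        }
    )
```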
So, what I need is flow control, and only pull subscriptions have flow control (more here). Push subscriptions have push backoff, which is not useful for tasks that need more than 60 seconds to complete.
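To make it concrete, this is the kind of flow control I mean: a minimal sketch assuming a streaming-pull subscriber with the Python client; the names and the process() helper are placeholders:

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "work-pull-sub")  # placeholders

def process(data: bytes) -> None:
    """Placeholder for the ~3 minute 3rd-party call."""

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    process(message.data)
    message.ack()

# Never hold more than 4 messages in flight, matching the 3rd-party rate limit;
# the client library keeps extending the ack deadline while the callback runs.
flow_control = pubsub_v1.types.FlowControl(max_messages=4)

streaming_pull_future = subscriber.subscribe(
    subscription_path, callback=callback, flow_control=flow_control
)
with subscriber:
    try:
        # Block the main thread while messages are handled in the background.
        streaming_pull_future.result()
    except KeyboardInterrupt:
        streaming_pull_future.cancel()
        streaming_pull_future.result()
```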
I am a bit confused here.
- I’d like to use Cloud Run for scalability + cost efficiency.
- I don’t want to do a convoluted implementation like having two subscriptions: a push subscription that wakes up the Cloud Run instances, and a second, pull subscription inside Cloud Run where I can control the load. This is not a good solution for many reasons, and it also doesn’t help because the total timeout is only 10 minutes (in my case, enough to process only 3 messages). I believe this approach could be more fruitful if I switched to Cloud Tasks; at least then I could process many more messages (rough sketch below).
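Here is a rough sketch of that Cloud Tasks variant, assuming an HTTP-target queue and placeholder names:

```python
import json

from google.cloud import tasks_v2
from google.protobuf import duration_pb2

client = tasks_v2.CloudTasksClient()
parent = client.queue_path("my-project", "us-central1", "work-queue")  # placeholders

def enqueue(payload: dict) -> None:
    task = {
        "http_request": {
            "http_method": tasks_v2.HttpMethod.POST,
            "url": "https://worker-xxxx-uc.a.run.app/tasks",  # placeholder Cloud Run URL
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(payload).encode(),
        },
        # Cloud Tasks allows up to 30 minutes per dispatch attempt, well above
        # the ~3 minute task duration (vs. the 10 minute cap on push subscriptions).
        "dispatch_deadline": duration_pb2.Duration(seconds=1800),
    }
    client.create_task(request={"parent": parent, "task": task})
```

The 4-worker limit would then, I think, be enforced through the queue’s rate limits (max_concurrent_dispatches) instead of the subscription.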
Another option is to spin up more workers with my own script by watching the message load, but that feels like bringing a bazooka to a fist fight.
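For completeness, that script would boil down to something like reading the backlog from Cloud Monitoring and deciding how many workers to start; a very rough sketch with placeholder names:

```python
import time

from google.cloud import monitoring_v3

project_id = "my-project"  # placeholder

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 300}}
)

results = client.list_time_series(
    request={
        "name": f"projects/{project_id}",
        "filter": (
            'metric.type="pubsub.googleapis.com/subscription/num_undelivered_messages" '
            'AND resource.labels.subscription_id="work-push-sub"'  # placeholder
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

# Peak backlog over the last 5 minutes; deciding how many workers to start
# from this number is the part I'd rather not maintain myself.
backlog = max(
    (point.value.int64_value for series in results for point in series.points),
    default=0,
)
print(f"undelivered messages: {backlog}")
```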
What is the strategy for this kind of scenario? I thought this would be quite an easy task for Cloud Run + Pub/Sub, but I am a bit disappointed. Am I missing something here?