The Setup
I have two Python GCP Cloud Run Gen2 functions. One is an orchestrator function that receives the initial HTTP request, breaks it into 15 smaller chunks, and invokes the second function once per chunk (so each request to the orchestrator results in 15 requests to the second function). Each instance of the second function is responsible for downloading and processing a file of around 250-400MB, producing a response of 100-300KB.
The reason for this architecture is that network bandwidth is limited on a per-instance basis, so downloading all 15 files concurrently from one instance is considerably slower; hence the delegation of downloads to multiple instances of a separate function.
The orchestrator function has 1/6 CPU, 512MB memory and a concurrency of 1. The download & process function has 1 CPU, 1GB memory and a concurrency of 1. Both functions sit in us-central1. The functions are written and deployed using the Firebase wrapper for Google Cloud Functions.
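For context, the deployment configuration is roughly the following when using the Firebase Python SDK (a simplified sketch; the handler names `orchestrate` and `download_and_process` are placeholders, not my actual code):

```python
from firebase_functions import https_fn, options

# Orchestrator: 512MB memory, concurrency 1 (this deployment runs with the 1/6 CPU mentioned above).
@https_fn.on_request(region="us-central1", memory=options.MemoryOption.MB_512, concurrency=1)
def orchestrate(req: https_fn.Request) -> https_fn.Response:
    ...  # split the request into 15 chunks and fan them out to the second function


# Download & process: 1 full CPU, 1GB memory, concurrency 1.
@https_fn.on_request(region="us-central1", memory=options.MemoryOption.GB_1, cpu=1, concurrency=1)
def download_and_process(req: https_fn.Request) -> https_fn.Response:
    ...  # download a 250-400MB file, process it, return a 100-300KB response
```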
The Issue
According to logs from the Cloud Console, each instance of the download & process function usually completes its task in anywhere from 5 to 20 seconds.
However, on the receiving end, the orchestrator function that made those requests often logs those same requests as taking 1.5-2 times as long.
Given that I am making only 15 requests, I would not expect such a large discrepancy between the response times logged within the download & process function (plus some leeway for the round trip) and the response times the orchestrator actually observes.
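To be concrete, the response times on the orchestrator side are wall-clock measurements taken around each outgoing request, roughly like this (simplified; WORKER_URL and the chunk shape are placeholders):

```python
import time
import aiohttp

WORKER_URL = "https://download-and-process-xxxxx-uc.a.run.app"  # placeholder URL

async def timed_fetch(session: aiohttp.ClientSession, chunk: dict) -> bytes:
    # Wall-clock duration as seen by the orchestrator, to compare against the
    # 5-20 seconds the download & process function logs for the same request.
    start = time.monotonic()
    async with session.post(WORKER_URL, json=chunk) as resp:
        body = await resp.read()
    print(f"chunk {chunk['id']}: {time.monotonic() - start:.1f}s end-to-end")
    return body
```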
The Considerations
On the orchestrator function my CPU usage sits at 10-25% and memory utilization at ~40%.
On the download & process function my CPU usage sits at 50-55% and memory utilization at 45-100%. I know 100% sounds rough and is something one could point at as the cause, but I want to note that the unexpected behavior described in the section above happens consistently even with requests that only utilize 50% of the memory allocation.
As is evident from the logs within the download & process function, all 15 requests arrive at virtually the same time, so the issue is not the orchestrator function throttling the sending of those requests.
What I have tried
Bumping resources
I have tried doubling, quadrupling and octupling the resources available to instances of each of the functions and did not see any improvement with regard to the issue described.
Refactoring my code
I am using Python's asyncio and aiohttp to make the concurrent requests from the orchestrator function to the download & process function. I have tried different setups for making those requests: one ClientSession for all of the requests (1 client session in total), one ClientSession for each request (15 client sessions in total) and one ClientSession for every 3 requests (5 client sessions in total). No approach has provided a significant benefit over the others. Currently I am using the 5-client-session setup.
I have also tried increasing the read_bufsize of each ClientSession from the default 64KB to 512KB and 1MB so that the client sessions can read the response body in one go. I can afford to do so because memory is not a bottleneck on the orchestrator function, which only sits at ~40% usage.
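Concretely, the current fan-out looks roughly like this (a simplified sketch; WORKER_URL and the chunk payloads are placeholders):

```python
import asyncio
import aiohttp

WORKER_URL = "https://download-and-process-xxxxx-uc.a.run.app"  # placeholder URL

async def fetch(session: aiohttp.ClientSession, chunk: dict) -> bytes:
    async with session.post(WORKER_URL, json=chunk) as resp:
        return await resp.read()

async def fan_out(chunks: list[dict]) -> list[bytes]:
    # 5 sessions, 3 requests each; read_bufsize raised from the 64KB default so
    # a 100-300KB response body fits into the read buffer in one go.
    sessions = [aiohttp.ClientSession(read_bufsize=1024 * 1024) for _ in range(5)]
    try:
        tasks = [fetch(sessions[i // 3], chunk) for i, chunk in enumerate(chunks)]
        return await asyncio.gather(*tasks)
    finally:
        await asyncio.gather(*(session.close() for session in sessions))
```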
Optimizing response sizes
Initially, the download & process function sent its responses as JSON, and those responses were 3-4 times the size of the ones it sends right now. The reduction came from two efforts. First, optimizing the data sent in the responses by using ints instead of floats, which cut the response sizes by a factor of 1.2-1.5. Second, switching from JSON to Protobuf, which cut the response sizes by a further factor of 2-3.
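For illustration, the kind of change involved looks roughly like this (a sketch; the ChunkResult message and its generated chunk_result_pb2 module are placeholders, not my actual schema):

```python
import json

# Placeholder for generated Protobuf code, assuming a message along the lines of:
#   message ChunkResult { repeated sint32 values = 1; }
import chunk_result_pb2

# Floats scaled to ints (keeping two decimal places of precision).
values = [int(round(v * 100)) for v in (12.34, 56.78, 90.12)]

json_payload = json.dumps({"values": values}).encode("utf-8")

msg = chunk_result_pb2.ChunkResult()
msg.values.extend(values)
proto_payload = msg.SerializeToString()

print(f"{len(json_payload)} bytes as JSON vs {len(proto_payload)} bytes as Protobuf")
```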
Using other languages
I have tried rewriting the orchestrator function in two other languages: first in Rust compiled to Python bindings, which did not help; then in pure Go, to see whether the issue lay with Python's performance, which still did not help.
Testing locally
I have tried running the orchestrator function locally, meaning the instance sits on my machine and makes requests to the download & process function in Google Cloud. This setup shows the same issue, which makes me want to rule out slow ingestion of responses by the orchestrator instance in the cloud, since whatever resource or bandwidth limitations it would have had in the cloud do not apply to my local machine.
Conclusion
I am at a loss with this issue and have no ideas on how to trace it further. Given everything I have tried, I am currently leaning toward suspecting that this may be some peculiarity of how GCP Cloud Run functions send responses after the instances finish processing them. Perhaps there is some sort of queue that throttles the sending of responses from the 15 instances, which would explain why the first responses arrive as expected and then some arrive with the aforementioned delay?
Any input would be greatly appreciated as I am grasping at straws at the moment.