Relatively new to GCP here. We've deployed our production API on Google Cloud Run with 4 vCPUs, up to 25 instances, and a minimum of 2 instances. Most of the API works fine, just as it did in testing, but while testing the deployment we found a weird issue: a single API call blocks requests from all other users.
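For reference, the deployment corresponds roughly to the following command (reconstructed from the settings above; the service name, image, and region are placeholders, not our actual values):

```shell
# Hypothetical deploy command matching the settings described above;
# service name, image path, and region are placeholders.
gcloud run deploy my-api \
  --image gcr.io/my-project/my-api \
  --region europe-west1 \
  --cpu 4 \
  --min-instances 2 \
  --max-instances 25
```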
All of us were sending requests together on the website, reloading pages etc. to simulate a heavy workload. The API kept up with it well enough, but we suddenly started getting infinite reloads. Upon inspection, we found that we were getting:
HTTP 504
The request has been terminated because it has reached the maximum request timeout.
This was occurring for all requests, irrespective of their load on the (Postgres Cloud SQL) database, whenever one specific route was running. For debugging, we inspected the database logs, but found that the DB was under barely any load and no locks had been detected. Similarly, the Cloud Run container stats showed minimal utilisation.
Admittedly, this is one of our heaviest routes: it takes about 1.5 minutes to run, with multiple external API calls and some data processing, and those calls are synchronous in nature. The route involves:
- Scraping 2 websites with BeautifulSoup
- Using spaCy to rank text
- An LLM call with Instruct
- spaCy ranking of the returned response
- 2 more LLM calls
While the route itself is declared asynchronous, all 5 of the above operations are synchronous calls processed sequentially.
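To illustrate what we suspect is happening, here is a minimal, self-contained sketch (pure asyncio rather than our actual FastAPI route; the function names are placeholders) showing that a synchronous call inside a coroutine stalls every other coroutine sharing the same event loop:

```python
import asyncio
import time

async def heavy_route():
    # Synchronous call inside an async function: the event loop is
    # blocked for the whole duration, so no other coroutine can run.
    time.sleep(0.2)

async def light_route(start):
    # Record how long this "instant" request waited before it could run.
    return time.perf_counter() - start

async def main():
    start = time.perf_counter()
    # light_route needs no real work, yet it cannot even start
    # until heavy_route releases the event loop.
    _, light_delay = await asyncio.gather(heavy_route(), light_route(start))
    return light_delay

light_delay = asyncio.run(main())
print(f"light request waited {light_delay:.2f}s")  # ~0.2s: stalled by the blocking call
```

This matches the symptom we saw: one long synchronous operation inside an async handler delays every other request handled by the same event loop, no matter how cheap they are.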
I would expect the request to be allotted to a single thread, and for that thread to take a long time to return a response. Why, instead, is it seemingly blocking all threads across all instances (there were 2 running at the time)? The backend is built in Python + FastAPI.
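If the explanation is that synchronous work inside an `async def` route blocks the shared event loop, then we assume the fix would look something like the sketch below (again pure asyncio with a placeholder standing in for our scraping/spaCy/LLM pipeline): push the blocking work onto a worker thread with `asyncio.to_thread`, or equivalently declare the FastAPI route with plain `def` so it runs in the framework's threadpool. Is that the right direction?

```python
import asyncio
import time

def blocking_pipeline():
    # Placeholder for the scraping + spaCy + LLM steps described above.
    time.sleep(0.2)
    return "pipeline result"

async def heavy_route():
    # asyncio.to_thread runs the blocking work in a worker thread,
    # so the event loop stays free to serve other requests.
    return await asyncio.to_thread(blocking_pipeline)

async def light_route(start):
    return time.perf_counter() - start

async def main():
    start = time.perf_counter()
    _, light_delay = await asyncio.gather(heavy_route(), light_route(start))
    return light_delay

light_delay = asyncio.run(main())
print(f"light request waited {light_delay:.4f}s")  # near-zero: loop not blocked
```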