TL;DR: A service running in Cloud Run is abruptly killed, and I can't find any evidence as to why.
Disclosure: I’m still familiarizing myself with GCP, so I might be missing something obvious.
I have a Cloud Run service running a scrapyd image. Scrapyd is a daemon for deploying and running Scrapy spiders.
When running, the instances mount a GCS bucket for persistent storage. (Not sure if relevant.)
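For context, the mount uses Cloud Run's Cloud Storage volume support, roughly along these lines (a sketch only; the volume name, bucket, and mount path are placeholders, not my exact config):

```shell
# Sketch: attach a GCS bucket as a volume on the Cloud Run service.
# "scrapyd-data", "MY_BUCKET", and "/var/lib/scrapyd" are placeholder names.
gcloud run services update scrapyd \
  --add-volume name=scrapyd-data,type=cloud-storage,bucket=MY_BUCKET \
  --add-volume-mount volume=scrapyd-data,mount-path=/var/lib/scrapyd
```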
Everything works, including project deployment. The problem occurs while a Scrapy job is running: sometimes the running job gets terminated, without any explanation or apparent reason. For example, here are the last three test jobs I executed, all with the same project and the same spider:
- The first was terminated after 15 min 1 s.
- The second ran for over 16 HOURS before I terminated it manually.
- The third was terminated after 1 h 58 min 3 s.
None of the hardware limits were hit; here are the metrics for the third job (between 1 pm and 3 pm):
In the GCP logs there is nothing that indicates why the process was terminated: no warnings or errors, and the logs don't even acknowledge that the job was killed. The scrapyd process itself is unaffected and continues to run. The job log ends abruptly like this (last line of the log):
2024-07-26 17:59:20 [scrapy.extensions.logstats] INFO: Crawled 409 pages (at 4 pages/min), scraped 3783 items (at 4 items/min)
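For the record, the kind of Logs Explorer filter I have been checking looks roughly like this (the service name is a placeholder):

```
resource.type="cloud_run_revision"
resource.labels.service_name="SERVICE_NAME"
severity>=WARNING
```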
If you are not familiar with Scrapy:
- When a job ends, it logs the job stats, including whether it received SIGTERM.
- If Scrapy receives SIGTERM more than once it forces an unclean shutdown, but logs it first. E.g.:
2024-07-26 14:54:22 [scrapy.crawler] INFO: Received SIGTERM twice, forcing unclean shutdown
(from the second example job)
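One way to verify that no catchable signal ever reaches the container would be a tiny audit handler in the container's entrypoint, something like this sketch (the `signal-audit` logger name is illustrative; note that Scrapy installs its own SIGTERM handler inside the crawler process, so this only covers the entrypoint process):

```python
import logging
import os
import signal

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("signal-audit")

seen = []  # record of signals that reached this process

def audit(signum, frame):
    # Fires for catchable termination signals. SIGKILL can never be caught,
    # so if the process dies without this ever logging, it was hard-killed.
    name = signal.Signals(signum).name
    seen.append(name)
    log.warning("Received %s (pid %d)", name, os.getpid())

# Register the audit handler for the common catchable termination signals.
for sig in (signal.SIGTERM, signal.SIGINT):
    signal.signal(sig, audit)
```

If the job dies and nothing like this appears in the logs, that would support the SIGKILL theory.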
Relevant notes:
- The job termination is so abrupt that I can only assume the process received a SIGKILL. I don't know why, though.
- I couldn’t reproduce this problem outside Cloud Run, which leads me to believe that this is somehow related to how Cloud Run operates.
- If you look at the "Container Instance Count" in the image, a second container is spun up and down three times during the job. I don't know why, or whether it is related at all. A second container would only include the scrapyd service and, AFAIK, wouldn't affect the job in any way, as it is not distributed.
My questions are:
- What could be the cause?
- Where else can I look for insight, beyond the metrics and the Logs Explorer?