I am hosting a computer vision API on AWS EC2. Every month my API returns a few 502/504/521 errors, and after some investigation it turns out they are caused by strange Spot interruption behavior.
My API runs on a mix of On-Demand and Spot EC2 instances to reduce costs (I'm not using a Spot Fleet). I implemented a Lambda function to gracefully handle Spot interruption notices. It works like a charm, except when I receive 2 interruption notices (for 2 different instances) within a short time span.
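A minimal sketch of this kind of handler, assuming it is triggered by an EventBridge rule matching the "EC2 Spot Instance Interruption Warning" event. `drain_instance` is a hypothetical placeholder for the real shutdown logic, and the deadline is computed as notice time + 2 minutes, which is the assumption in question here. The sample instance ID and timestamps used with it are illustrative.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Tuple

# Assumption: the Spot warning is treated as a fixed two-minute window,
# so the expected termination is the notice time plus two minutes.
NOTICE_WINDOW = timedelta(minutes=2)


def parse_interruption_event(event: dict) -> Tuple[str, datetime]:
    """Extract the instance ID and notice timestamp from an EventBridge
    'EC2 Spot Instance Interruption Warning' event."""
    if event.get("detail-type") != "EC2 Spot Instance Interruption Warning":
        raise ValueError(f"unexpected event type: {event.get('detail-type')!r}")
    instance_id = event["detail"]["instance-id"]
    # EventBridge event timestamps look like "2021-02-12T10:15:30Z" (UTC).
    notice_time = datetime.strptime(event["time"], "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    return instance_id, notice_time


def drain_instance(instance_id: str) -> None:
    """Hypothetical placeholder: deregister the instance from the load
    balancer and stop the API process. Replace with real shutdown logic."""
    print(f"draining {instance_id}")


def lambda_handler(event, context):
    instance_id, notice_time = parse_interruption_event(event)
    expected_termination = notice_time + NOTICE_WINDOW
    # Log the expected deadline so it can later be compared with the
    # instance's actual termination time.
    print(json.dumps({
        "instance_id": instance_id,
        "notice_time": notice_time.isoformat(),
        "expected_termination": expected_termination.isoformat(),
    }))
    drain_instance(instance_id)
    return {
        "instance_id": instance_id,
        "expected_termination": expected_termination.isoformat(),
    }
```

Logging the expected deadline per instance makes it easy to check, after an incident, how much time each instance actually had.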
When there is a single interruption notice, I have 2 minutes to handle the instance termination and gracefully shut down the API on the instance. When I get 2 interruption notices in a row, however, the first one works as intended, but the second instance is shut down 2 minutes after the FIRST interruption notice, instead of 2 minutes after its own (second) notice, as I would expect.
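One way to see the real per-instance deadline, rather than assuming 2 minutes from the notice, is to read the scheduled action time from the instance metadata path `spot/instance-action`, which the EC2 documentation describes as returning JSON such as `{"action": "terminate", "time": "2017-09-18T08:22:00Z"}` (and HTTP 404 when nothing is scheduled). A sketch, assuming the shutdown script runs on the instance itself and uses IMDSv2; `choose_shutdown_mode` and its 60-second budget are hypothetical.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone
from typing import Optional

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout: float = 1.0) -> Optional[str]:
    """Read spot/instance-action via IMDSv2 (only works on an EC2 instance).
    Returns None when no action is scheduled (HTTP 404)."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token", method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    token = urllib.request.urlopen(token_req, timeout=timeout).read().decode()
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        return urllib.request.urlopen(req, timeout=timeout).read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def seconds_until_action(instance_action_json: str,
                         now: Optional[datetime] = None) -> float:
    """Seconds remaining before the scheduled Spot action, based on the
    'time' field of the instance-action document."""
    payload = json.loads(instance_action_json)
    action_time = datetime.strptime(payload["time"], "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    if now is None:
        now = datetime.now(timezone.utc)
    return (action_time - now).total_seconds()


def choose_shutdown_mode(remaining_s: float, graceful_budget_s: float = 60.0) -> str:
    """Hypothetical policy: fall back to a fast shutdown when the remaining
    window is shorter than a graceful drain normally needs."""
    return "graceful" if remaining_s >= graceful_budget_s else "fast"
```

Comparing the `time` value on the second instance with the time its EventBridge warning arrived would show directly whether it was scheduled relative to the first notice rather than its own.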
This is a major issue because in some cases the second instance is shut down only 10 seconds after its own interruption warning, which is not enough time to shut down the API gracefully.
I could not find anything related to this issue, and it seems really weird. Is it a known issue, or am I missing something key?