I’m authoring a set of ECS tasks with CloudFormation, using the EC2 launch type. They work for a while and then start to fail.
The EC2 instances have a role allowing access to SSM, ECS, ECR, KMS keys, etc. Most of the time I’m able to start the services normally: the tasks start and then fail for some reason (that’s the part I’m still working on with the devs). The important part is that the EC2 instance is able to pull and run the images.
However, I’ve noticed that after a couple of tasks fail, I start getting “unable to pull image” errors. Inspecting the Docker logs for the ECS agent reveals that it’s getting errors pulling secrets from Parameter Store (it reports invalid parameters, although I’ve confirmed they do exist) and that it cannot pull the containers, failing with a “CannotPullContainer” error. Once the EC2 instance gets into this state, it won’t run anything until I start a new instance.
If I terminate the ECS instance and let the autoscaler launch a new one, I can launch containers again (with no changes to the CloudFormation template or the images themselves), until several fail and then I’m back in the same position.
I’ve tried various IAM policies without effect. But when it’s working, it works, so I don’t believe it’s directly an IAM policy issue.
It almost feels like the instance is losing its IAM role or something and isn’t able to access the ECR registry or Parameter Store. Does anyone have any idea of things I could try to troubleshoot this? The Docker logs don’t expose much beyond the fact that one minute it works and the next it doesn’t.
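One check I could run on an instance once it is in the broken state, to test the “losing its IAM role” theory, is to ask the instance metadata service (IMDSv2) whether it still hands out role credentials and whether they have expired. This is only a rough sketch, not something I’ve confirmed against this failure: the function names are my own, and it assumes the script runs directly on the EC2 host with IMDS reachable.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Link-local address of the EC2 instance metadata service.
IMDS = "http://169.254.169.254/latest"


def _imds_get(path, token, timeout=2):
    """GET a metadata path using an IMDSv2 session token."""
    req = urllib.request.Request(
        f"{IMDS}/{path}", headers={"X-aws-ec2-metadata-token": token}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode()


def fetch_role_credentials(timeout=2):
    """Return (role_name, credentials_document) for the instance role.

    urlopen raises HTTPError (404) if no role is attached to the instance.
    """
    req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        token = resp.read().decode()
    listing = _imds_get("meta-data/iam/security-credentials/", token, timeout)
    role = listing.strip().splitlines()[0]
    doc = _imds_get(f"meta-data/iam/security-credentials/{role}", token, timeout)
    return role, json.loads(doc)


def credentials_status(doc, now=None):
    """Classify a credentials document as 'ok', 'expired', or a bad code."""
    now = now or datetime.now(timezone.utc)
    if doc.get("Code") != "Success":
        return f"bad code: {doc.get('Code')}"
    expires = datetime.strptime(
        doc["Expiration"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return "expired" if expires <= now else "ok"


def check_instance_role():
    """Hypothetical entry point: print the role and its credential status."""
    role, doc = fetch_role_credentials()
    print(role, credentials_status(doc), doc.get("Expiration"))
```

Calling `check_instance_role()` on a healthy instance and again on a failing one should show whether the role credentials are the thing that changes. If both report `ok`, the role itself is probably still attached and the cause lies elsewhere.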